* [Intel-gfx] ✗ Fi.CI.BUILD: failure for GuC submission / DRM scheduler integration plan + new uAPI
  2021-05-06 17:30 ` [Intel-gfx] " Matthew Brost
@ 2021-05-06 17:27 ` Patchwork
  -1 siblings, 0 replies; 41+ messages in thread
From: Patchwork @ 2021-05-06 17:27 UTC (permalink / raw)
  To: Matthew Brost; +Cc: intel-gfx

== Series Details ==

Series: GuC submission / DRM scheduler integration plan + new uAPI
URL   : https://patchwork.freedesktop.org/series/89840/
State : failure

== Summary ==

Applying: drm/doc/rfc: i915 GuC submission / DRM scheduler integration plan
Using index info to reconstruct a base tree...
M	Documentation/gpu/rfc/index.rst
Falling back to patching base and 3-way merge...
Auto-merging Documentation/gpu/rfc/index.rst
CONFLICT (content): Merge conflict in Documentation/gpu/rfc/index.rst
error: Failed to merge in the changes.
hint: Use 'git am --show-current-patch=diff' to see the failed patch
Patch failed at 0001 drm/doc/rfc: i915 GuC submission / DRM scheduler integration plan
When you have resolved this problem, run "git am --continue".
If you prefer to skip this patch, run "git am --skip" instead.
To restore the original branch and stop patching, run "git am --abort".


* [RFC PATCH 0/5] GuC submission / DRM scheduler integration plan + new uAPI
@ 2021-05-06 17:30 ` Matthew Brost
  0 siblings, 0 replies; 41+ messages in thread
From: Matthew Brost @ 2021-05-06 17:30 UTC (permalink / raw)
  To: intel-gfx, dri-devel
  Cc: matthew.brost, tony.ye, tvrtko.ursulin, daniele.ceraolospurio,
	carl.zhang, jason.ekstrand, jon.bloomfield, daniel.vetter,
	john.c.harrison

Subject and patches say it all.

Initial post of GuC submission patches, detailed in patch 1, coming
shortly.

Signed-off-by: Matthew Brost <matthew.brost@intel.com>

Matthew Brost (5):
  drm/doc/rfc: i915 GuC submission / DRM scheduler integration plan
  drm/doc/rfc: i915 new parallel submission uAPI plan
  drm/i915: Expose logical engine instance to user
  drm/i915: Introduce 'set parallel submit' extension
  drm/i915: Update execbuf IOCTL to accept N BBs

 Documentation/gpu/rfc/i915_scheduler.rst | 122 ++++++++++++++++++
 Documentation/gpu/rfc/index.rst          |   4 +
 include/uapi/drm/i915_drm.h              | 154 ++++++++++++++++++++++-
 3 files changed, 278 insertions(+), 2 deletions(-)
 create mode 100644 Documentation/gpu/rfc/i915_scheduler.rst

-- 
2.28.0


* [RFC PATCH 1/5] drm/doc/rfc: i915 GuC submission / DRM scheduler integration plan
  2021-05-06 17:30 ` [Intel-gfx] " Matthew Brost
@ 2021-05-06 17:30   ` Matthew Brost
  -1 siblings, 0 replies; 41+ messages in thread
From: Matthew Brost @ 2021-05-06 17:30 UTC (permalink / raw)
  To: intel-gfx, dri-devel
  Cc: matthew.brost, tony.ye, tvrtko.ursulin, daniele.ceraolospurio,
	carl.zhang, jason.ekstrand, jon.bloomfield, daniel.vetter,
	john.c.harrison

Add entry for i915 GuC submission / DRM scheduler integration plan.
Follow up patch with details of new parallel submission uAPI to come.

Cc: Jon Bloomfield <jon.bloomfield@intel.com>
Cc: Jason Ekstrand <jason@jlekstrand.net>
Cc: Dave Airlie <airlied@gmail.com>
Cc: Daniel Vetter <daniel.vetter@intel.com>
Cc: Jason Ekstrand <jason@jlekstrand.net>
Cc: dri-devel@lists.freedesktop.org
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 Documentation/gpu/rfc/i915_scheduler.rst | 70 ++++++++++++++++++++++++
 Documentation/gpu/rfc/index.rst          |  4 ++
 2 files changed, 74 insertions(+)
 create mode 100644 Documentation/gpu/rfc/i915_scheduler.rst

diff --git a/Documentation/gpu/rfc/i915_scheduler.rst b/Documentation/gpu/rfc/i915_scheduler.rst
new file mode 100644
index 000000000000..fa6780a11c86
--- /dev/null
+++ b/Documentation/gpu/rfc/i915_scheduler.rst
@@ -0,0 +1,70 @@
+=========================================
+I915 GuC Submission/DRM Scheduler Section
+=========================================
+
+Upstream plan
+=============
+For upstream the overall plan for landing GuC submission and integrating the
+i915 with the DRM scheduler is:
+
+* Merge basic GuC submission
+	* Basic submission support for all gen11+ platforms
+	* Not enabled by default on any current platforms but can be enabled via
+	  modparam enable_guc
+	* Lots of rework will need to be done to integrate with DRM scheduler so
+	  no need to nit pick everything in the code, it just should be
+	  functional and not regress execlists
+	* Update IGTs / selftests as needed to work with GuC submission
+	* Enable CI on supported platforms for a baseline
+	* Rework / get CI healthy for GuC submission in place as needed
+* Merge new parallel submission uAPI
+	* Bonding uAPI completely incompatible with GuC submission
+	* New uAPI adds I915_CONTEXT_ENGINES_EXT_PARALLEL context setup step
+	  which configures contexts N wide
+	* After I915_CONTEXT_ENGINES_EXT_PARALLEL a user can submit N batches to
+	  a context in a single execbuf IOCTL and the batches run on the GPU in
+	  parallel
+	* Initially only for GuC submission but execlists can be supported if
+	  needed
+* Convert the i915 to use the DRM scheduler
+	* GuC submission backend fully integrated with DRM scheduler
+		* All request queues removed from backend (e.g. all backpressure
+		  handled in DRM scheduler)
+		* Resets / cancels hook in DRM scheduler
+		* Watchdog hooks into DRM scheduler
+		* Lots of complexity of the GuC backend can be pulled out once
+		  integrated with DRM scheduler (e.g. state machine gets
+		  simpler, locking gets simpler, etc...)
+	* Execlist backend will do the minimum required to hook in the DRM
+	  scheduler so it can live next to the fully integrated GuC backend
+		* Legacy interface
+		* Features like timeslicing / preemption / virtual engines would
+		  be difficult to integrate with the DRM scheduler and these
+		  features are not required for GuC submission as the GuC does
+		  these things for us
+		* ROI low on fully integrating into DRM scheduler
+		* Fully integrating would add lots of complexity to DRM
+		  scheduler
+	* Port i915 priority inheritance / boosting feature in DRM scheduler
+	* Remove in-order completion assumptions from DRM scheduler
+	* Pull out i915 priority levels and use DRM priority levels
+	* Optimize DRM scheduler as needed
+
+New uAPI for basic GuC submission
+=================================
+No major changes are required to the uAPI for basic GuC submission. The only
+change is a new scheduler attribute: I915_SCHEDULER_CAP_STATIC_PRIORITY_MAP.
+This attribute indicates the 2k i915 user priority levels are statically mapped
+into 3 levels as follows:
+
+* -1k to -1 Low priority
+* 0 Medium priority
+* 1 to 1k High priority
+
+This is needed because the GuC only has 4 priority bands. The highest priority
+band is reserved for the kernel. This aligns with the DRM scheduler priority
+levels too.
+
+New parallel submission uAPI
+============================
+Details to come in a following patch.
diff --git a/Documentation/gpu/rfc/index.rst b/Documentation/gpu/rfc/index.rst
index a8621f7dab8b..018a8bf317a6 100644
--- a/Documentation/gpu/rfc/index.rst
+++ b/Documentation/gpu/rfc/index.rst
@@ -15,3 +15,7 @@ host such documentation:
 
 * Once the code has landed move all the documentation to the right places in
   the main core, helper or driver sections.
+
+.. toctree::
+
+    i915_scheduler.rst
-- 
2.28.0
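
To make the static priority mapping described in the patch above concrete, here
is a minimal userspace sketch. It only uses existing uAPI (GETPARAM and
GEM_CONTEXT_SETPARAM); I915_SCHEDULER_CAP_STATIC_PRIORITY_MAP itself is what
this series proposes, so it is defined locally under an assumed bit value in
case the installed i915_drm.h does not carry it yet. libdrm is assumed for
drmOpen() and error handling is minimal.

#include <stdio.h>
#include <sys/ioctl.h>
#include <xf86drm.h>
#include <drm/i915_drm.h>

/* Proposed by this series; bit value is an assumption if the header lacks it. */
#ifndef I915_SCHEDULER_CAP_STATIC_PRIORITY_MAP
#define I915_SCHEDULER_CAP_STATIC_PRIORITY_MAP (1 << 5)
#endif

/* Read the scheduler capability bitmask exposed via GETPARAM. */
static int scheduler_caps(int fd)
{
	int caps = 0;
	struct drm_i915_getparam gp = {
		.param = I915_PARAM_HAS_SCHEDULER,
		.value = &caps,
	};

	if (ioctl(fd, DRM_IOCTL_I915_GETPARAM, &gp))
		return 0;
	return caps;
}

/* Set the priority of a GEM context (ctx_id 0 selects the default context). */
static int set_context_priority(int fd, unsigned int ctx_id, int prio)
{
	struct drm_i915_gem_context_param p = {
		.ctx_id = ctx_id,
		.param = I915_CONTEXT_PARAM_PRIORITY,
		.value = prio,
	};

	return ioctl(fd, DRM_IOCTL_I915_GEM_CONTEXT_SETPARAM, &p);
}

int main(void)
{
	int fd = drmOpen("i915", NULL);

	if (fd < 0)
		return 1;

	if (scheduler_caps(fd) & I915_SCHEDULER_CAP_STATIC_PRIORITY_MAP)
		printf("static map: only low (<0), medium (0), high (>0) differ\n");

	/* Ask for "high"; under the static map any value in 1..1023 is equal. */
	return set_context_priority(fd, 0, 1) ? 1 : 0;
}

Under the proposed map a UMD does not need to pick exact levels: anything below
0 behaves as low, 0 as medium, and anything above 0 as high.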


* [RFC PATCH 2/5] drm/doc/rfc: i915 new parallel submission uAPI plan
  2021-05-06 17:30 ` [Intel-gfx] " Matthew Brost
@ 2021-05-06 17:30   ` Matthew Brost
  -1 siblings, 0 replies; 41+ messages in thread
From: Matthew Brost @ 2021-05-06 17:30 UTC (permalink / raw)
  To: intel-gfx, dri-devel
  Cc: matthew.brost, tony.ye, tvrtko.ursulin, daniele.ceraolospurio,
	carl.zhang, jason.ekstrand, jon.bloomfield, daniel.vetter,
	john.c.harrison

Add entry for i915 new parallel submission uAPI plan.

Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Cc: Tony Ye <tony.ye@intel.com>
CC: Carl Zhang <carl.zhang@intel.com>
Cc: Daniel Vetter <daniel.vetter@intel.com>
Cc: Jason Ekstrand <jason@jlekstrand.net>
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 Documentation/gpu/rfc/i915_scheduler.rst | 56 +++++++++++++++++++++++-
 1 file changed, 54 insertions(+), 2 deletions(-)

diff --git a/Documentation/gpu/rfc/i915_scheduler.rst b/Documentation/gpu/rfc/i915_scheduler.rst
index fa6780a11c86..e3455b33edfe 100644
--- a/Documentation/gpu/rfc/i915_scheduler.rst
+++ b/Documentation/gpu/rfc/i915_scheduler.rst
@@ -13,7 +13,8 @@ i915 with the DRM scheduler is:
 	  modparam enable_guc
 	* Lots of rework will need to be done to integrate with DRM scheduler so
 	  no need to nit pick everything in the code, it just should be
-	  functional and not regress execlists
+	  functional, no major coding style / layering errors, and not regress
+	  execlists
 	* Update IGTs / selftests as needed to work with GuC submission
 	* Enable CI on supported platforms for a baseline
 	* Rework / get CI healthy for GuC submission in place as needed
@@ -67,4 +68,55 @@ levels too.
 
 New parallel submission uAPI
 ============================
-Details to come in a following patch.
+The existing bonding uAPI is completely broken with GuC submission because
+whether a submission is a single context or a parallel submit isn't known
+until execbuf time, when it is activated via the I915_SUBMIT_FENCE. To submit
+multiple contexts in parallel with the GuC, the context must be explicitly
+registered with N contexts and all N contexts must be submitted in a single
+command to the GuC. This interface doesn't support dynamically changing
+between N contexts as the bonding uAPI does, hence the need for a new parallel
+submission interface. The legacy bonding uAPI is also confusing and unintuitive.
+
+The new parallel submission uAPI consists of 3 parts:
+
+* Export engines logical mapping
+* A 'set_parallel' extension to configure contexts for parallel
+  submission
+* Extend execbuf2 IOCTL to support submitting N BBs in a single IOCTL
+
+Export engines logical mapping
+------------------------------
+Certain use cases require BBs to be placed on engine instances in logical order
+(e.g. split-frame on gen11+). The logical mapping of engine instances can change
+based on fusing. Rather than making UMDs aware of fusing, simply expose the
+logical mapping with the existing query engine info IOCTL. Also, the GuC
+submission interface currently only supports submitting multiple contexts to
+engines in logical order.
+
+A single bit will be added to drm_i915_engine_info.flags indicating that the
+logical instance has been returned and a new field,
+drm_i915_engine_info.logical_instance, returns the logical instance.
+
+A 'set_parallel' extension to configure contexts for parallel submission
+------------------------------------------------------------------------
+The 'set_parallel' extension configures N contexts for parallel submission. It
+is a setup step that should be called before using any of the contexts. See
+I915_CONTEXT_ENGINES_EXT_LOAD_BALANCE or I915_CONTEXT_ENGINES_EXT_BOND for
+similar existing examples. Once the N contexts are configured for parallel
+submission the execbuf2 IOCTL can be called, submitting 1-N BBs in a single
+IOCTL. Although submitting fewer than N BBs is allowed, it is not recommended
+as that will likely leave parts of the hardware reserved and idle. Initially
+only GuC submission is supported; execlist support can be added later if needed.
+
+Add I915_CONTEXT_ENGINES_EXT_PARALLEL_SUBMIT and
+i915_context_engines_parallel_submit to the uAPI to implement this extension.
+
+Extend execbuf2 IOCTL to support submitting N BBs in a single IOCTL
+-------------------------------------------------------------------
+Contexts that have been configured with the 'set_parallel' extension are allowed
+to submit 1-N BBs in a single execbuf2 IOCTL. The BBs are either the last N
+objects in the drm_i915_gem_exec_object2 list or the first N if
+I915_EXEC_BATCH_FIRST is set.
+
+Add a 6-bit-wide field to drm_i915_gem_execbuffer2.flags which indicates the
+number of BBs - 1 included in the IOCTL.
-- 
2.28.0
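
The "Export engines logical mapping" piece described above can be exercised
with the existing DRM_IOCTL_I915_QUERY / DRM_I915_QUERY_ENGINE_INFO two-pass
pattern. The sketch below assumes libdrm for drmOpen() and an i915_drm.h that
already contains the I915_ENGINE_INFO_HAS_LOGICAL_INSTANCE flag and the
logical_instance field proposed in patch 3 of this series; on a kernel without
the change the flag is simply never set.

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <xf86drm.h>
#include <drm/i915_drm.h>

int main(void)
{
	struct drm_i915_query_item item = {
		.query_id = DRM_I915_QUERY_ENGINE_INFO,
	};
	struct drm_i915_query query = {
		.num_items = 1,
		.items_ptr = (uint64_t)(uintptr_t)&item,
	};
	struct drm_i915_query_engine_info *info;
	int fd = drmOpen("i915", NULL);
	unsigned int i;

	/* First pass with length == 0 asks the kernel how much space is needed. */
	if (fd < 0 || ioctl(fd, DRM_IOCTL_I915_QUERY, &query) || item.length <= 0)
		return 1;

	info = calloc(1, item.length);
	if (!info)
		return 1;
	item.data_ptr = (uint64_t)(uintptr_t)info;
	if (ioctl(fd, DRM_IOCTL_I915_QUERY, &query))
		return 1;

	for (i = 0; i < info->num_engines; i++) {
		const struct drm_i915_engine_info *e = &info->engines[i];

		printf("class %u instance %u", e->engine.engine_class,
		       e->engine.engine_instance);
		/* Flag and field proposed by patch 3 of this series. */
		if (e->flags & I915_ENGINE_INFO_HAS_LOGICAL_INSTANCE)
			printf(" logical %u", e->logical_instance);
		printf("\n");
	}

	free(info);
	return 0;
}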


* [RFC PATCH 3/5] drm/i915: Expose logical engine instance to user
  2021-05-06 17:30 ` [Intel-gfx] " Matthew Brost
@ 2021-05-06 17:30   ` Matthew Brost
  -1 siblings, 0 replies; 41+ messages in thread
From: Matthew Brost @ 2021-05-06 17:30 UTC (permalink / raw)
  To: intel-gfx, dri-devel
  Cc: matthew.brost, tony.ye, tvrtko.ursulin, daniele.ceraolospurio,
	carl.zhang, jason.ekstrand, jon.bloomfield, daniel.vetter,
	john.c.harrison

Expose logical engine instance to user via query engine info IOCTL. This
is required for split-frame workloads as these need to be placed on
engines in a logically contiguous order. The logical mapping can change
based on fusing. Rather than requiring the user to have knowledge of the
fusing, simply expose the logical mapping with the existing query engine
info IOCTL.

Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Cc: Tony Ye <tony.ye@intel.com>
CC: Carl Zhang <carl.zhang@intel.com>
Cc: Daniel Vetter <daniel.vetter@intel.com>
Cc: Jason Ekstrand <jason@jlekstrand.net>
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 include/uapi/drm/i915_drm.h | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/include/uapi/drm/i915_drm.h b/include/uapi/drm/i915_drm.h
index 9f331ad629f5..26d2e135aa31 100644
--- a/include/uapi/drm/i915_drm.h
+++ b/include/uapi/drm/i915_drm.h
@@ -2396,14 +2396,19 @@ struct drm_i915_engine_info {
 
 	/** @flags: Engine flags. */
 	__u64 flags;
+#define I915_ENGINE_INFO_HAS_LOGICAL_INSTANCE		(1 << 0)
 
 	/** @capabilities: Capabilities of this engine. */
 	__u64 capabilities;
 #define I915_VIDEO_CLASS_CAPABILITY_HEVC		(1 << 0)
 #define I915_VIDEO_AND_ENHANCE_CLASS_CAPABILITY_SFC	(1 << 1)
 
+	/** @logical_instance: Logical engine instance. */
+	__u16 logical_instance;
+
 	/** @rsvd1: Reserved fields. */
-	__u64 rsvd1[4];
+	__u16 rsvd1[3];
+	__u64 rsvd2[3];
 };
 
 /**
-- 
2.28.0
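
A point worth checking for the uAPI change above is that carving the new __u16
logical_instance (plus the narrower rsvd1 and the new rsvd2) out of the old
__u64 rsvd1[4] space leaves the size of struct drm_i915_engine_info and the
offsets of the existing fields untouched. Below is a small compile-time sketch
of that check; proposed_engine_info is a local mirror of the layout proposed in
the patch above, used purely for the asserts.

#include <assert.h>
#include <stddef.h>
#include <drm/i915_drm.h>	/* struct i915_engine_class_instance, __u16/__u64 */

/* Local mirror of the layout proposed by the patch above, for checking only. */
struct proposed_engine_info {
	struct i915_engine_class_instance engine;
	__u32 rsvd0;
	__u64 flags;
	__u64 capabilities;
	__u16 logical_instance;		/* new field */
	__u16 rsvd1[3];			/* was __u64 rsvd1[4] */
	__u64 rsvd2[3];
};

/* The old layout reserved 4 * 8 bytes after capabilities; the new fields must
 * occupy exactly the same 32 bytes so the overall size and the offsets of the
 * pre-existing fields do not change. */
static_assert(sizeof(struct proposed_engine_info) ==
	      sizeof(struct drm_i915_engine_info),
	      "overall size must not change");
static_assert(offsetof(struct proposed_engine_info, capabilities) ==
	      offsetof(struct drm_i915_engine_info, capabilities),
	      "existing field offsets must not change");
static_assert(offsetof(struct proposed_engine_info, logical_instance) == 24,
	      "new field must start where rsvd1[4] began");

int main(void) { return 0; }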


* [RFC PATCH 4/5] drm/i915: Introduce 'set parallel submit' extension
  2021-05-06 17:30 ` [Intel-gfx] " Matthew Brost
@ 2021-05-06 17:30   ` Matthew Brost
  -1 siblings, 0 replies; 41+ messages in thread
From: Matthew Brost @ 2021-05-06 17:30 UTC (permalink / raw)
  To: intel-gfx, dri-devel
  Cc: matthew.brost, tony.ye, tvrtko.ursulin, daniele.ceraolospurio,
	carl.zhang, jason.ekstrand, jon.bloomfield, daniel.vetter,
	john.c.harrison

i915_drm.h updates for 'set parallel submit' extension.

Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Cc: Tony Ye <tony.ye@intel.com>
CC: Carl Zhang <carl.zhang@intel.com>
Cc: Daniel Vetter <daniel.vetter@intel.com>
Cc: Jason Ekstrand <jason@jlekstrand.net>
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 include/uapi/drm/i915_drm.h | 126 ++++++++++++++++++++++++++++++++++++
 1 file changed, 126 insertions(+)

diff --git a/include/uapi/drm/i915_drm.h b/include/uapi/drm/i915_drm.h
index 26d2e135aa31..0175b12b33b8 100644
--- a/include/uapi/drm/i915_drm.h
+++ b/include/uapi/drm/i915_drm.h
@@ -1712,6 +1712,7 @@ struct drm_i915_gem_context_param {
  * Extensions:
  *   i915_context_engines_load_balance (I915_CONTEXT_ENGINES_EXT_LOAD_BALANCE)
  *   i915_context_engines_bond (I915_CONTEXT_ENGINES_EXT_BOND)
+ *   i915_context_engines_parallel_submit (I915_CONTEXT_ENGINES_EXT_PARALLEL_SUBMIT)
  */
 #define I915_CONTEXT_PARAM_ENGINES	0xa
 
@@ -1894,9 +1895,134 @@ struct i915_context_param_engines {
 	__u64 extensions; /* linked chain of extension blocks, 0 terminates */
 #define I915_CONTEXT_ENGINES_EXT_LOAD_BALANCE 0 /* see i915_context_engines_load_balance */
 #define I915_CONTEXT_ENGINES_EXT_BOND 1 /* see i915_context_engines_bond */
+#define I915_CONTEXT_ENGINES_EXT_PARALLEL_SUBMIT 2 /* see i915_context_engines_parallel_submit */
 	struct i915_engine_class_instance engines[0];
 } __attribute__((packed));
 
+/*
+ * i915_context_engines_parallel_submit:
+ *
+ * Set up a gem context to allow multiple BBs to be submitted in a single execbuf
+ * IOCTL. Those BBs will then be scheduled to run on the GPU in parallel.
+ *
+ * All hardware contexts in the engine set are configured for parallel
+ * submission (i.e. once this gem context is configured for parallel submission,
+ * all the hardware contexts, regardless of whether a BB is available on each
+ * individual context, will be submitted to the GPU in parallel). A user can
+ * submit BBs to a subset of the hardware contexts, in a single execbuf IOCTL,
+ * but this is not recommended as it may reserve physical engines with nothing
+ * to run on them. It is highly recommended to configure the gem context with N
+ * hardware contexts and then always submit N BBs in a single IOCTL.
+ *
+ * There are two currently defined ways to control the placement of the
+ * hardware contexts on physical engines: default behavior (no flags) and
+ * I915_PARALLEL_IMPLICT_BONDS (a flag). More flags may be added in the
+ * future as new hardware / use cases arise. Details of how to use this
+ * interface are documented above each flag below.
+ *
+ * Returns -EINVAL if the hardware context placement configuration is invalid
+ * or if the placement configuration isn't supported on the platform /
+ * submission interface.
+ * Returns -ENODEV if the extension isn't supported on the platform /
+ * submission interface.
+ */
+struct i915_context_engines_parallel_submit {
+	struct i915_user_extension base;
+
+/*
+ * Default placement behavior (currently unsupported):
+ *
+ * Rather than restricting parallel submission to a single class with a
+ * logically contiguous placement (I915_PARALLEL_IMPLICT_BONDS), add a mode that
+ * enables parallel submission across multiple engine classes. In this case each
+ * context's logical engine mask indicates where that context can be placed. It
+ * is implied in this mode that all contexts have mutually exclusive placement
+ * (e.g. if one context is running on CS0, no other context can run on CS0).
+ *
+ * Example 1 pseudo code:
+ * CSX[Y] = engine class X, logical instance Y
+ * INVALID = I915_ENGINE_CLASS_INVALID, I915_ENGINE_CLASS_INVALID_NONE
+ * set_engines(INVALID, INVALID)
+ * set_load_balance(engine_index=0, num_siblings=2, engines=CS0[0],CS0[1])
+ * set_load_balance(engine_index=1, num_siblings=2, engines=CS1[0],CS1[1])
+ * set_parallel()
+ *
+ * Results in the following valid placements:
+ * CS0[0], CS1[0]
+ * CS0[0], CS1[1]
+ * CS0[1], CS1[0]
+ * CS0[1], CS1[1]
+ *
+ * Example 2 pseudo code:
+ * CS[X] = generic engine of same class, logical instance X
+ * INVALID = I915_ENGINE_CLASS_INVALID, I915_ENGINE_CLASS_INVALID_NONE
+ * set_engines(INVALID, INVALID)
+ * set_load_balance(engine_index=0, num_siblings=3, engines=CS[0],CS[1],CS[2])
+ * set_load_balance(engine_index=1, num_siblings=3, engines=CS[0],CS[1],CS[2])
+ * set_parallel()
+ *
+ * Results in the following valid placements:
+ * CS[0], CS[1]
+ * CS[0], CS[2]
+ * CS[1], CS[0]
+ * CS[1], CS[2]
+ * CS[2], CS[0]
+ * CS[2], CS[1]
+ *
+ * This enables a use case where all engines are created equal: we don't care
+ * where they are scheduled, we just want a certain number of resources and for
+ * those resources to be scheduled in parallel, possibly across multiple
+ * engine classes.
+ */
+
+/*
+ * I915_PARALLEL_IMPLICT_BONDS - Create implicit bonds between each context.
+ * Each context must have the same number of siblings, and bonds are implicitly
+ * created between the siblings.
+ *
+ * All of the below examples are in logical space.
+ *
+ * Example 1 pseudo code:
+ * CS[X] = generic engine of same class, logical instance X
+ * set_engines(CS[0], CS[1])
+ * set_parallel(flags=I915_PARALLEL_IMPLICT_BONDS)
+ *
+ * Results in the following valid placements:
+ * CS[0], CS[1]
+ *
+ * Example 2 pseudo code:
+ * CS[X] = generic engine of same class, logical instance X
+ * INVALID = I915_ENGINE_CLASS_INVALID, I915_ENGINE_CLASS_INVALID_NONE
+ * set_engines(INVALID, INVALID)
+ * set_load_balance(engine_index=0, num_siblings=2, engines=CS[0],CS[2])
+ * set_load_balance(engine_index=1, num_siblings=2, engines=CS[1],CS[3])
+ * set_parallel(flags=I915_PARALLEL_IMPLICT_BONDS)
+ *
+ * Results in the following valid placements:
+ * CS[0], CS[1]
+ * CS[2], CS[3]
+ *
+ * This enables a use case where all engines are not equal and certain placement
+ * rules are required (e.g. split-frame requires all contexts to be placed in a
+ * logically contiguous order on the VCS engines on gen11+ platforms). This use
+ * case (logically contiguous placement, within a single engine class) is
+ * supported when using GuC submission. Execlist mode could support all possible
+ * bonding configurations but currently doesn't support this extension.
+ */
+#define I915_PARALLEL_IMPLICT_BONDS		(1<<0)
+/*
+ * Do not allow BBs to be preempted mid-BB; rather, insert coordinated
+ * preemption points on all hardware contexts between each set of BBs. An
+ * example use case of this feature is split-frame on gen11+ hardware. When
+ * using this feature a BB must be submitted on each hardware context in the
+ * parallel gem context; the execbuf2 IOCTL enforces this policy.
+ */
+#define I915_PARALLEL_NO_PREEMPT_MID_BATCH	(1<<1)
+#define I915_PARALLEL_UNKNOWN_FLAGS  (-(I915_PARALLEL_NO_PREEMPT_MID_BATCH << 1))
+	__u64 flags; /* all undefined flags must be zero */
+	__u64 mbz64[4]; /* reserved for future use; must be zero */
+} __attribute__ ((packed));
+
 #define I915_DEFINE_CONTEXT_PARAM_ENGINES(name__, N__) struct { \
 	__u64 extensions; \
 	struct i915_engine_class_instance engines[N__]; \
-- 
2.28.0
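
As an illustration of how a UMD might use the extension above, the sketch below
configures a gem context with two VCS hardware contexts for implicitly bonded
parallel submission (the split-frame style case). The rfc_engines_parallel_submit
struct and the RFC_* values are local mirrors of what the patch above proposes,
since no released header carries them and the interface may still change; the
engines set-param, context create and i915_user_extension pieces are existing
uAPI. The execbuf side of this flow is sketched after patch 5.

#include <stdint.h>
#include <sys/ioctl.h>
#include <xf86drm.h>
#include <drm/i915_drm.h>

/* Local mirror of the extension proposed above; not in any released header. */
struct rfc_engines_parallel_submit {
	struct i915_user_extension base;
	__u64 flags;
	__u64 mbz64[4];
} __attribute__((packed));

#define RFC_CONTEXT_ENGINES_EXT_PARALLEL_SUBMIT	2	/* per the patch above */
#define RFC_PARALLEL_IMPLICT_BONDS		(1 << 0)	/* per the patch above */

/* Configure ctx_id with two VCS engines, implicitly bonded, submitted N-wide. */
static int setup_parallel_vcs(int fd, uint32_t ctx_id)
{
	struct rfc_engines_parallel_submit parallel = {
		.base.name = RFC_CONTEXT_ENGINES_EXT_PARALLEL_SUBMIT,
		.flags = RFC_PARALLEL_IMPLICT_BONDS,
	};
	I915_DEFINE_CONTEXT_PARAM_ENGINES(engines, 2) = {
		.extensions = (uint64_t)(uintptr_t)&parallel,
		.engines = {
			{ .engine_class = I915_ENGINE_CLASS_VIDEO, .engine_instance = 0 },
			{ .engine_class = I915_ENGINE_CLASS_VIDEO, .engine_instance = 1 },
		},
	};
	struct drm_i915_gem_context_param param = {
		.ctx_id = ctx_id,
		.param = I915_CONTEXT_PARAM_ENGINES,
		.size = sizeof(engines),
		.value = (uint64_t)(uintptr_t)&engines,
	};

	return ioctl(fd, DRM_IOCTL_I915_GEM_CONTEXT_SETPARAM, &param);
}

int main(void)
{
	struct drm_i915_gem_context_create_ext create = {};
	int fd = drmOpen("i915", NULL);

	if (fd < 0 || ioctl(fd, DRM_IOCTL_I915_GEM_CONTEXT_CREATE_EXT, &create))
		return 1;
	/* Expected to fail with -ENODEV / -EINVAL on a kernel without this RFC. */
	return setup_parallel_vcs(fd, create.ctx_id) ? 1 : 0;
}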


* [RFC PATCH 5/5] drm/i915: Update execbuf IOCTL to accept N BBs
  2021-05-06 17:30 ` [Intel-gfx] " Matthew Brost
@ 2021-05-06 17:30   ` Matthew Brost
  -1 siblings, 0 replies; 41+ messages in thread
From: Matthew Brost @ 2021-05-06 17:30 UTC (permalink / raw)
  To: intel-gfx, dri-devel
  Cc: matthew.brost, tony.ye, tvrtko.ursulin, daniele.ceraolospurio,
	carl.zhang, jason.ekstrand, jon.bloomfield, daniel.vetter,
	john.c.harrison

Add I915_EXEC_NUMBER_BB_* to drm_i915_gem_execbuffer2.flags which allows
submitting N BBs per IOCTL.

Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Cc: Tony Ye <tony.ye@intel.com>
CC: Carl Zhang <carl.zhang@intel.com>
Cc: Daniel Vetter <daniel.vetter@intel.com>
Cc: Jason Ekstrand <jason@jlekstrand.net>
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 include/uapi/drm/i915_drm.h | 21 ++++++++++++++++++++-
 1 file changed, 20 insertions(+), 1 deletion(-)

diff --git a/include/uapi/drm/i915_drm.h b/include/uapi/drm/i915_drm.h
index 0175b12b33b8..d3072cad4a7e 100644
--- a/include/uapi/drm/i915_drm.h
+++ b/include/uapi/drm/i915_drm.h
@@ -1291,7 +1291,26 @@ struct drm_i915_gem_execbuffer2 {
  */
 #define I915_EXEC_USE_EXTENSIONS	(1 << 21)
 
-#define __I915_EXEC_UNKNOWN_FLAGS (-(I915_EXEC_USE_EXTENSIONS << 1))
+/*
+ * Number of BBs in the execbuf2 IOCTL minus 1; used to submit more than one BB
+ * in a single execbuf2 IOCTL.
+ *
+ * Returns -EINVAL if more than one BB (a non-zero field value) is specified
+ * when I915_CONTEXT_ENGINES_EXT_PARALLEL_SUBMIT hasn't been configured on the
+ * gem context first. Also returns -EINVAL if the gem context has been set up
+ * with I915_PARALLEL_NO_PREEMPT_MID_BATCH and the number of BBs is not equal
+ * to the total number of hardware contexts in the gem context.
+ */
+#define I915_EXEC_NUMBER_BB_LSB		(22)
+#define I915_EXEC_NUMBER_BB_MASK	(0x3f << I915_EXEC_NUMBER_BB_LSB)
+#define I915_EXEC_NUMBER_BB_MSB		(27)
+#define i915_execbuffer2_set_number_bb(eb2, num_bb) \
+	(eb2).flags = ((eb2).flags & ~I915_EXEC_NUMBER_BB_MASK) | \
+	(((num_bb - 1) << I915_EXEC_NUMBER_BB_LSB) & I915_EXEC_NUMBER_BB_MASK)
+#define i915_execbuffer2_get_number_bb(eb2) \
+	((((eb2).flags & I915_EXEC_NUMBER_BB_MASK) >> I915_EXEC_NUMBER_BB_LSB) + 1)
+
+#define __I915_EXEC_UNKNOWN_FLAGS (-(1 << (I915_EXEC_NUMBER_BB_MSB + 1)))
 
 #define I915_EXEC_CONTEXT_ID_MASK	(0xffffffff)
 #define i915_execbuffer2_set_context_id(eb2, context) \
-- 
2.28.0
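
To complete the flow started after patch 4, here is a sketch of submitting N
BBs in one execbuf2 call on a context configured with the 'set_parallel'
extension. The RFC_EXEC_NUMBER_BB_* values mirror the bit layout proposed above
(bits 22-27 of drm_i915_gem_execbuffer2.flags) and are assumptions until the
interface lands; creating the batch buffers and waiting for completion are out
of scope, so this is a helper fragment rather than a complete program.

#include <stdint.h>
#include <sys/ioctl.h>
#include <drm/i915_drm.h>

/* Bit layout proposed by the patch above; not in any released header. */
#define RFC_EXEC_NUMBER_BB_LSB	22
#define RFC_EXEC_NUMBER_BB_MASK	(0x3f << RFC_EXEC_NUMBER_BB_LSB)

/*
 * Submit num_bb batch buffers in one execbuf2 call on a context that was
 * configured with the 'set_parallel' extension. Without I915_EXEC_BATCH_FIRST
 * the BBs are the last num_bb objects; here every object is a BB.
 */
static int execbuf_parallel(int fd, uint32_t ctx_id,
			    const uint32_t *bb_handles, unsigned int num_bb)
{
	struct drm_i915_gem_exec_object2 obj[8] = {};
	struct drm_i915_gem_execbuffer2 execbuf = {};
	unsigned int i;

	if (num_bb < 1 || num_bb > 8)
		return -1;
	for (i = 0; i < num_bb; i++)
		obj[i].handle = bb_handles[i];

	execbuf.buffers_ptr = (uint64_t)(uintptr_t)obj;
	execbuf.buffer_count = num_bb;
	/* Engine-map slot 0 plus "number of BBs - 1" in the proposed field. */
	execbuf.flags = ((num_bb - 1) << RFC_EXEC_NUMBER_BB_LSB) &
			RFC_EXEC_NUMBER_BB_MASK;
	i915_execbuffer2_set_context_id(execbuf, ctx_id);

	return ioctl(fd, DRM_IOCTL_I915_GEM_EXECBUFFER2, &execbuf);
}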


* Re: [Intel-gfx] [RFC PATCH 1/5] drm/doc/rfc: i915 GuC submission / DRM scheduler integration plan
  2021-05-06 17:30   ` [Intel-gfx] " Matthew Brost
@ 2021-05-11 14:34     ` Daniel Vetter
  -1 siblings, 0 replies; 41+ messages in thread
From: Daniel Vetter @ 2021-05-11 14:34 UTC (permalink / raw)
  To: Matthew Brost
  Cc: jason.ekstrand, daniel.vetter, intel-gfx, dri-devel, carl.zhang

On Thu, May 06, 2021 at 10:30:45AM -0700, Matthew Brost wrote:
> Add entry for i915 GuC submission / DRM scheduler integration plan.
> Follow up patch with details of new parallel submission uAPI to come.
> 
> Cc: Jon Bloomfield <jon.bloomfield@intel.com>
> Cc: Jason Ekstrand <jason@jlekstrand.net>
> Cc: Dave Airlie <airlied@gmail.com>
> Cc: Daniel Vetter <daniel.vetter@intel.com>
> Cc: Jason Ekstrand <jason@jlekstrand.net>
> Cc: dri-devel@lists.freedesktop.org
> Signed-off-by: Matthew Brost <matthew.brost@intel.com>

Would be good to Cc: some drm/scheduler folks here for the next round:

$ scripts/get_maintainer.pl -f -- drivers/gpu/drm/scheduler/

says we have maybe the following missing:

"Christian König" <christian.koenig@amd.com>
Luben Tuikov <luben.tuikov@amd.com>
Alex Deucher <alexander.deucher@amd.com>
Steven Price <steven.price@arm.com>

Lee Jones did a ton of warning fixes over the entire tree, so doesn't care
about drm/scheduler design directly.

> ---
>  Documentation/gpu/rfc/i915_scheduler.rst | 70 ++++++++++++++++++++++++
>  Documentation/gpu/rfc/index.rst          |  4 ++
>  2 files changed, 74 insertions(+)
>  create mode 100644 Documentation/gpu/rfc/i915_scheduler.rst
> 
> diff --git a/Documentation/gpu/rfc/i915_scheduler.rst b/Documentation/gpu/rfc/i915_scheduler.rst
> new file mode 100644
> index 000000000000..fa6780a11c86
> --- /dev/null
> +++ b/Documentation/gpu/rfc/i915_scheduler.rst
> @@ -0,0 +1,70 @@
> +=========================================
> +I915 GuC Submission/DRM Scheduler Section
> +=========================================
> +
> +Upstream plan
> +=============
> +For upstream the overall plan for landing GuC submission and integrating the
> +i915 with the DRM scheduler is:
> +
> +* Merge basic GuC submission
> +	* Basic submission support for all gen11+ platforms
> +	* Not enabled by default on any current platforms but can be enabled via
> +	  modparam enable_guc
> +	* Lots of rework will need to be done to integrate with DRM scheduler so
> +	  no need to nit pick everything in the code, it just should be
> +	  functional and not regress execlists
> +	* Update IGTs / selftests as needed to work with GuC submission
> +	* Enable CI on supported platforms for a baseline
> +	* Rework / get CI healthy for GuC submission in place as needed
> +* Merge new parallel submission uAPI
> +	* Bonding uAPI completely incompatible with GuC submission

Maybe clarify that this isn't the only issue with the bonding uapi, so
perhaps add "Plus it has severe design issues in general, which is why we
want to retire it no matter what". Or something like that. Not sure we
should go into full details here, maybe as part of the next patch about
parallel submit and all that.

> +	* New uAPI adds I915_CONTEXT_ENGINES_EXT_PARALLEL context setup step
> +	  which configures contexts N wide
> +	* After I915_CONTEXT_ENGINES_EXT_PARALLEL a user can submit N batches to
> +	  a context in a single execbuf IOCTL and the batches run on the GPU in
> +	  parallel
> +	* Initially only for GuC submission but execlists can be supported if
> +	  needed
> +* Convert the i915 to use the DRM scheduler
> +	* GuC submission backend fully integrated with DRM scheduler
> +		* All request queues removed from backend (e.g. all backpressure
> +		  handled in DRM scheduler)
> +		* Resets / cancels hook in DRM scheduler
> +		* Watchdog hooks into DRM scheduler
> +		* Lots of complexity of the GuC backend can be pulled out once
> +		  integrated with DRM scheduler (e.g. state machine gets
> +		  simpler, locking gets simpler, etc...)
> +	* Execlist backend will do the minimum required to hook in the DRM
> +	  scheduler so it can live next to the fully integrated GuC backend
> +		* Legacy interface
> +		* Features like timeslicing / preemption / virtual engines would
> +		  be difficult to integrate with the DRM scheduler and these
> +		  features are not required for GuC submission as the GuC does
> +		  these things for us
> +		* ROI low on fully integrating into DRM scheduler
> +		* Fully integrating would add lots of complexity to DRM
> +		  scheduler
> +	* Port i915 priority inheritance / boosting feature in DRM scheduler

Maybe a few words on what this does and why we care? Just so drm/scheduler
people know what's coming.

> +	* Remove in-order completion assumptions from DRM scheduler

I think it'd be good to put a few words here why we need this. We want to
use drm scheduler for dependencies, but rely on the hw/fw scheduler (or
well backend for execlist) to handle preemption, round-robin and that kind
of stuff. Hence we want to have all runnable requests in the backend
(excluding backpressure and stuff like that), and they can complete
out-of-order.

Maybe also highlight this one in the commit message to get drm/scheduler
folks' attention on this and the previous one for discussion.

> +	* Pull out i915 priority levels and use DRM priority levels
> +	* Optimize DRM scheduler as needed

Again if we have some items here (one that was discussed was direct
submit/retire for lower latency iirc?) might be good to list this here.

> +
> +New uAPI for basic GuC submission
> +=================================
> +No major changes are required to the uAPI for basic GuC submission. The only
> +change is a new scheduler attribute: I915_SCHEDULER_CAP_STATIC_PRIORITY_MAP.
> +This attribute indicates the 2k i915 user priority levels are statically mapped
> +into 3 levels as follows:
> +
> +* -1k to -1 Low priority
> +* 0 Medium priority
> +* 1 to 1k High priority
> +
> +This is needed because the GuC only has 4 priority bands. The highest priority
> +band is reserved for the kernel. This aligns with the DRM scheduler priority
> +levels too.

Please Cc: mesa and get an ack from Jason Ekstrand or Ken Graunke on this,
just to be sure.

> +
> +New parallel submission uAPI
> +============================
> +Details to come in a following patch.
> diff --git a/Documentation/gpu/rfc/index.rst b/Documentation/gpu/rfc/index.rst
> index a8621f7dab8b..018a8bf317a6 100644
> --- a/Documentation/gpu/rfc/index.rst
> +++ b/Documentation/gpu/rfc/index.rst
> @@ -15,3 +15,7 @@ host such documentation:
>  
>  * Once the code has landed move all the documentation to the right places in
>    the main core, helper or driver sections.
> +
> +.. toctree::
> +
> +    i915_scheduler.rst

Aside from the comments I think this is what we're aiming for wrt rfc
patches, so lgtm overall.

Cheers, Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

> +
> +* -1k to -1 Low priority
> +* 0 Medium priority
> +* 1 to 1k High priority
> +
> +This is needed because the GuC only has 4 priority bands. The highest priority
> +band is reserved with the kernel. This aligns with the DRM scheduler priority
> +levels too.

Please Cc: mesa and get an ack from Jason Ekstrand or Ken Graunke on this,
just to be sure.

> +
> +New parallel submission uAPI
> +============================
> +Details to come in a following patch.
> diff --git a/Documentation/gpu/rfc/index.rst b/Documentation/gpu/rfc/index.rst
> index a8621f7dab8b..018a8bf317a6 100644
> --- a/Documentation/gpu/rfc/index.rst
> +++ b/Documentation/gpu/rfc/index.rst
> @@ -15,3 +15,7 @@ host such documentation:
>  
>  * Once the code has landed move all the documentation to the right places in
>    the main core, helper or driver sections.
> +
> +.. toctree::
> +
> +    i915_scheduler.rst

Aside from the comments I think this is what we're aiming for wrt rfc
patches, so lgtm overall.

Cheers, Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [Intel-gfx] [RFC PATCH 2/5] drm/doc/rfc: i915 new parallel submission uAPI plan
  2021-05-06 17:30   ` [Intel-gfx] " Matthew Brost
@ 2021-05-11 14:49     ` Daniel Vetter
  -1 siblings, 0 replies; 41+ messages in thread
From: Daniel Vetter @ 2021-05-11 14:49 UTC (permalink / raw)
  To: Matthew Brost
  Cc: jason.ekstrand, daniel.vetter, intel-gfx, dri-devel, carl.zhang

On Thu, May 06, 2021 at 10:30:46AM -0700, Matthew Brost wrote:
> Add entry for i915 new parallel submission uAPI plan.
> 
> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> Cc: Tony Ye <tony.ye@intel.com>
> CC: Carl Zhang <carl.zhang@intel.com>
> Cc: Daniel Vetter <daniel.vetter@intel.com>
> Cc: Jason Ekstrand <jason@jlekstrand.net>
> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> ---
>  Documentation/gpu/rfc/i915_scheduler.rst | 56 +++++++++++++++++++++++-
>  1 file changed, 54 insertions(+), 2 deletions(-)
> 
> diff --git a/Documentation/gpu/rfc/i915_scheduler.rst b/Documentation/gpu/rfc/i915_scheduler.rst
> index fa6780a11c86..e3455b33edfe 100644
> --- a/Documentation/gpu/rfc/i915_scheduler.rst
> +++ b/Documentation/gpu/rfc/i915_scheduler.rst
> @@ -13,7 +13,8 @@ i915 with the DRM scheduler is:
>  	  modparam enable_guc
>  	* Lots of rework will need to be done to integrate with DRM scheduler so
>  	  no need to nit pick everything in the code, it just should be
> -	  functional and not regress execlists
> +	  functional, no major coding style / layering errors, and not regress
> +	  execlists

I guess this hunk should be in the previous patch?

>  	* Update IGTs / selftests as needed to work with GuC submission
>  	* Enable CI on supported platforms for a baseline
>  	* Rework / get CI healthy for GuC submission in place as needed
> @@ -67,4 +68,55 @@ levels too.
>  
>  New parallel submission uAPI
>  ============================
> -Details to come in a following patch.
> +The existing bonding uAPI is completely broken with GuC submission because
> +whether a submission is a single context submit or parallel submit isn't known
> +until execbuf time, activated via the I915_SUBMIT_FENCE. To submit multiple
> +contexts in parallel with the GuC, the context must be explicitly registered with
> +N contexts and all N contexts must be submitted in a single command to the GuC.
> +This interface doesn't support dynamically changing between N contexts as the
> +bonding uAPI does. Hence the need for a new parallel submission interface. Also
> +the legacy bonding uAPI is quite confusing and not intuitive at all.

I think you should sit together with Jason on irc or so for a bit and get
an earful of how it's all broken irrespective of GuC submission or not.
Just to hammer in our case :-)
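
For reference, a rough sketch of the execbuf-time coupling the paragraph above
is talking about: with the legacy bonded uAPI the two batches are only tied
together at submission time via a submit fence (I915_EXEC_FENCE_SUBMIT), which
is exactly the dynamic behaviour the GuC interface cannot provide. The engine
slot numbers are illustrative, and both execbuffer2 structs are assumed to have
their object lists and context id already filled in:

#include <sys/ioctl.h>
#include <drm/i915_drm.h>

static int submit_legacy_bonded_pair(int fd,
				     struct drm_i915_gem_execbuffer2 *master,
				     struct drm_i915_gem_execbuffer2 *bonded)
{
	int ret;

	/* master BB on engine slot 0 (the default), request an out fence */
	master->flags |= I915_EXEC_FENCE_OUT;
	ret = ioctl(fd, DRM_IOCTL_I915_GEM_EXECBUFFER2_WR, master);
	if (ret)
		return ret;

	/* the out fence fd comes back in the upper 32 bits of rsvd2 */
	bonded->rsvd2 = master->rsvd2 >> 32;
	/* bonded BB on engine slot 1, coupled only here, at execbuf time */
	bonded->flags |= 1 | I915_EXEC_FENCE_SUBMIT;
	return ioctl(fd, DRM_IOCTL_I915_GEM_EXECBUFFER2, bonded);
}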

> +
> +The new parallel submission uAPI consists of 3 parts:
> +
> +* Export engines logical mapping
> +* A 'set_parallel' extension to configure contexts for parallel
> +  submission
> +* Extend execbuf2 IOCTL to support submitting N BBs in a single IOCTL
> +
> +Export engines logical mapping
> +------------------------------
> +Certain use cases require BBs to be placed on engine instances in logical order
> +(e.g. split-frame on gen11+). The logical mapping of engine instances can change
> +based on fusing. Rather than making UMDs be aware of fusing, simply expose the
> +logical mapping with the existing query engine info IOCTL. Also the GuC
> +submission interface currently only supports submitting multiple contexts to
> +engines in logical order.

Maybe highlight more that this is a new restriction with GuC compared to
execlist, which is why we need to expose this information to userspace.
Also on the platforms thus far supported in upstream there's at most 2
engines of the same type, so really not an issue.
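
To make the above concrete, a sketch of how a UMD could read the logical
mapping back: the query ioctl itself is existing uAPI, while the
I915_ENGINE_INFO_HAS_LOGICAL_INSTANCE flag and the logical_instance field are
the additions proposed in patch 3 of this series. Error handling is trimmed:

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <drm/i915_drm.h>

static struct drm_i915_query_engine_info *query_engines(int fd)
{
	struct drm_i915_query_item item = {
		.query_id = DRM_I915_QUERY_ENGINE_INFO,
	};
	struct drm_i915_query q = {
		.num_items = 1,
		.items_ptr = (uintptr_t)&item,
	};
	struct drm_i915_query_engine_info *info;

	/* first pass only fills in the required buffer length */
	if (ioctl(fd, DRM_IOCTL_I915_QUERY, &q) || item.length <= 0)
		return NULL;

	info = calloc(1, item.length);
	item.data_ptr = (uintptr_t)info;
	if (ioctl(fd, DRM_IOCTL_I915_QUERY, &q)) {
		free(info);
		return NULL;
	}
	return info;
}

static void print_logical_map(const struct drm_i915_query_engine_info *info)
{
	for (unsigned int i = 0; i < info->num_engines; i++) {
		const struct drm_i915_engine_info *e = &info->engines[i];

		if (e->flags & I915_ENGINE_INFO_HAS_LOGICAL_INSTANCE)
			printf("class %u: physical %u -> logical %u\n",
			       (unsigned)e->engine.engine_class,
			       (unsigned)e->engine.engine_instance,
			       (unsigned)e->logical_instance);
	}
}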

> +
> +A single bit will be added to drm_i915_engine_info.flags indicating that the
> +logical instance has been returned and a new field,
> +drm_i915_engine_info.logical_instance, returns the logical instance.
> +
> +A 'set_parallel' extension to configure contexts for parallel submission
> +------------------------------------------------------------------------
> +The 'set_parallel' extension configures N contexts for parallel submission. It
> +is a setup step that should be called before using any of the contexts. See
> +I915_CONTEXT_ENGINES_EXT_LOAD_BALANCE or I915_CONTEXT_ENGINES_EXT_BOND for
> +similar existing examples. Once the N contexts are configured for parallel
> +submission the execbuf2 IOCTL can be called to submit 1-N BBs in a single IOCTL.
> +Although submitting fewer than N BBs is allowed, it is not recommended as that
> +will likely leave parts of the hardware reserved and idle. Initially only GuC
> +submission is supported. Execlist support can be added later if needed.

Can we just require that you always submit N batchbuffers, or does this
create a problem for userspace? Allowing things just because is generally
not a good idea with uapi, it's better to limit and then allow when
there's a need.

Ofc if we already have a need then explain why and that's all fine.

Also detailed comments on the kerneldoc I'll do in the next patches.

> +
> +Add I915_CONTEXT_ENGINES_EXT_PARALLEL_SUBMIT and
> +i915_context_engines_parallel_submit to the uAPI to implement this extension.
> +
> +Extend execbuf2 IOCTL to support submitting N BBs in a single IOCTL
> +-------------------------------------------------------------------
> +Contexts that have been configured with the 'set_parallel' extension are allowed
> +to submit 1-N BBs in a single execbuf2 IOCTL. The BBs are either the last N
> +objects in the drm_i915_gem_exec_object2 list or the first N if
> +I915_EXEC_BATCH_FIRST is set.
> +
> +Add a 6 bit wide field to drm_i915_gem_exec_object2.flags which indicates
> +the number of BBs - 1 included in the IOCTL.

Hm we have the nice execbuf extension chaining, any reason for not using
that and instead opting for clever field packing?

Cheers, Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [RFC PATCH 3/5] drm/i915: Expose logical engine instance to user
  2021-05-06 17:30   ` [Intel-gfx] " Matthew Brost
@ 2021-05-11 14:53     ` Daniel Vetter
  -1 siblings, 0 replies; 41+ messages in thread
From: Daniel Vetter @ 2021-05-11 14:53 UTC (permalink / raw)
  To: Matthew Brost
  Cc: tony.ye, tvrtko.ursulin, intel-gfx, dri-devel, carl.zhang,
	jason.ekstrand, daniele.ceraolospurio, jon.bloomfield,
	daniel.vetter, john.c.harrison

On Thu, May 06, 2021 at 10:30:47AM -0700, Matthew Brost wrote:
> Expose logical engine instance to user via query engine info IOCTL. This
> is required for split-frame workloads as these need to be placed on
> engines in a logically contiguous order. The logical mapping can change
> based on fusing. Rather than having the user have knowledge of the fusing we
> simply expose the logical mapping with the existing query engine
> info IOCTL.
> 
> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> Cc: Tony Ye <tony.ye@intel.com>
> CC: Carl Zhang <carl.zhang@intel.com>
> Cc: Daniel Vetter <daniel.vetter@intel.com>
> Cc: Jason Ekstrand <jason@jlekstrand.net>
> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> ---
>  include/uapi/drm/i915_drm.h | 7 ++++++-

A few things on all these 3 patches:

- Until we've merged the uapi it shouldn't show up in uapi headers. See
  what Matt A. has done with a fake local header in Documentation/gpu/rfc
  which you can pull in.

- Since this one is tiny I think just the text in the rfc is good enough,
  I'd drop this.

- Squash the others in with the parallel submit rfc patch so that the
  structs and long-form text are all in one patch please, makes reviewing
  the overall thing a bit simpler. Rule is to have a complete change per
  patch, and then not split things further.

>  1 file changed, 6 insertions(+), 1 deletion(-)
> 
> diff --git a/include/uapi/drm/i915_drm.h b/include/uapi/drm/i915_drm.h
> index 9f331ad629f5..26d2e135aa31 100644
> --- a/include/uapi/drm/i915_drm.h
> +++ b/include/uapi/drm/i915_drm.h
> @@ -2396,14 +2396,19 @@ struct drm_i915_engine_info {
>  
>  	/** @flags: Engine flags. */
>  	__u64 flags;
> +#define I915_ENGINE_INFO_HAS_LOGICAL_INSTANCE		(1 << 0)
>  
>  	/** @capabilities: Capabilities of this engine. */
>  	__u64 capabilities;
>  #define I915_VIDEO_CLASS_CAPABILITY_HEVC		(1 << 0)
>  #define I915_VIDEO_AND_ENHANCE_CLASS_CAPABILITY_SFC	(1 << 1)
>  
> +	/** Logical engine instance */

I think in the final version that we merge with the uapi this should:
- explain why we need this
- link to relevant other uapi like the parallel submit extension (a strawman
  kerneldoc sketch follows below)

Cheers, Daniel
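
As a strawman for that final kerneldoc, purely to illustrate the two points
above (the wording is a placeholder, not agreed text):

	/**
	 * @logical_instance: logical instance of this engine
	 *
	 * Only valid when I915_ENGINE_INFO_HAS_LOGICAL_INSTANCE is set in
	 * @flags. Needed because parallel submission (see
	 * i915_context_engines_parallel_submit) places BBs on engines in
	 * logically contiguous order, and the logical order may differ from
	 * the physical instance numbering depending on fusing.
	 */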

> +	__u16 logical_instance;
> +
>  	/** @rsvd1: Reserved fields. */
> -	__u64 rsvd1[4];
> +	__u16 rsvd1[3];
> +	__u64 rsvd2[3];
>  };
>  
>  /**
> -- 
> 2.28.0
> 

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [Intel-gfx] [RFC PATCH 1/5] drm/doc/rfc: i915 GuC submission / DRM scheduler integration plan
  2021-05-11 14:34     ` Daniel Vetter
@ 2021-05-11 14:58       ` Daniel Stone
  -1 siblings, 0 replies; 41+ messages in thread
From: Daniel Stone @ 2021-05-11 14:58 UTC (permalink / raw)
  To: Daniel Vetter
  Cc: Matthew Brost, intel-gfx, dri-devel, carl.zhang, Jason Ekstrand,
	Vetter, Daniel

Hi,

On Tue, 11 May 2021 at 15:34, Daniel Vetter <daniel@ffwll.ch> wrote:
> On Thu, May 06, 2021 at 10:30:45AM -0700, Matthew Brost wrote:
> > +No major changes are required to the uAPI for basic GuC submission. The only
> > +change is a new scheduler attribute: I915_SCHEDULER_CAP_STATIC_PRIORITY_MAP.
> > +This attribute indicates the 2k i915 user priority levels are statically mapped
> > +into 3 levels as follows:
> > +
> > +* -1k to -1 Low priority
> > +* 0 Medium priority
> > +* 1 to 1k High priority
> > +
> > +This is needed because the GuC only has 4 priority bands. The highest priority
> > +band is reserved with the kernel. This aligns with the DRM scheduler priority
> > +levels too.
>
> Please Cc: mesa and get an ack from Jason Ekstrand or Ken Graunke on this,
> just to be sure.

A reference to the actual specs this targets would help. I don't have
oneAPI to hand if it's relevant, but the two in graphics world are
https://www.khronos.org/registry/EGL/extensions/IMG/EGL_IMG_context_priority.txt
and https://www.khronos.org/registry/vulkan/specs/1.2-extensions/html/chap5.html#devsandqueues-priority
- both of them pretty much say that the implementation may do anything
or nothing at all, so this isn't a problem for spec conformance, only
a matter of user priority (sorry).

Cheers,
Daniel

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [Intel-gfx] [RFC PATCH 4/5] drm/i915: Introduce 'set parallel submit' extension
  2021-05-06 17:30   ` [Intel-gfx] " Matthew Brost
@ 2021-05-11 15:11     ` Daniel Vetter
  -1 siblings, 0 replies; 41+ messages in thread
From: Daniel Vetter @ 2021-05-11 15:11 UTC (permalink / raw)
  To: Matthew Brost
  Cc: jason.ekstrand, daniel.vetter, intel-gfx, dri-devel, carl.zhang

On Thu, May 06, 2021 at 10:30:48AM -0700, Matthew Brost wrote:
> i915_drm.h updates for 'set parallel submit' extension.
> 
> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> Cc: Tony Ye <tony.ye@intel.com>
> CC: Carl Zhang <carl.zhang@intel.com>
> Cc: Daniel Vetter <daniel.vetter@intel.com>
> Cc: Jason Ekstrand <jason@jlekstrand.net>
> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> ---
>  include/uapi/drm/i915_drm.h | 126 ++++++++++++++++++++++++++++++++++++
>  1 file changed, 126 insertions(+)
> 
> diff --git a/include/uapi/drm/i915_drm.h b/include/uapi/drm/i915_drm.h
> index 26d2e135aa31..0175b12b33b8 100644
> --- a/include/uapi/drm/i915_drm.h
> +++ b/include/uapi/drm/i915_drm.h
> @@ -1712,6 +1712,7 @@ struct drm_i915_gem_context_param {
>   * Extensions:
>   *   i915_context_engines_load_balance (I915_CONTEXT_ENGINES_EXT_LOAD_BALANCE)
>   *   i915_context_engines_bond (I915_CONTEXT_ENGINES_EXT_BOND)
> + *   i915_context_engines_parallel_submit (I915_CONTEXT_ENGINES_EXT_PARALLEL_SUBMIT)

Hm just realized, but I don't think this hyperlinks correctly, and I'm
also not sure this formats very well as a nice list. Using item lists
should look pretty nice like we're doing for the various kms properties,
e.g.

FOO:
  Explain what FOO does

BAR:
  Explain what BAR does. struct bar also automatically generates a link

Please check with make htmldocs and polish this a bit (might need a small
prep patch).

>   */
>  #define I915_CONTEXT_PARAM_ENGINES	0xa
>  
> @@ -1894,9 +1895,134 @@ struct i915_context_param_engines {
>  	__u64 extensions; /* linked chain of extension blocks, 0 terminates */
>  #define I915_CONTEXT_ENGINES_EXT_LOAD_BALANCE 0 /* see i915_context_engines_load_balance */
>  #define I915_CONTEXT_ENGINES_EXT_BOND 1 /* see i915_context_engines_bond */
> +#define I915_CONTEXT_ENGINES_EXT_PARALLEL_SUBMIT 2 /* see i915_context_engines_parallel_submit */
>  	struct i915_engine_class_instance engines[0];
>  } __attribute__((packed));
>  
> +/*
> + * i915_context_engines_parallel_submit:
> + *
> + * Set up a gem context to allow multiple BBs to be submitted in a single execbuf
> + * IOCTL. Those BBs will then be scheduled to run on the GPU in parallel.
> + *
> + * All hardware contexts in the engine set are configured for parallel
> + * submission (i.e. once this gem context is configured for parallel submission,
> + * all the hardware contexts, regardless of whether a BB is available on each
> + * individual context, will be submitted to the GPU in parallel). A user can
> + * submit BBs to a subset of the hardware contexts in a single execbuf IOCTL,
> + * but this is not recommended as it may reserve physical engines with nothing
> + * to run on them. It is highly recommended to configure the gem context with N
> + * hardware contexts and then always submit N BBs in a single IOCTL.
> + *
> + * There are two currently defined ways to control the placement of the
> + * hardware contexts on physical engines: default behavior (no flags) and
> + * I915_PARALLEL_IMPLICT_BONDS (a flag). More flags may be added in the
> + * future as new hardware / use cases arise. Details of how to use this
> + * interface are described below, above each flag.
> + *
> + * Returns -EINVAL if the hardware context placement configuration is invalid
> + * or if the placement configuration isn't supported on the platform /
> + * submission interface.
> + * Returns -ENODEV if the extension isn't supported on the platform / submission
> + * interface.
> + */
> +struct i915_context_engines_parallel_submit {
> +	struct i915_user_extension base;

Ok this is good, since it makes sure we can't possibly use this in
CTX_SETPARAM.
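
To make the whole flow visible in one place, here is a minimal sketch of
creating a 2-wide parallel context with this extension. The parallel-submit
struct, the extension name and the two flags are the proposal from this patch
(some of them quoted further below); everything else is existing uAPI, and the
VCS instances are illustrative:

#include <stdint.h>
#include <sys/ioctl.h>
#include <drm/i915_drm.h>

static int create_parallel_ctx(int fd, __u32 *ctx_id)
{
	I915_DEFINE_CONTEXT_PARAM_ENGINES(engines, 2) = {
		.engines = {
			{ .engine_class = I915_ENGINE_CLASS_VIDEO, .engine_instance = 0 },
			{ .engine_class = I915_ENGINE_CLASS_VIDEO, .engine_instance = 1 },
		},
	};
	/* proposed in this patch: implicit bonds, split-frame style */
	struct i915_context_engines_parallel_submit parallel = {
		.base.name = I915_CONTEXT_ENGINES_EXT_PARALLEL_SUBMIT,
		.flags = I915_PARALLEL_IMPLICT_BONDS |
			 I915_PARALLEL_NO_PREEMPT_MID_BATCH,
	};
	struct drm_i915_gem_context_create_ext_setparam p_engines = {
		.base = { .name = I915_CONTEXT_CREATE_EXT_SETPARAM },
		.param = {
			.param = I915_CONTEXT_PARAM_ENGINES,
			.value = (uintptr_t)&engines,
			.size = sizeof(engines),
		},
	};
	struct drm_i915_gem_context_create_ext create = {
		.flags = I915_CONTEXT_CREATE_FLAGS_USE_EXTENSIONS,
		.extensions = (uintptr_t)&p_engines,
	};
	int ret;

	/* chain the parallel extension off the engine map */
	engines.extensions = (uintptr_t)&parallel;

	ret = ioctl(fd, DRM_IOCTL_I915_GEM_CONTEXT_CREATE_EXT, &create);
	if (!ret)
		*ctx_id = create.ctx_id;
	return ret;
}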

> +
> +/*
> + * Default placement behavior (currently unsupported):
> + *
> + * Rather than restricting parallel submission to a single class with a
> + * logically contiguous placement (I915_PARALLEL_IMPLICT_BONDS), add a mode that
> + * enables parallel submission across multiple engine classes. In this case each
> + * context's logical engine mask indicates where that context can be placed. It
> + * is implied in this mode that all contexts have mutually exclusive placement
> + * (e.g. if one context is running on CS0 no other contexts can run on CS0).
> + *
> + * Example 1 pseudo code:
> + * CSX[Y] = engine class X, logical instance Y
> + * INVALID = I915_ENGINE_CLASS_INVALID, I915_ENGINE_CLASS_INVALID_NONE
> + * set_engines(INVALID, INVALID)
> + * set_load_balance(engine_index=0, num_siblings=2, engines=CS0[0],CS0[1])
> + * set_load_balance(engine_index=1, num_siblings=2, engines=CS1[0],CS1[1])
> + * set_parallel()
> + *
> + * Results in the following valid placements:
> + * CS0[0], CS1[0]
> + * CS0[0], CS1[1]
> + * CS0[1], CS1[0]
> + * CS0[1], CS1[1]
> + *
> + * Example 2 pseudo code:
> + * CS[X] = generic engine of same class, logical instance X
> + * INVALID = I915_ENGINE_CLASS_INVALID, I915_ENGINE_CLASS_INVALID_NONE
> + * set_engines(INVALID, INVALID)
> + * set_load_balance(engine_index=0, num_siblings=3, engines=CS[0],CS[1],CS[2])
> + * set_load_balance(engine_index=1, num_siblings=3, engines=CS[0],CS[1],CS[2])
> + * set_parallel()
> + *
> + * Results in the following valid placements:
> + * CS[0], CS[1]
> + * CS[0], CS[2]
> + * CS[1], CS[0]
> + * CS[1], CS[2]
> + * CS[2], CS[0]
> + * CS[2], CS[1]
> + *
> + * This enables a use case where all engines are created equal: we don't care
> + * where they are scheduled, we just want a certain number of resources and for
> + * those resources to be scheduled in parallel, possibly across multiple
> + * engine classes.
> + */
> +
> +/*
> + * I915_PARALLEL_IMPLICT_BONDS - Create implicit bonds between each context.
> + * Each context must have the same number of siblings and bonds are implicitly
> + * created between the siblings.
> + *
> + * All of the below examples are in logical space.
> + *
> + * Example 1 pseudo code:
> + * CS[X] = generic engine of same class, logical instance X
> + * set_engines(CS[0], CS[1])
> + * set_parallel(flags=I915_PARALLEL_IMPLICT_BONDS)
> + *
> + * Results in the following valid placements:
> + * CS[0], CS[1]
> + *
> + * Example 2 pseudo code:
> + * CS[X] = generic engine of same class, logical instance X
> + * INVALID = I915_ENGINE_CLASS_INVALID, I915_ENGINE_CLASS_INVALID_NONE
> + * set_engines(INVALID, INVALID)
> + * set_load_balance(engine_index=0, num_siblings=2, engines=CS[0],CS[2])
> + * set_load_balance(engine_index=1, num_siblings=2, engines=CS[1],CS[3])
> + * set_parallel(flags=I915_PARALLEL_IMPLICT_BONDS)
> + *
> + * Results in the following valid placements:
> + * CS[0], CS[1]
> + * CS[2], CS[3]
> + *
> + * This enables a use case where all engines are not equal and certain placement
> + * rules are required (i.e. split-frame requires all contexts to be placed in a
> + * logically contiguous order on the VCS engines on gen11+ platforms). This use
> + * case (logically contiguous placement, within a single engine class) is
> + * supported when using GuC submission. Execlist mode could support all possible
> + * bonding configurations but currently doesn't support this extension.
> + */
> +#define I915_PARALLEL_IMPLICT_BONDS		(1<<0)
> +/*
> + * Do not allow BBs to be preempted mid-BB; rather, insert coordinated
> + * preemption points on all hardware contexts between each set of BBs. An
> + * example use case of this feature is split-frame on gen11+ hardware. When
> + * using this feature a BB must be submitted on each hardware context in the
> + * parallel gem context. The execbuf2 IOCTL enforces that the user adheres to
> + * this policy.
> + */
> +#define I915_PARALLEL_NO_PREEMPT_MID_BATCH	(1<<1)
> +#define I915_PARALLEL_UNKNOWN_FLAGS  (-(I915_PARALLEL_NO_PREEMPT_MID_BATCH << 1))
> +	__u64 flags; /* all undefined flags must be zero */
> +	__u64 mbz64[4]; /* reserved for future use; must be zero */
> +} __attribute__ ((packed));

Ok I'm having some serious questions. This looks way too much like it's
inspired by bonded submission, and given we're tossing bonded submission
we need to make sure we're doing this for good independent reasons and not
just for inertia.

What I expected looking at how media-driver uses bonded submit currently
is:

- We create a parallel submit engine, which occupies a virtual engine
  slot. This parallel virtual engine contains all the information we need,
  i.e. the flags you have above, but also how many engines run in parallel
  and how each of those can be load-balanced. So probably a full NxM
  matrix of physical engines needed.

- Execbuf uses that parallel virtual engine to submit all N batchbuffers
  in one go.

- This means we don't create virtual engines (or physical engine mappings)
  for all the individual pieces in a parallel engine. That's a concept
  from bonded submission, and I think that needs to go.

- More important, not having a parallel virtual engine breaks our already
  badly confusing gem ctx api. Ignoring parallel/bonded submit the gem ctx
  is just a container object, which points at a bunch of engines (plus the
  VM and a few other things). Having parallel context be something that sits
  at the gem ctx level, and not as an individual engine (of which you can
  have multiple in the same gem ctx) breaks stuff. E.g. right now the perf api
  sits at the gem ctx level, so that you can capture all the perf data for
  an entire workload spanning across multiple engines. If a workload now
  needs multiple parallel engines we'd need multiple gem ctx, which breaks
  this.

So what I'd expect we'd have here is roughly:

struct i915_context_engines_parallel_submit {
	struct i915_user_extension base;
	__u64 flags;
	__u32 num_engines; /* N, must match what we submit in the execbuf */
	__u32 num_siblings; /* M, I'm assuming it's ok we require that siblings must match across the entire set of parallel engines */
	struct engine_info[]; /* NxM array of engine infos, pls fill in the right struct name :-) */
};
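
One possible concrete reading of that sketch, purely as an assumption to make
the NxM layout tangible; it reuses the existing i915_engine_class_instance as
the per-slot element and describes the alternative proposal above, not the
layout from the patch being reviewed:

struct i915_context_engines_parallel_submit {
	struct i915_user_extension base;
	__u64 flags;
	__u32 num_engines;	/* N: width of a single parallel submission */
	__u32 num_siblings;	/* M: placement options per slot */
	/* N * M entries, engines[n * num_siblings + m] */
	struct i915_engine_class_instance engines[];
};

/* 2-wide split-frame example, each slot placeable on VCS0 or VCS1 */
static const struct i915_engine_class_instance example_matrix[2 * 2] = {
	{ I915_ENGINE_CLASS_VIDEO, 0 }, { I915_ENGINE_CLASS_VIDEO, 1 },	/* slot 0 */
	{ I915_ENGINE_CLASS_VIDEO, 0 }, { I915_ENGINE_CLASS_VIDEO, 1 },	/* slot 1 */
};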

If we then also require that you always submit the full width of N
batchbuffers then even the execbuf extension doesn't need to exist
anymore, because the virtual parallel engine already contains all the
needed information.

And sure for some backends at least (definitely execlist) we'd need to
create a bunch of additional virtual engines behind that virtual engine.
But they'd be entirely hidden, and not visible to userspace nor the higher
levels.

What am I missing?
-Daniel

>  #define I915_DEFINE_CONTEXT_PARAM_ENGINES(name__, N__) struct { \
>  	__u64 extensions; \
>  	struct i915_engine_class_instance engines[N__]; \
> -- 
> 2.28.0
> 
> _______________________________________________
> Intel-gfx mailing list
> Intel-gfx@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/intel-gfx

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [Intel-gfx] [RFC PATCH 1/5] drm/doc/rfc: i915 GuC submission / DRM scheduler integration plan
  2021-05-11 14:58       ` Daniel Stone
@ 2021-05-11 15:12         ` Daniel Vetter
  -1 siblings, 0 replies; 41+ messages in thread
From: Daniel Vetter @ 2021-05-11 15:12 UTC (permalink / raw)
  To: Daniel Stone
  Cc: Matthew Brost, intel-gfx, dri-devel, carl.zhang, Jason Ekstrand,
	Vetter, Daniel

On Tue, May 11, 2021 at 03:58:43PM +0100, Daniel Stone wrote:
> Hi,
> 
> On Tue, 11 May 2021 at 15:34, Daniel Vetter <daniel@ffwll.ch> wrote:
> > On Thu, May 06, 2021 at 10:30:45AM -0700, Matthew Brost wrote:
> > > +No major changes are required to the uAPI for basic GuC submission. The only
> > > +change is a new scheduler attribute: I915_SCHEDULER_CAP_STATIC_PRIORITY_MAP.
> > > +This attribute indicates the 2k i915 user priority levels are statically mapped
> > > +into 3 levels as follows:
> > > +
> > > +* -1k to -1 Low priority
> > > +* 0 Medium priority
> > > +* 1 to 1k High priority
> > > +
> > > +This is needed because the GuC only has 4 priority bands. The highest priority
> > > +band is reserved with the kernel. This aligns with the DRM scheduler priority
> > > +levels too.
> >
> > Please Cc: mesa and get an ack from Jason Ekstrand or Ken Graunke on this,
> > just to be sure.
> 
> A reference to the actual specs this targets would help. I don't have
> oneAPI to hand if it's relevant, but the two in graphics world are
> https://www.khronos.org/registry/EGL/extensions/IMG/EGL_IMG_context_priority.txt
> and https://www.khronos.org/registry/vulkan/specs/1.2-extensions/html/chap5.html#devsandqueues-priority
> - both of them pretty much say that the implementation may do anything
> or nothing at all, so this isn't a problem for spec conformance, only
> a matter of user priority (sorry).

Good point, Matt please also include the level0 spec here (aside from
egl/vk extensions). Might need to ping Michal Mrozek internally and cc:
him on this one here too.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [Intel-gfx] [RFC PATCH 5/5] drm/i915: Update execbuf IOCTL to accept N BBs
  2021-05-06 17:30   ` [Intel-gfx] " Matthew Brost
@ 2021-05-11 15:13     ` Daniel Vetter
  -1 siblings, 0 replies; 41+ messages in thread
From: Daniel Vetter @ 2021-05-11 15:13 UTC (permalink / raw)
  To: Matthew Brost
  Cc: jason.ekstrand, daniel.vetter, intel-gfx, dri-devel, carl.zhang

On Thu, May 06, 2021 at 10:30:49AM -0700, Matthew Brost wrote:
> Add I915_EXEC_NUMBER_BB_* to drm_i915_gem_execbuffer2.flags which allows
> submitting N BBs per IOCTL.
> 
> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> Cc: Tony Ye <tony.ye@intel.com>
> CC: Carl Zhang <carl.zhang@intel.com>
> Cc: Daniel Vetter <daniel.vetter@intel.com>
> Cc: Jason Ekstrand <jason@jlekstrand.net>
> Signed-off-by: Matthew Brost <matthew.brost@intel.com>

I dropped my big question on the previous patch already, I'll check this
out again when it's all squashed into the parallel extension patch so we
have everything in one commit.
-Daniel
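
For completeness, a sketch of what a full-width submission could look like with
the helpers proposed below; i915_execbuffer2_set_number_bb() is this patch's
proposal, the rest is existing uAPI, and the object list is assumed to already
have the two batches placed last (i.e. no I915_EXEC_BATCH_FIRST):

#include <stdint.h>
#include <sys/ioctl.h>
#include <drm/i915_drm.h>

static int submit_parallel_pair(int fd, __u32 ctx_id,
				struct drm_i915_gem_exec_object2 *objects,
				unsigned int num_objects)
{
	struct drm_i915_gem_execbuffer2 execbuf = {
		.buffers_ptr = (uintptr_t)objects,
		.buffer_count = num_objects,
	};

	i915_execbuffer2_set_context_id(execbuf, ctx_id);
	i915_execbuffer2_set_number_bb(execbuf, 2);	/* proposed helper */

	return ioctl(fd, DRM_IOCTL_I915_GEM_EXECBUFFER2, &execbuf);
}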

> ---
>  include/uapi/drm/i915_drm.h | 21 ++++++++++++++++++++-
>  1 file changed, 20 insertions(+), 1 deletion(-)
> 
> diff --git a/include/uapi/drm/i915_drm.h b/include/uapi/drm/i915_drm.h
> index 0175b12b33b8..d3072cad4a7e 100644
> --- a/include/uapi/drm/i915_drm.h
> +++ b/include/uapi/drm/i915_drm.h
> @@ -1291,7 +1291,26 @@ struct drm_i915_gem_execbuffer2 {
>   */
>  #define I915_EXEC_USE_EXTENSIONS	(1 << 21)
>  
> -#define __I915_EXEC_UNKNOWN_FLAGS (-(I915_EXEC_USE_EXTENSIONS << 1))
> +/*
> + * Number of BBs in the execbuf2 IOCTL minus 1, used to submit more than one BB
> + * in a single execbuf2 IOCTL.
> + *
> + * Returns -EINVAL if more than 1 BB (i.e. a non-zero field value) is specified
> + * when I915_CONTEXT_ENGINES_EXT_PARALLEL_SUBMIT hasn't been called on the gem
> + * context first. Also returns -EINVAL if the gem context has been set up with
> + * I915_PARALLEL_NO_PREEMPT_MID_BATCH and the number of BBs is not equal to the
> + * total number of hardware contexts in the gem context.
> + */
> +#define I915_EXEC_NUMBER_BB_LSB		(22)
> +#define I915_EXEC_NUMBER_BB_MASK	(0x3f << I915_EXEC_NUMBER_BB_LSB)
> +#define I915_EXEC_NUMBER_BB_MSB		(27)
> +#define i915_execbuffer2_set_number_bb(eb2, num_bb) \
> +	(eb2).flags = ((eb2).flags & ~I915_EXEC_NUMBER_BB_MASK) | \
> +	(((num_bb - 1) << I915_EXEC_NUMBER_BB_LSB) & I915_EXEC_NUMBER_BB_MASK)
> +#define i915_execbuffer2_get_number_bb(eb2) \
> +	((((eb2).flags & I915_EXEC_NUMBER_BB_MASK) >> I915_EXEC_NUMBER_BB_LSB) + 1)
> +
> +#define __I915_EXEC_UNKNOWN_FLAGS (-(1 << (I915_EXEC_NUMBER_BB_MSB + 1)))
>  
>  #define I915_EXEC_CONTEXT_ID_MASK	(0xffffffff)
>  #define i915_execbuffer2_set_context_id(eb2, context) \
> -- 
> 2.28.0
> 
> _______________________________________________
> Intel-gfx mailing list
> Intel-gfx@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/intel-gfx

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [Intel-gfx] [RFC PATCH 2/5] drm/doc/rfc: i915 new parallel submission uAPI plan
  2021-05-11 14:49     ` Daniel Vetter
@ 2021-05-11 17:51       ` Matthew Brost
  -1 siblings, 0 replies; 41+ messages in thread
From: Matthew Brost @ 2021-05-11 17:51 UTC (permalink / raw)
  To: Daniel Vetter
  Cc: jason.ekstrand, daniel.vetter, intel-gfx, dri-devel, carl.zhang

On Tue, May 11, 2021 at 04:49:58PM +0200, Daniel Vetter wrote:
> On Thu, May 06, 2021 at 10:30:46AM -0700, Matthew Brost wrote:
> > Add entry fpr i915 new parallel submission uAPI plan.
> > 
> > Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> > Cc: Tony Ye <tony.ye@intel.com>
> > CC: Carl Zhang <carl.zhang@intel.com>
> > Cc: Daniel Vetter <daniel.vetter@intel.com>
> > Cc: Jason Ekstrand <jason@jlekstrand.net>
> > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > ---
> >  Documentation/gpu/rfc/i915_scheduler.rst | 56 +++++++++++++++++++++++-
> >  1 file changed, 54 insertions(+), 2 deletions(-)
> > 
> > diff --git a/Documentation/gpu/rfc/i915_scheduler.rst b/Documentation/gpu/rfc/i915_scheduler.rst
> > index fa6780a11c86..e3455b33edfe 100644
> > --- a/Documentation/gpu/rfc/i915_scheduler.rst
> > +++ b/Documentation/gpu/rfc/i915_scheduler.rst
> > @@ -13,7 +13,8 @@ i915 with the DRM scheduler is:
> >  	  modparam enable_guc
> >  	* Lots of rework will need to be done to integrate with DRM scheduler so
> >  	  no need to nit pick everything in the code, it just should be
> > -	  functional and not regress execlists
> > +	  functional, no major coding style / layering errors, and not regress
> > +	  execlists
> 
> I guess this hunk should be in the previous patch?
> 

Yep, noticed this after sending.

> >  	* Update IGTs / selftests as needed to work with GuC submission
> >  	* Enable CI on supported platforms for a baseline
> >  	* Rework / get CI heathly for GuC submission in place as needed
> > @@ -67,4 +68,55 @@ levels too.
> >  
> >  New parallel submission uAPI
> >  ============================
> > -Details to come in a following patch.
> > +The existing bonding uAPI is completely broken with GuC submission because
> > +whether a submission is a single context submit or parallel submit isn't known
> > +until execbuf time activated via the I915_SUBMIT_FENCE. To submit multiple
> > +contexts in parallel with the GuC the context must be explictly registered with
> > +N contexts and all N contexts must be submitted in a single command to the GuC.
> > +This interfaces doesn't support dynamically changing between N contexts as the
> > +bonding uAPI does. Hence the need for a new parallel submission interface. Also
> > +the legacy bonding uAPI is quite confusing and not intuitive at all.
> 
> I think you should sit together with Jason on irc or so for a bit and get
> an earful of how it's all broken irrespective of GuC submission or not.
> Just to hammer in our case :-)
>

Sounds like a fun conversation, will do.
 
> > +
> > +The new parallel submission uAPI consists of 3 parts:
> > +
> > +* Export engines logical mapping
> > +* A 'set_parallel' extension to configure contexts for parallel
> > +  submission
> > +* Extend execbuf2 IOCTL to support submitting N BBs in a single IOCTL
> > +
> > +Export engines logical mapping
> > +------------------------------
> > +Certain use cases require BBs to be placed on engine instances in logical order
> > +(e.g. split-frame on gen11+). The logical mapping of engine instances can change
> > +based on fusing. Rather than making UMDs be aware of fusing, simply expose the
> > +logical mapping with the existing query engine info IOCTL. Also the GuC
> > +submission interface currently only supports submitting multiple contexts to
> > +engines in logical order.
> 
> Maybe highlight more that this is a new restriction with GuC compared to
> execlist, which is why we need to expose this information to userspace.
> Also on the platforms thus far supported in upstream there's at most 2
> engines of the same type, so really not an issue.
>

Sure. This is a limitation of the GuC interface, and it really isn't needed
unless we have more than 2 engines of the same type.
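
To make the query side concrete, the lookup would be roughly the sketch below;
the two-pass query is the existing engine info uAPI, while the flag / field
names for the logical instance are placeholders until patch 3 settles:

#include <stdint.h>
#include <stdlib.h>
#include <xf86drm.h>
#include <drm/i915_drm.h>

/* Two-pass query: first ask for the size, then for the engine array. */
static struct drm_i915_query_engine_info *query_engines(int fd)
{
	struct drm_i915_query_item item = {
		.query_id = DRM_I915_QUERY_ENGINE_INFO,
	};
	struct drm_i915_query query = {
		.num_items = 1,
		.items_ptr = (uintptr_t)&item,
	};
	struct drm_i915_query_engine_info *info;

	if (drmIoctl(fd, DRM_IOCTL_I915_QUERY, &query) || item.length <= 0)
		return NULL;

	info = calloc(1, item.length);
	item.data_ptr = (uintptr_t)info;
	if (drmIoctl(fd, DRM_IOCTL_I915_QUERY, &query)) {
		free(info);
		return NULL;
	}
	return info;
}

/*
 * With patch 3 each drm_i915_engine_info additionally carries the logical
 * instance plus a validity bit, so the consumer side becomes something like:
 *
 *	if (info->engines[i].flags & I915_ENGINE_INFO_HAS_LOGICAL_INSTANCE)
 *		logical = info->engines[i].logical_instance;
 *
 * (the flag name above is made up; only logical_instance is spelled out in
 * the RFC doc)
 */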
 
> > +
> > +A single bit will be added to drm_i915_engine_info.flags indicating that the
> > +logical instance has been returned and a new field,
> > +drm_i915_engine_info.logical_instance, returns the logical instance.
> > +
> > +A 'set_parallel' extension to configure contexts for parallel submission
> > +------------------------------------------------------------------------
> > +The 'set_parallel' extension configures N contexts for parallel submission. It
> > +is setup step that should be called before using any of the contexts. See
> > +I915_CONTEXT_ENGINES_EXT_LOAD_BALANCE or I915_CONTEXT_ENGINES_EXT_BOND for
> > +similar existing examples. Once the N contexts are configured for parallel
> > +submission the execbuf2 IOCTL can be called submiting 1-N BBs in a single IOCTL.
> > +Although submitting less than N BBs is allowed it is not recommended as that
> > +will likely leave parts of the hardware reserved and idle. Initially only
> > +support GuC submission. Execlist support can be added later if needed.
> 
> Can we just require that you always submit N batchbuffers, or does this
> create a problem for userspace? Allowing things just because is generally
> not a good idea with uapi, it's better to limit and then allow when
> there's a need.
>

Yes, we can limit the submit to N batchbuffers. In fact I want to. I think 1-N
is a holdover from our internal discussions where we wanted this interface to be
able to do everything and anything.
 
> Ofc if we already have a need then explain why and that's all fine.
> 
> Also detailed comments on the kerneldoc I'll do in the next patches.
> 
> > +
> > +Add I915_CONTEXT_ENGINES_EXT_PARALLEL_SUBMIT and
> > +i915_context_engines_parallel_submit to the uAPI to implement this extension.
> > +
> > +Extend execbuf2 IOCTL to support submitting N BBs in a single IOCTL
> > +-------------------------------------------------------------------
> > +Contexts that have been configured with the 'set_parallel' extension are allowed
> > +to submit 1-N BBs in a single execbuf2 IOCTL. The BBs are either the last N
> > +objects in the drm_i915_gem_exec_object2 list or the first N if
> > +I915_EXEC_BATCH_FIRST is set.
> > +
> > +Add field 6 bit wide field to drm_i915_gem_exec_object2.flags which indicates
> > +the number of BBs - 1 included in the IOCTL.
> 
> Hm we have the nice execbuf extension chaining, any reason for not using
> that and instead opting for clever field packing?
>

I think we just drop this per the comments above. If we only allow N batch
buffers on a context configured with 'set_parallel' we really don't need to pass
in the number of buffers, do we?

Matt
 
> Cheers, Daniel
> -- 
> Daniel Vetter
> Software Engineer, Intel Corporation
> http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [Intel-gfx] [RFC PATCH 5/5] drm/i915: Update execbuf IOCTL to accept N BBs
  2021-05-11 15:13     ` Daniel Vetter
@ 2021-05-11 18:01       ` Matthew Brost
  -1 siblings, 0 replies; 41+ messages in thread
From: Matthew Brost @ 2021-05-11 18:01 UTC (permalink / raw)
  To: Daniel Vetter
  Cc: jason.ekstrand, daniel.vetter, intel-gfx, dri-devel, carl.zhang

On Tue, May 11, 2021 at 05:13:54PM +0200, Daniel Vetter wrote:
> On Thu, May 06, 2021 at 10:30:49AM -0700, Matthew Brost wrote:
> > Add I915_EXEC_NUMBER_BB_* to drm_i915_gem_execbuffer2.flags which allows
> > submitting N BBs per IOCTL.
> > 
> > Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> > Cc: Tony Ye <tony.ye@intel.com>
> > CC: Carl Zhang <carl.zhang@intel.com>
> > Cc: Daniel Vetter <daniel.vetter@intel.com>
> > Cc: Jason Ekstrand <jason@jlekstrand.net>
> > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> 
> I dropped my big question on the previous patch already, I'll check this
> out again when it's all squashed into the parallel extension patch so we
> have everything in one commit.

I think we just drop this and only allow N BBs per IOCTL as discussed in patch
#2 of this series.

Matt

> -Daniel
> 
> > ---
> >  include/uapi/drm/i915_drm.h | 21 ++++++++++++++++++++-
> >  1 file changed, 20 insertions(+), 1 deletion(-)
> > 
> > diff --git a/include/uapi/drm/i915_drm.h b/include/uapi/drm/i915_drm.h
> > index 0175b12b33b8..d3072cad4a7e 100644
> > --- a/include/uapi/drm/i915_drm.h
> > +++ b/include/uapi/drm/i915_drm.h
> > @@ -1291,7 +1291,26 @@ struct drm_i915_gem_execbuffer2 {
> >   */
> >  #define I915_EXEC_USE_EXTENSIONS	(1 << 21)
> >  
> > -#define __I915_EXEC_UNKNOWN_FLAGS (-(I915_EXEC_USE_EXTENSIONS << 1))
> > +/*
> > + * Number of BB in execbuf2 IOCTL - 1, used to submit more than BB in a single
> > + * execbuf2 IOCTL.
> > + *
> > + * Return -EINVAL if more than 1 BB (value 0) is specified if
> > + * I915_CONTEXT_ENGINES_EXT_PARALLEL_SUBMIT hasn't been called on the gem
> > + * context first. Also returns -EINVAL if gem context has been setup with
> > + * I915_PARALLEL_NO_PREEMPT_MID_BATCH and the number BBs not equal to the total
> > + * number hardware contexts in the gem context.
> > + */
> > +#define I915_EXEC_NUMBER_BB_LSB		(22)
> > +#define I915_EXEC_NUMBER_BB_MASK	(0x3f << I915_EXEC_NUMBER_BB_LSB)
> > +#define I915_EXEC_NUMBER_BB_MSB		(27)
> > +#define i915_execbuffer2_set_number_bb(eb2, num_bb) \
> > +	(eb2).flags = ((eb2).flags & ~I915_EXEC_NUMBER_BB_MASK) | \
> > +	(((num_bb - 1) << I915_EXEC_NUMBER_BB_LSB) & I915_EXEC_NUMBER_BB_MASK)
> > +#define i915_execbuffer2_get_number_bb(eb2) \
> > +	((((eb2).flags & I915_EXEC_NUMBER_BB_MASK) >> I915_EXEC_NUMBER_BB_LSB) + 1)
> > +
> > +#define __I915_EXEC_UNKNOWN_FLAGS (-(1 << (I915_EXEC_NUMBER_BB_MSB + 1)))
> >  
> >  #define I915_EXEC_CONTEXT_ID_MASK	(0xffffffff)
> >  #define i915_execbuffer2_set_context_id(eb2, context) \
> > -- 
> > 2.28.0
> > 
> > _______________________________________________
> > Intel-gfx mailing list
> > Intel-gfx@lists.freedesktop.org
> > https://lists.freedesktop.org/mailman/listinfo/intel-gfx
> 
> -- 
> Daniel Vetter
> Software Engineer, Intel Corporation
> http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [Intel-gfx] [RFC PATCH 4/5] drm/i915: Introduce 'set parallel submit' extension
  2021-05-11 15:11     ` Daniel Vetter
@ 2021-05-11 18:44       ` Matthew Brost
  -1 siblings, 0 replies; 41+ messages in thread
From: Matthew Brost @ 2021-05-11 18:44 UTC (permalink / raw)
  To: Daniel Vetter
  Cc: jason.ekstrand, daniel.vetter, intel-gfx, dri-devel, carl.zhang

On Tue, May 11, 2021 at 05:11:44PM +0200, Daniel Vetter wrote:
> On Thu, May 06, 2021 at 10:30:48AM -0700, Matthew Brost wrote:
> > i915_drm.h updates for 'set parallel submit' extension.
> > 
> > Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> > Cc: Tony Ye <tony.ye@intel.com>
> > CC: Carl Zhang <carl.zhang@intel.com>
> > Cc: Daniel Vetter <daniel.vetter@intel.com>
> > Cc: Jason Ekstrand <jason@jlekstrand.net>
> > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > ---
> >  include/uapi/drm/i915_drm.h | 126 ++++++++++++++++++++++++++++++++++++
> >  1 file changed, 126 insertions(+)
> > 
> > diff --git a/include/uapi/drm/i915_drm.h b/include/uapi/drm/i915_drm.h
> > index 26d2e135aa31..0175b12b33b8 100644
> > --- a/include/uapi/drm/i915_drm.h
> > +++ b/include/uapi/drm/i915_drm.h
> > @@ -1712,6 +1712,7 @@ struct drm_i915_gem_context_param {
> >   * Extensions:
> >   *   i915_context_engines_load_balance (I915_CONTEXT_ENGINES_EXT_LOAD_BALANCE)
> >   *   i915_context_engines_bond (I915_CONTEXT_ENGINES_EXT_BOND)
> > + *   i915_context_engines_parallel_submit (I915_CONTEXT_ENGINES_EXT_PARALLEL_SUBMIT)
> 
> Hm just relalized, but I don't think this hyperlinsk correctly, and I'm
> also not sure this formats very well as a nice list. Using item lists
> should look pretty nice like we're doing for the various kms properties,
> e.g.
> 
> FOO:
>   Explain what FOO does
> 
> BAR:
>   Explain what BAR does. struct bar also automatically generates a link
> 
> Please check with make htmldocs and polish this a bit (might need a small
> prep patch).
> 

I agree the doc should look nice. To get there I might need to chat with you on
IRC as I'm new to this. 

> >   */
> >  #define I915_CONTEXT_PARAM_ENGINES	0xa
> >  
> > @@ -1894,9 +1895,134 @@ struct i915_context_param_engines {
> >  	__u64 extensions; /* linked chain of extension blocks, 0 terminates */
> >  #define I915_CONTEXT_ENGINES_EXT_LOAD_BALANCE 0 /* see i915_context_engines_load_balance */
> >  #define I915_CONTEXT_ENGINES_EXT_BOND 1 /* see i915_context_engines_bond */
> > +#define I915_CONTEXT_ENGINES_EXT_PARALLEL_SUBMIT 2 /* see i915_context_engines_parallel_submit */
> >  	struct i915_engine_class_instance engines[0];
> >  } __attribute__((packed));
> >  
> > +/*
> > + * i915_context_engines_parallel_submit:
> > + *
> > + * Setup a gem context to allow multiple BBs to be submitted in a single execbuf
> > + * IOCTL. Those BBs will then be scheduled to run on the GPU in parallel.
> > + *
> > + * All hardware contexts in the engine set are configured for parallel
> > + * submission (i.e. once this gem context is configured for parallel submission,
> > + * all the hardware contexts, regardless if a BB is available on each individual
> > + * context, will be submitted to the GPU in parallel). A user can submit BBs to
> > + * subset of the hardware contexts, in a single execbuf IOCTL, but it is not
> > + * recommended as it may reserve physical engines with nothing to run on them.
> > + * Highly recommended to configure the gem context with N hardware contexts then
> > + * always submit N BBs in a single IOCTL.
> > + *
> > + * Their are two currently defined ways to control the placement of the
> > + * hardware contexts on physical engines: default behavior (no flags) and
> > + * I915_PARALLEL_IMPLICT_BONDS (a flag). More flags may be added the in the
> > + * future as new hardware / use cases arise. Details of how to use this
> > + * interface below above the flags.
> > + *
> > + * Returns -EINVAL if hardware context placement configuration invalid or if the
> > + * placement configuration isn't supported on the platform / submission
> > + * interface.
> > + * Returns -ENODEV if extension isn't supported on the platform / submission
> > + * inteface.
> > + */
> > +struct i915_context_engines_parallel_submit {
> > +	struct i915_user_extension base;
> 
> Ok this is good, since it makes sure we can't possible use this in
> CTX_SETPARAM.
> 

Yep, this is at context creation time. Technically you can still call this over
and over on the same gem context, but Jason is taking that ability away I
believe. I've also told the media team to set up the context once and not touch
it again.
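
For reference, the one-shot setup I have in mind for them is roughly the sketch
below, using the definitions from this patch as-is (the implicit bonds case with
two VCS engines from example 1; obviously none of this is merged yet):

#include <stdint.h>
#include <xf86drm.h>
#include <drm/i915_drm.h>

static uint32_t create_parallel_ctx(int fd)
{
	struct i915_context_engines_parallel_submit parallel = {
		.base.name = I915_CONTEXT_ENGINES_EXT_PARALLEL_SUBMIT,
		.flags = I915_PARALLEL_IMPLICT_BONDS,
	};
	I915_DEFINE_CONTEXT_PARAM_ENGINES(engines, 2) = {
		.extensions = (uintptr_t)&parallel,
		.engines = {
			{ .engine_class = I915_ENGINE_CLASS_VIDEO, .engine_instance = 0 },
			{ .engine_class = I915_ENGINE_CLASS_VIDEO, .engine_instance = 1 },
		},
	};
	struct drm_i915_gem_context_create_ext_setparam p_engines = {
		.base.name = I915_CONTEXT_CREATE_EXT_SETPARAM,
		.param = {
			.param = I915_CONTEXT_PARAM_ENGINES,
			.value = (uintptr_t)&engines,
			.size = sizeof(engines),
		},
	};
	struct drm_i915_gem_context_create_ext create = {
		.flags = I915_CONTEXT_CREATE_FLAGS_USE_EXTENSIONS,
		.extensions = (uintptr_t)&p_engines,
	};

	if (drmIoctl(fd, DRM_IOCTL_I915_GEM_CONTEXT_CREATE_EXT, &create))
		return 0;
	return create.ctx_id;
}

After that the context is configured once and the only thing left is execbuf
with one BB per hardware context.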

> > +
> > +/*
> > + * Default placement behvavior (currently unsupported):
> > + *
> > + * Rather than restricting parallel submission to a single class with a
> > + * logically contiguous placement (I915_PARALLEL_IMPLICT_BONDS), add a mode that
> > + * enables parallel submission across multiple engine classes. In this case each
> > + * context's logical engine mask indicates where that context can placed. It is
> > + * implied in this mode that all contexts have mutual exclusive placement (e.g.
> > + * if one context is running CS0 no other contexts can run on CS0).
> > + *
> > + * Example 1 pseudo code:
> > + * CSX[Y] = engine class X, logical instance Y
> > + * INVALID = I915_ENGINE_CLASS_INVALID, I915_ENGINE_CLASS_INVALID_NONE
> > + * set_engines(INVALID, INVALID)
> > + * set_load_balance(engine_index=0, num_siblings=2, engines=CS0[0],CS0[1])
> > + * set_load_balance(engine_index=1, num_siblings=2, engines=CS1[0],CS1[1])
> > + * set_parallel()
> > + *
> > + * Results in the following valid placements:
> > + * CS0[0], CS1[0]
> > + * CS0[0], CS1[1]
> > + * CS0[1], CS1[0]
> > + * CS0[1], CS1[1]
> > + *
> > + * Example 2 pseudo code:
> > + * CS[X] = generic engine of same class, logical instance X
> > + * INVALID = I915_ENGINE_CLASS_INVALID, I915_ENGINE_CLASS_INVALID_NONE
> > + * set_engines(INVALID, INVALID)
> > + * set_load_balance(engine_index=0, num_siblings=3, engines=CS[0],CS[1],CS[2])
> > + * set_load_balance(engine_index=1, num_siblings=3, engines=CS[0],CS[1],CS[2])
> > + * set_parallel()
> > + *
> > + * Results in the following valid placements:
> > + * CS[0], CS[1]
> > + * CS[0], CS[2]
> > + * CS[1], CS[0]
> > + * CS[1], CS[2]
> > + * CS[2], CS[0]
> > + * CS[2], CS[1]
> > + *
> > + * This enables a use case where all engines are created equally, we don't care
> > + * where they are scheduled, we just want a certain number of resources, for
> > + * those resources to be scheduled in parallel, and possibly across multiple
> > + * engine classes.
> > + */
> > +
> > +/*
> > + * I915_PARALLEL_IMPLICT_BONDS - Create implict bonds between each context.
> > + * Each context must have the same number sibling and bonds are implictly create
> > + * of the siblings.
> > + *
> > + * All of the below examples are in logical space.
> > + *
> > + * Example 1 pseudo code:
> > + * CS[X] = generic engine of same class, logical instance X
> > + * set_engines(CS[0], CS[1])
> > + * set_parallel(flags=I915_PARALLEL_IMPLICT_BONDS)
> > + *
> > + * Results in the following valid placements:
> > + * CS[0], CS[1]
> > + *
> > + * Example 2 pseudo code:
> > + * CS[X] = generic engine of same class, logical instance X
> > + * INVALID = I915_ENGINE_CLASS_INVALID, I915_ENGINE_CLASS_INVALID_NONE
> > + * set_engines(INVALID, INVALID)
> > + * set_load_balance(engine_index=0, num_siblings=2, engines=CS[0],CS[2])
> > + * set_load_balance(engine_index=1, num_siblings=2, engines=CS[1],CS[3])
> > + * set_parallel(flags=I915_PARALLEL_IMPLICT_BONDS)
> > + *
> > + * Results in the following valid placements:
> > + * CS[0], CS[1]
> > + * CS[2], CS[3]
> > + *
> > + * This enables a use case where all engines are not equal and certain placement
> > + * rules are required (i.e. split-frame requires all contexts to be placed in a
> > + * logically contiguous order on the VCS engines on gen11+ platforms). This use
> > + * case (logically contiguous placement, within a single engine class) is
> > + * supported when using GuC submission. Execlist mode could support all possible
> > + * bonding configurations but currently doesn't support this extension.
> > + */
> > +#define I915_PARALLEL_IMPLICT_BONDS		(1<<0)
> > +/*
> > + * Do not allow BBs to be preempted mid BB rather insert coordinated preemption
> > + * points on all hardware contexts between each set of BBs. An example use case
> > + * of this feature is split-frame on gen11+ hardware. When using this feature a
> > + * BB must be submitted on each hardware context in the parallel gem context.
> > + * The execbuf2 IOCTL enforces the user adheres to policy.
> > + */
> > +#define I915_PARALLEL_NO_PREEMPT_MID_BATCH	(1<<1)
> > +#define I915_PARALLEL_UNKNOWN_FLAGS  (-(I915_PARALLEL_NO_PREEMPT_MID_BATCH << 1))
> > +	__u64 flags; /* all undefined flags must be zero */
> > +	__u64 mbz64[4]; /* reserved for future use; must be zero */
> > +} __attribute__ ((packed));
> 
> Ok I'm having some serious questions. This looks way too much like it's
> inspired by bonded submission, and given we're tossing bonded submission
> we need to make sure we're doing this for good independent reasons and not
> just for intertia.
> 

You are not wrong here; the bonded submission interface was a factor in
designing this one.

> What I expected looking at how media-driver uses bonded submit currently
> is:
> 
> - We create a parallel submit engine, which occupies a virtual engine
>   slot. This parallel virtual engine contains all the information we need,
>   i.e. the flags you have above, but also how many engines run in parallel
>   and how each of those can be load-balanced. So probably a full NxM
>   matrix of physical engines needed.
> 

Internally we need all this information broken out into individual structures,
at least with the current implementation. We need N ring buffers, N timelines, N
LRCs, N HWSPs, etc. Each of these sets is encapsulated by a 'struct
intel_context', which occupies a slot. Could we create a super object with N
'struct intel_context'? Sure. I'm just not sure what that buys us, and IMO it
creates an inconsistent uAPI.
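
Just to make the comparison concrete, the 'super object' option would be
something along the lines of the sketch below (purely illustrative, nothing
like this exists today):

/* Hypothetical parent wrapping the N backing contexts. */
struct intel_parallel_context {
	unsigned int width;	/* N */
	/* each child still brings its own ring, timeline, LRC, HWSP, ... */
	struct intel_context *children[];
};

The question is then just whether that parent also gets exposed as a single
slot in the engine map.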

> - Execbuf uses that parallel virtual engine to submit all N batchbuffers
>   in one go.
> 

Whether we expose 1 or N engines doesn't really matter, does it? Either way the
entire GEM context is configured for N BBs in a single IOCTL.

> - This means we don't create virtual engines (or physical engine mappings)
>   for all the individual pieces in a parallel engine. That's a concept
>   from bonded submission, and I think that needs to go.
> 

Again, this isn't strictly true - we need N internal backing structures.

> - More important not having a parallel virtual engine breaks our already
>   badly confusing gem ctx api. Ignoring parallel/bonded submit the gem ctx
>   is just a container object, which points at a bunch of engines (plus the
>   VM and a few other things). Having parallel context something that sits
>   at the gem ctx level, and not as an individual engine (of which you can
>   have multiple in the same gem ctx) breaks stuff. E.g. right the perf api
>   sits at the gem ctx level, so that you can capture all the perf data for
>   an entire workload spawning across multiple engines. If a workload now
>   needs multiple parallel engines we'd need multiple gem ctx, which breaks
>   this.

This uAPI allows only 1 parallel context per gem context, which isn't ideal. I'd
love to fix that, and collapsing a parallel context into a single engine slot
might be the way to do it.

> 
> So what I'd expect we'd have here is roughly:
> 
> struct i915_context_engines_parallel_submit {
> 	struct i915_user_extension base;
> 	__u64 flags;
> 	__u32 num_engines; /* N, must match what we submit in the execbuf */
> 	__u32 num_siblings; /* M, I'm assuming it's ok we require that siblings must match across the entire set of parallel engines */
> 	struct engine_info[]; /* NxM array of engine infos, pls fill in the right struct name :-) */
> };
> 
> If we then also require that you always submit the full width of N
> batchbuffers then even the execbuf extension doesn't need to exist
> anymore, because the virtual parallel engine already contains all the
> needed information.
> 
> And sure for some backends at least (definitely execlist) we'd need to
> create a bunch of additional virtual engines behind that virtual engine.
> But they'd be entirely hidden, and not visible to userspace nor the higher
> levels.
>
> What am I missing?

Not really, I think you got it. I think at the end of the day this really comes
down to whether we want to allow more than 1 parallel virtual engine per gem
context. If the answer is yes, we collapse a parallel virtual engine into a
single slot; if not, we leave it as is.
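
If we do collapse it, I read your proposal as something like the sketch below
(hypothetical, field names lifted from your mail, nothing implemented):

struct i915_context_engines_parallel_submit {
	struct i915_user_extension base;
	__u64 flags;
	__u32 num_engines;	/* N, must match the BB count at execbuf */
	__u32 num_siblings;	/* M possible placements per engine */
	/* N * M array, engines[i * num_siblings + j] */
	struct i915_engine_class_instance engines[];
};

i.e. one engine-map slot describes the whole parallel submit and the execbuf
width is implied by it.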

Matt

> -Daniel
> 
> >  #define I915_DEFINE_CONTEXT_PARAM_ENGINES(name__, N__) struct { \
> >  	__u64 extensions; \
> >  	struct i915_engine_class_instance engines[N__]; \
> > -- 
> > 2.28.0
> > 
> > _______________________________________________
> > Intel-gfx mailing list
> > Intel-gfx@lists.freedesktop.org
> > https://lists.freedesktop.org/mailman/listinfo/intel-gfx
> 
> -- 
> Daniel Vetter
> Software Engineer, Intel Corporation
> http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [Intel-gfx] [RFC PATCH 4/5] drm/i915: Introduce 'set parallel submit' extension
  2021-05-11 18:44       ` Matthew Brost
@ 2021-05-12  8:34         ` Daniel Vetter
  -1 siblings, 0 replies; 41+ messages in thread
From: Daniel Vetter @ 2021-05-12  8:34 UTC (permalink / raw)
  To: Matthew Brost
  Cc: intel-gfx, dri-devel, carl.zhang, jason.ekstrand, daniel.vetter

On Tue, May 11, 2021 at 11:44:28AM -0700, Matthew Brost wrote:
> On Tue, May 11, 2021 at 05:11:44PM +0200, Daniel Vetter wrote:
> > On Thu, May 06, 2021 at 10:30:48AM -0700, Matthew Brost wrote:
> > > i915_drm.h updates for 'set parallel submit' extension.
> > > 
> > > Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> > > Cc: Tony Ye <tony.ye@intel.com>
> > > CC: Carl Zhang <carl.zhang@intel.com>
> > > Cc: Daniel Vetter <daniel.vetter@intel.com>
> > > Cc: Jason Ekstrand <jason@jlekstrand.net>
> > > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > > ---
> > >  include/uapi/drm/i915_drm.h | 126 ++++++++++++++++++++++++++++++++++++
> > >  1 file changed, 126 insertions(+)
> > > 
> > > diff --git a/include/uapi/drm/i915_drm.h b/include/uapi/drm/i915_drm.h
> > > index 26d2e135aa31..0175b12b33b8 100644
> > > --- a/include/uapi/drm/i915_drm.h
> > > +++ b/include/uapi/drm/i915_drm.h
> > > @@ -1712,6 +1712,7 @@ struct drm_i915_gem_context_param {
> > >   * Extensions:
> > >   *   i915_context_engines_load_balance (I915_CONTEXT_ENGINES_EXT_LOAD_BALANCE)
> > >   *   i915_context_engines_bond (I915_CONTEXT_ENGINES_EXT_BOND)
> > > + *   i915_context_engines_parallel_submit (I915_CONTEXT_ENGINES_EXT_PARALLEL_SUBMIT)
> > 
> > Hm just relalized, but I don't think this hyperlinsk correctly, and I'm
> > also not sure this formats very well as a nice list. Using item lists
> > should look pretty nice like we're doing for the various kms properties,
> > e.g.
> > 
> > FOO:
> >   Explain what FOO does
> > 
> > BAR:
> >   Explain what BAR does. struct bar also automatically generates a link
> > 
> > Please check with make htmldocs and polish this a bit (might need a small
> > prep patch).
> > 
> 
> I agree the doc should look nice. To get there I might need to chat with you on
> IRC as I'm new to this. 
> 
> > >   */
> > >  #define I915_CONTEXT_PARAM_ENGINES	0xa
> > >  
> > > @@ -1894,9 +1895,134 @@ struct i915_context_param_engines {
> > >  	__u64 extensions; /* linked chain of extension blocks, 0 terminates */
> > >  #define I915_CONTEXT_ENGINES_EXT_LOAD_BALANCE 0 /* see i915_context_engines_load_balance */
> > >  #define I915_CONTEXT_ENGINES_EXT_BOND 1 /* see i915_context_engines_bond */
> > > +#define I915_CONTEXT_ENGINES_EXT_PARALLEL_SUBMIT 2 /* see i915_context_engines_parallel_submit */
> > >  	struct i915_engine_class_instance engines[0];
> > >  } __attribute__((packed));
> > >  
> > > +/*
> > > + * i915_context_engines_parallel_submit:
> > > + *
> > > + * Setup a gem context to allow multiple BBs to be submitted in a single execbuf
> > > + * IOCTL. Those BBs will then be scheduled to run on the GPU in parallel.
> > > + *
> > > + * All hardware contexts in the engine set are configured for parallel
> > > + * submission (i.e. once this gem context is configured for parallel submission,
> > > + * all the hardware contexts, regardless if a BB is available on each individual
> > > + * context, will be submitted to the GPU in parallel). A user can submit BBs to
> > > + * subset of the hardware contexts, in a single execbuf IOCTL, but it is not
> > > + * recommended as it may reserve physical engines with nothing to run on them.
> > > + * Highly recommended to configure the gem context with N hardware contexts then
> > > + * always submit N BBs in a single IOCTL.
> > > + *
> > > + * Their are two currently defined ways to control the placement of the
> > > + * hardware contexts on physical engines: default behavior (no flags) and
> > > + * I915_PARALLEL_IMPLICT_BONDS (a flag). More flags may be added the in the
> > > + * future as new hardware / use cases arise. Details of how to use this
> > > + * interface below above the flags.
> > > + *
> > > + * Returns -EINVAL if hardware context placement configuration invalid or if the
> > > + * placement configuration isn't supported on the platform / submission
> > > + * interface.
> > > + * Returns -ENODEV if extension isn't supported on the platform / submission
> > > + * inteface.
> > > + */
> > > +struct i915_context_engines_parallel_submit {
> > > +	struct i915_user_extension base;
> > 
> > Ok this is good, since it makes sure we can't possible use this in
> > CTX_SETPARAM.
> > 
> 
> Yep, this is at context creation time. Technically you still can call this over
> and over on the same gem context but Jason is taking that ability away I
> believe. I've also told the media team to setup the context once and don't touch
> it again.

Only if you base your context param on drm_i915_gem_context_param, which
can be used both at create time with
drm_i915_gem_context_create_ext_setparam and with the CTX_SETPARAM ioctl.
But you don't, so this issue is fixed at the uapi design level and doesn't
need to interact with Jason's proto-ctx rework much.
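
Rough sketch for the record, with priority used just as an example of a param
that goes both ways (not validated):

#include <stdint.h>
#include <xf86drm.h>
#include <drm/i915_drm.h>

static void context_param_both_ways(int fd, uint32_t ctx_id)
{
	/* A drm_i915_gem_context_param based knob works on a live context ... */
	struct drm_i915_gem_context_param p = {
		.ctx_id = ctx_id,
		.param = I915_CONTEXT_PARAM_PRIORITY,
		.value = -512,
	};

	drmIoctl(fd, DRM_IOCTL_I915_GEM_CONTEXT_SETPARAM, &p);

	/*
	 * ... and the same struct is what CREATE_EXT_SETPARAM wraps at create
	 * time. The parallel_submit extension is not a context param at all;
	 * it only shows up as an entry in i915_context_param_engines.extensions,
	 * so there is no way to poke it directly through CTX_SETPARAM.
	 */
}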

There are still going to be some conflicts, so maybe ask Jason for a branch
and rebase GuC on top of that for the next round.

> 
> > > +
> > > +/*
> > > + * Default placement behvavior (currently unsupported):
> > > + *
> > > + * Rather than restricting parallel submission to a single class with a
> > > + * logically contiguous placement (I915_PARALLEL_IMPLICT_BONDS), add a mode that
> > > + * enables parallel submission across multiple engine classes. In this case each
> > > + * context's logical engine mask indicates where that context can placed. It is
> > > + * implied in this mode that all contexts have mutual exclusive placement (e.g.
> > > + * if one context is running CS0 no other contexts can run on CS0).
> > > + *
> > > + * Example 1 pseudo code:
> > > + * CSX[Y] = engine class X, logical instance Y
> > > + * INVALID = I915_ENGINE_CLASS_INVALID, I915_ENGINE_CLASS_INVALID_NONE
> > > + * set_engines(INVALID, INVALID)
> > > + * set_load_balance(engine_index=0, num_siblings=2, engines=CS0[0],CS0[1])
> > > + * set_load_balance(engine_index=1, num_siblings=2, engines=CS1[0],CS1[1])
> > > + * set_parallel()
> > > + *
> > > + * Results in the following valid placements:
> > > + * CS0[0], CS1[0]
> > > + * CS0[0], CS1[1]
> > > + * CS0[1], CS1[0]
> > > + * CS0[1], CS1[1]
> > > + *
> > > + * Example 2 pseudo code:
> > > + * CS[X] = generic engine of same class, logical instance X
> > > + * INVALID = I915_ENGINE_CLASS_INVALID, I915_ENGINE_CLASS_INVALID_NONE
> > > + * set_engines(INVALID, INVALID)
> > > + * set_load_balance(engine_index=0, num_siblings=3, engines=CS[0],CS[1],CS[2])
> > > + * set_load_balance(engine_index=1, num_siblings=3, engines=CS[0],CS[1],CS[2])
> > > + * set_parallel()
> > > + *
> > > + * Results in the following valid placements:
> > > + * CS[0], CS[1]
> > > + * CS[0], CS[2]
> > > + * CS[1], CS[0]
> > > + * CS[1], CS[2]
> > > + * CS[2], CS[0]
> > > + * CS[2], CS[1]
> > > + *
> > > + * This enables a use case where all engines are created equally, we don't care
> > > + * where they are scheduled, we just want a certain number of resources, for
> > > + * those resources to be scheduled in parallel, and possibly across multiple
> > > + * engine classes.
> > > + */
> > > +
> > > +/*
> > > + * I915_PARALLEL_IMPLICT_BONDS - Create implicit bonds between each context.
> > > + * Each context must have the same number of siblings and bonds are implicitly
> > > + * created between the siblings.
> > > + *
> > > + * All of the below examples are in logical space.
> > > + *
> > > + * Example 1 pseudo code:
> > > + * CS[X] = generic engine of same class, logical instance X
> > > + * set_engines(CS[0], CS[1])
> > > + * set_parallel(flags=I915_PARALLEL_IMPLICT_BONDS)
> > > + *
> > > + * Results in the following valid placements:
> > > + * CS[0], CS[1]
> > > + *
> > > + * Example 2 pseudo code:
> > > + * CS[X] = generic engine of same class, logical instance X
> > > + * INVALID = I915_ENGINE_CLASS_INVALID, I915_ENGINE_CLASS_INVALID_NONE
> > > + * set_engines(INVALID, INVALID)
> > > + * set_load_balance(engine_index=0, num_siblings=2, engines=CS[0],CS[2])
> > > + * set_load_balance(engine_index=1, num_siblings=2, engines=CS[1],CS[3])
> > > + * set_parallel(flags=I915_PARALLEL_IMPLICT_BONDS)
> > > + *
> > > + * Results in the following valid placements:
> > > + * CS[0], CS[1]
> > > + * CS[2], CS[3]
> > > + *
> > > + * This enables a use case where all engines are not equal and certain placement
> > > + * rules are required (i.e. split-frame requires all contexts to be placed in a
> > > + * logically contiguous order on the VCS engines on gen11+ platforms). This use
> > > + * case (logically contiguous placement, within a single engine class) is
> > > + * supported when using GuC submission. Execlist mode could support all possible
> > > + * bonding configurations but currently doesn't support this extension.
> > > + */
> > > +#define I915_PARALLEL_IMPLICT_BONDS		(1<<0)
> > > +/*
> > > + * Do not allow BBs to be preempted mid-BB; rather, insert coordinated preemption
> > > + * points on all hardware contexts between each set of BBs. An example use case
> > > + * of this feature is split-frame on gen11+ hardware. When using this feature a
> > > + * BB must be submitted on each hardware context in the parallel gem context.
> > > + * The execbuf2 IOCTL enforces that the user adheres to this policy.
> > > + */
> > > +#define I915_PARALLEL_NO_PREEMPT_MID_BATCH	(1<<1)
> > > +#define I915_PARALLEL_UNKNOWN_FLAGS  (-(I915_PARALLEL_NO_PREEMPT_MID_BATCH << 1))
> > > +	__u64 flags; /* all undefined flags must be zero */
> > > +	__u64 mbz64[4]; /* reserved for future use; must be zero */
> > > +} __attribute__ ((packed));
> > 
> > Ok I'm having some serious questions. This looks way too much like it's
> > inspired by bonded submission, and given we're tossing bonded submission
> > we need to make sure we're doing this for good independent reasons and not
> > just for inertia.
> > 
> 
> You are not wrong here, the bonding submission interface was a factor in
> designing this interface.
> 
> > What I expected looking at how media-driver uses bonded submit currently
> > is:
> > 
> > - We create a parallel submit engine, which occupies a virtual engine
> >   slot. This parallel virtual engine contains all the information we need,
> >   i.e. the flags you have above, but also how many engines run in parallel
> >   and how each of those can be load-balanced. So probably a full NxM
> >   matrix of physical engines needed.
> > 
> 
> Internally we need all this information broken out into individual structures,
> at least with the current implementation. We need N ring buffers, N timelines, N
> LRCs, N HWSPs, etc... All of this is encapsulated by a 'struct intel_context'
> which occupies a slot. Could we create a super object with N 'struct
> intel_context', sure. I'm just not sure what that buys us and IMO creates an
> inconsistent uAPI.

So if the implementation is too much work to adapt, here's a really nasty
trick: Currently we limit the engine slots to 64 in a gem context, because
that's the limit of the execbuf field. We could use the engine slots above
that for all these additional intel_context that we need underneath, at
least for execlist. Does GuC need them all too?

But the clean approach would be to have an intel_parallel_engine struct which
has all these pointers internally, I think.
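
Roughly something like this, purely as an illustration (none of these names or
fields exist in i915 today, this is just the shape of the idea):

struct intel_parallel_engine {
	unsigned int width;			/* N contexts submitted together */
	u64 flags;				/* the uapi flags from the extension */
	struct intel_context *siblings[];	/* each with its own ring, timeline,
						 * LRC, HWSP, ... hidden behind the
						 * one engine slot userspace sees */
};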

Same on the high-level execbuf flow, doing all that N times is silly. So
again I'd assume there's one overall i915_request that tracks the parallel
submission, and then maybe N subordinate i915_request for each piece
(execlist backend definitely needs those for scheduling, I didn't check
about GuC).

Also drm/scheduler only deals with a single thing too, so that way the
high level code would never need to know that there's actually N things
underneath doing the job.

> > - Execbuf uses that parallel virtual engine to submit all N batchbuffers
> >   in one go.
> > 
> 
> If we expose 1 or N engines it doesn't really matter, does it? Either way the
> entire GEM context is configured for N BBs in a single IOCTL.
> 
> > - This means we don't create virtual engines (or physical engine mappings)
> >   for all the individual pieces in a parallel engine. That's a concept
> >   from bonded submission, and I think that needs to go.
> > 
> 
> Again this isn't strictly true - we need N internal backing structures.

I didn't check the code, but iirc you said for the GuC backend you do
nothing until the last submit. Only then it's pushed into the GuC. That
sounds a bit silly, and by treating parallel submission as a single thing
(which might or might not be split in lower levels) this would go away.

But it also might be way too much churn, because there's a bunch of places
where we have to do this splitting. If it's all, then maybe just keeping
the engines around everywhere makes sense.

But also this is leaking implementation details into uapi, from umd pov
it's really 1 virtual engine that gets 1 execbuf call to submit N batches.
Leaking that we treat it as N engines underneath feels like a mistake.

> > - More important not having a parallel virtual engine breaks our already
> >   badly confusing gem ctx api. Ignoring parallel/bonded submit the gem ctx
> >   is just a container object, which points at a bunch of engines (plus the
> >   VM and a few other things). Having parallel context something that sits
> >   at the gem ctx level, and not as an individual engine (of which you can
> >   have multiple in the same gem ctx) breaks stuff. E.g. right now the perf api
> >   sits at the gem ctx level, so that you can capture all the perf data for
> >   an entire workload spanning across multiple engines. If a workload now
> >   needs multiple parallel engines we'd need multiple gem ctx, which breaks
> >   this.
> 
> This uAPI allows only 1 parallel context per gem context which isn't ideal. I'd
> love to fix this and changing a context to a single slot might be able to fix
> this.

Yeah this is essentially the main gripe I have with this. Everywhere else
you submit to a (gem_ctx_id, engine_slot) pair. Except for parallel
submit, where you submit to a gem_ctx_id and the engine slot doesn't
matter. That's a rather unfortunate uapi.

Now with bonded submit this made some sense (not that bonded submit itself
made much sense), since you did indeed submit N batchbuffers to N
(gem_ctx_id, engine_slot) pairs. But with parallel submit it's really just
one execbuf call.
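
For reference, the usual model is a single batch submitted to one
(gem_ctx_id, engine_slot) pair, roughly like this (a fragment; error handling
and buffer setup omitted, and bb_handle / engine_slot / gem_ctx_id / fd are
assumed to already exist):

	struct drm_i915_gem_exec_object2 obj = { .handle = bb_handle };
	struct drm_i915_gem_execbuffer2 eb = {
		.buffers_ptr = (uintptr_t)&obj,
		.buffer_count = 1,
		.flags = engine_slot,	/* index into the context's engine map */
		.rsvd1 = gem_ctx_id,	/* which gem context */
	};

	ioctl(fd, DRM_IOCTL_I915_GEM_EXECBUFFER2, &eb);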

> > So what I'd expect we'd have here is roughly:
> > 
> > struct i915_context_engines_parallel_submit {
> > 	struct i915_user_extension base;
> > 	__u64 flags;
> > 	__u32 num_engines; /* N, must match what we submit in the execbuf */
> > 	__u32 num_siblings; /* M, I'm assuming it's ok we require that siblings must match across the entire set of parallel engines */
> > 	struct engine_info[]; /* NxM array of engine infos, pls fill in the right struct name :-) */
> > };
> > 
> > If we then also require that you always submit the full width of N
> > batchbuffers then even the execbuf extension doesn't need to exist
> > anymore, because the virtual parallel engine already contains all the
> > needed information.
> > 
> > And sure for some backends at least (definitely execlist) we'd need to
> > create a bunch of additional virtual engines behind that virtual engine.
> > But they'd be entirely hidden, and not visible to userspace nor the higher
> > levels.
> >
> > What am I missing?
> 
> Not really, I think you got it. I think at the end of the day this really comes
> down to: do we want to allow more than 1 parallel virtual engine per gem context?
> If the answer is yes we collapse a parallel virtual engine into a single slot, if
> not we leave it as is.

Yup. So right now media uses one gem context per engine they need. Since
media doesn't care about perf/OA they could get shared VM by sharing the
VM across gem ctx, which they already do. So probably we could get away with
leaving parallel engines as a gem ctx level thing.

Also on the media-driver code the impact is nil since it's just a
different chain of context extensions in the same ioctl call.

Bigger picture is that Jason is quite unhappy with our gem ctx based
uapi, and his long term idea is to make gem ctx into a pure container
object with pointers to engines and a vm. And not something that has
relevance itself. Currently that's not the case for perf/OA, which works
on the gem ctx, and Jason's already unhappy about that one. So adding more
stuff on the gem ctx level feels a bit like a mistake.

Cheers, Daniel

> 
> Matt
> 
> > -Daniel
> > 
> > >  #define I915_DEFINE_CONTEXT_PARAM_ENGINES(name__, N__) struct { \
> > >  	__u64 extensions; \
> > >  	struct i915_engine_class_instance engines[N__]; \
> > > -- 
> > > 2.28.0
> > > 
> > > _______________________________________________
> > > Intel-gfx mailing list
> > > Intel-gfx@lists.freedesktop.org
> > > https://lists.freedesktop.org/mailman/listinfo/intel-gfx
> > 
> > -- 
> > Daniel Vetter
> > Software Engineer, Intel Corporation
> > http://blog.ffwll.ch

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [Intel-gfx] [RFC PATCH 4/5] drm/i915: Introduce 'set parallel submit' extension
  2021-05-12  8:34         ` Daniel Vetter
@ 2021-05-14 20:05           ` Matthew Brost
  -1 siblings, 0 replies; 41+ messages in thread
From: Matthew Brost @ 2021-05-14 20:05 UTC (permalink / raw)
  To: Daniel Vetter
  Cc: jason.ekstrand, daniel.vetter, intel-gfx, dri-devel, carl.zhang

On Wed, May 12, 2021 at 10:34:59AM +0200, Daniel Vetter wrote:
> On Tue, May 11, 2021 at 11:44:28AM -0700, Matthew Brost wrote:
> > On Tue, May 11, 2021 at 05:11:44PM +0200, Daniel Vetter wrote:
> > > On Thu, May 06, 2021 at 10:30:48AM -0700, Matthew Brost wrote:
> > > > i915_drm.h updates for 'set parallel submit' extension.
> > > > 
> > > > Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> > > > Cc: Tony Ye <tony.ye@intel.com>
> > > > CC: Carl Zhang <carl.zhang@intel.com>
> > > > Cc: Daniel Vetter <daniel.vetter@intel.com>
> > > > Cc: Jason Ekstrand <jason@jlekstrand.net>
> > > > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > > > ---
> > > >  include/uapi/drm/i915_drm.h | 126 ++++++++++++++++++++++++++++++++++++
> > > >  1 file changed, 126 insertions(+)
> > > > 
> > > > diff --git a/include/uapi/drm/i915_drm.h b/include/uapi/drm/i915_drm.h
> > > > index 26d2e135aa31..0175b12b33b8 100644
> > > > --- a/include/uapi/drm/i915_drm.h
> > > > +++ b/include/uapi/drm/i915_drm.h
> > > > @@ -1712,6 +1712,7 @@ struct drm_i915_gem_context_param {
> > > >   * Extensions:
> > > >   *   i915_context_engines_load_balance (I915_CONTEXT_ENGINES_EXT_LOAD_BALANCE)
> > > >   *   i915_context_engines_bond (I915_CONTEXT_ENGINES_EXT_BOND)
> > > > + *   i915_context_engines_parallel_submit (I915_CONTEXT_ENGINES_EXT_PARALLEL_SUBMIT)
> > > 
> > > Hm just realized, but I don't think this hyperlinks correctly, and I'm
> > > also not sure this formats very well as a nice list. Using item lists
> > > should look pretty nice like we're doing for the various kms properties,
> > > e.g.
> > > 
> > > FOO:
> > >   Explain what FOO does
> > > 
> > > BAR:
> > >   Explain what BAR does. struct bar also automatically generates a link
> > > 
> > > Please check with make htmldocs and polish this a bit (might need a small
> > > prep patch).
> > > 
> > 
> > I agree the doc should look nice. To get there I might need to chat with you on
> > IRC as I'm new to this. 
> > 
> > > >   */
> > > >  #define I915_CONTEXT_PARAM_ENGINES	0xa
> > > >  
> > > > @@ -1894,9 +1895,134 @@ struct i915_context_param_engines {
> > > >  	__u64 extensions; /* linked chain of extension blocks, 0 terminates */
> > > >  #define I915_CONTEXT_ENGINES_EXT_LOAD_BALANCE 0 /* see i915_context_engines_load_balance */
> > > >  #define I915_CONTEXT_ENGINES_EXT_BOND 1 /* see i915_context_engines_bond */
> > > > +#define I915_CONTEXT_ENGINES_EXT_PARALLEL_SUBMIT 2 /* see i915_context_engines_parallel_submit */
> > > >  	struct i915_engine_class_instance engines[0];
> > > >  } __attribute__((packed));
> > > >  
> > > > +/*
> > > > + * i915_context_engines_parallel_submit:
> > > > + *
> > > > + * Setup a gem context to allow multiple BBs to be submitted in a single execbuf
> > > > + * IOCTL. Those BBs will then be scheduled to run on the GPU in parallel.
> > > > + *
> > > > + * All hardware contexts in the engine set are configured for parallel
> > > > + * submission (i.e. once this gem context is configured for parallel submission,
> > > > + * all the hardware contexts, regardless if a BB is available on each individual
> > > > + * context, will be submitted to the GPU in parallel). A user can submit BBs to
> > > > + * subset of the hardware contexts, in a single execbuf IOCTL, but it is not
> > > > + * recommended as it may reserve physical engines with nothing to run on them.
> > > > + * Highly recommended to configure the gem context with N hardware contexts then
> > > > + * always submit N BBs in a single IOCTL.
> > > > + *
> > > > + * There are two currently defined ways to control the placement of the
> > > > + * hardware contexts on physical engines: default behavior (no flags) and
> > > > + * I915_PARALLEL_IMPLICT_BONDS (a flag). More flags may be added in the
> > > > + * future as new hardware / use cases arise. Details of how to use this
> > > > + * interface are documented above the flags below.
> > > > + *
> > > > + * Returns -EINVAL if the hardware context placement configuration is invalid
> > > > + * or if the placement configuration isn't supported on the platform /
> > > > + * submission interface.
> > > > + * Returns -ENODEV if the extension isn't supported on the platform / submission
> > > > + * interface.
> > > > + */
> > > > +struct i915_context_engines_parallel_submit {
> > > > +	struct i915_user_extension base;
> > > 
> > > Ok this is good, since it makes sure we can't possibly use this in
> > > CTX_SETPARAM.
> > > 
> > 
> > Yep, this is at context creation time. Technically you still can call this over
> > and over on the same gem context but Jason is taking that ability away I
> > believe. I've also told the media team to setup the context once and don't touch
> > it again.
> 
> Only if you base your context param on drm_i915_gem_context_param, which
> can be used both at create time with
> drm_i915_gem_context_create_ext_setparam and with the CTX_SETPARAM ioctl.
> But you don't, so this issue is fixed at the uapi design level and doesn't need
> to interact with Jason's proto-ctx rework much.
> 
> There's still going to be some conflicts, so maybe ask Jason for a branch
> and rebase GuC on top of that for the next round.
> 

Certainly this new uAPI is going to conflict. The basic GuC submission code
shouldn't though, as it doesn't touch the uAPI code at all. By the time the new
uAPI is posted I'd hope Jason's proto-ctx rework has landed, and I will rebase
onto the tip of DRM then.

> > 
> > > > +
> > > > +/*
> > > > + * Default placement behavior (currently unsupported):
> > > > + *
> > > > + * Rather than restricting parallel submission to a single class with a
> > > > + * logically contiguous placement (I915_PARALLEL_IMPLICT_BONDS), add a mode that
> > > > + * enables parallel submission across multiple engine classes. In this case each
> > > > + * context's logical engine mask indicates where that context can be placed. It
> > > > + * is implied in this mode that all contexts have mutually exclusive placement
> > > > + * (e.g. if one context is running on CS0 no other contexts can run on CS0).
> > > > + *
> > > > + * Example 1 pseudo code:
> > > > + * CSX[Y] = engine class X, logical instance Y
> > > > + * INVALID = I915_ENGINE_CLASS_INVALID, I915_ENGINE_CLASS_INVALID_NONE
> > > > + * set_engines(INVALID, INVALID)
> > > > + * set_load_balance(engine_index=0, num_siblings=2, engines=CS0[0],CS0[1])
> > > > + * set_load_balance(engine_index=1, num_siblings=2, engines=CS1[0],CS1[1])
> > > > + * set_parallel()
> > > > + *
> > > > + * Results in the following valid placements:
> > > > + * CS0[0], CS1[0]
> > > > + * CS0[0], CS1[1]
> > > > + * CS0[1], CS1[0]
> > > > + * CS0[1], CS1[1]
> > > > + *
> > > > + * Example 2 pseudo code:
> > > > + * CS[X] = generic engine of same class, logical instance X
> > > > + * INVALID = I915_ENGINE_CLASS_INVALID, I915_ENGINE_CLASS_INVALID_NONE
> > > > + * set_engines(INVALID, INVALID)
> > > > + * set_load_balance(engine_index=0, num_siblings=3, engines=CS[0],CS[1],CS[2])
> > > > + * set_load_balance(engine_index=1, num_siblings=3, engines=CS[0],CS[1],CS[2])
> > > > + * set_parallel()
> > > > + *
> > > > + * Results in the following valid placements:
> > > > + * CS[0], CS[1]
> > > > + * CS[0], CS[2]
> > > > + * CS[1], CS[0]
> > > > + * CS[1], CS[2]
> > > > + * CS[2], CS[0]
> > > > + * CS[2], CS[1]
> > > > + *
> > > > + * This enables a use case where all engines are created equally, we don't care
> > > > + * where they are scheduled, we just want a certain number of resources, for
> > > > + * those resources to be scheduled in parallel, and possibly across multiple
> > > > + * engine classes.
> > > > + */
> > > > +
> > > > +/*
> > > > + * I915_PARALLEL_IMPLICT_BONDS - Create implicit bonds between each context.
> > > > + * Each context must have the same number of siblings and bonds are implicitly
> > > > + * created between the siblings.
> > > > + *
> > > > + * All of the below examples are in logical space.
> > > > + *
> > > > + * Example 1 pseudo code:
> > > > + * CS[X] = generic engine of same class, logical instance X
> > > > + * set_engines(CS[0], CS[1])
> > > > + * set_parallel(flags=I915_PARALLEL_IMPLICT_BONDS)
> > > > + *
> > > > + * Results in the following valid placements:
> > > > + * CS[0], CS[1]
> > > > + *
> > > > + * Example 2 pseudo code:
> > > > + * CS[X] = generic engine of same class, logical instance X
> > > > + * INVALID = I915_ENGINE_CLASS_INVALID, I915_ENGINE_CLASS_INVALID_NONE
> > > > + * set_engines(INVALID, INVALID)
> > > > + * set_load_balance(engine_index=0, num_siblings=2, engines=CS[0],CS[2])
> > > > + * set_load_balance(engine_index=1, num_siblings=2, engines=CS[1],CS[3])
> > > > + * set_parallel(flags=I915_PARALLEL_IMPLICT_BONDS)
> > > > + *
> > > > + * Results in the following valid placements:
> > > > + * CS[0], CS[1]
> > > > + * CS[2], CS[3]
> > > > + *
> > > > + * This enables a use case where all engines are not equal and certain placement
> > > > + * rules are required (i.e. split-frame requires all contexts to be placed in a
> > > > + * logically contiguous order on the VCS engines on gen11+ platforms). This use
> > > > + * case (logically contiguous placement, within a single engine class) is
> > > > + * supported when using GuC submission. Execlist mode could support all possible
> > > > + * bonding configurations but currently doesn't support this extension.
> > > > + */
> > > > +#define I915_PARALLEL_IMPLICT_BONDS		(1<<0)
> > > > +/*
> > > > + * Do not allow BBs to be preempted mid-BB; rather, insert coordinated preemption
> > > > + * points on all hardware contexts between each set of BBs. An example use case
> > > > + * of this feature is split-frame on gen11+ hardware. When using this feature a
> > > > + * BB must be submitted on each hardware context in the parallel gem context.
> > > > + * The execbuf2 IOCTL enforces that the user adheres to this policy.
> > > > + */
> > > > +#define I915_PARALLEL_NO_PREEMPT_MID_BATCH	(1<<1)
> > > > +#define I915_PARALLEL_UNKNOWN_FLAGS  (-(I915_PARALLEL_NO_PREEMPT_MID_BATCH << 1))
> > > > +	__u64 flags; /* all undefined flags must be zero */
> > > > +	__u64 mbz64[4]; /* reserved for future use; must be zero */
> > > > +} __attribute__ ((packed));
> > > 
> > > Ok I'm having some serious questions. This looks way too much like it's
> > > inspired by bonded submission, and given we're tossing bonded submission
> > > we need to make sure we're doing this for good independent reasons and not
> > > just for inertia.
> > > 
> > 
> > You are not wrong here, the bonding submission interface was a factor in
> > designing this interface.
> > 
> > > What I expected looking at how media-driver uses bonded submit currently
> > > is:
> > > 
> > > - We create a parallel submit engine, which occupies a virtual engine
> > >   slot. This parallel virtual engine contains all the information we need,
> > >   i.e. the flags you have above, but also how many engines run in parallel
> > >   and how each of those can be load-balanced. So probably a full NxM
> > >   matrix of physical engines needed.
> > > 
> > 
> > Internally we need all this information broken out into individual structures,
> > at least with the current implementation. We need N ring buffers, N timelines, N
> > LRCs, N HWSPs, etc... All of this is encapsulated by a 'struct intel_context'
> > which occupies a slot. Could we create a super object with N 'struct
> > intel_context', sure. I'm just not sure what that buys us and IMO creates an
> > inconsistent uAPI.
> 
> So if the implementation is too much work to adapt, here's a really nasty
> trick: Currently we limit the engine slots to 64 in a gem context, because
> that's the limit of the execbuf field. We could use the engine slots above
> that for all these additional intel_context that we need underneath, at
> least for execlist. Does GuC need them all too?
> 
> But the clean approach would be to have an intel_parallel_engine struct which
> has all these pointers internally, I think.
> 
> Same on the high-level execbuf flow, doing all that N times is silly. So
> again I'd assume there's one overall i915_request that tracks the parallel
> submission, and then maybe N subordinate i915_request for each piece
> (execlist backend definitely needs those for scheduling, I didn't check
> about GuC).
> 
> Also drm/scheduler only deals with a single thing too, so that way the
> high level code would never need to know that there's actually N things
> underneath doing the job.
>

Again each i915_request points to a single (and different) intel_context,
timeline, lrc, ring, seqno, etc... The whole stack really treats these as
individual things aside from the excl slot where we form a composite fence. Not
saying we couldn't change this over time, but initially creating an
'i915_super_request' would be quite the undertaking, very invasive to the mid
layers of the stack, and I'm not sure what it buys us in the end.
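
Just to illustrate the composite fence point: this is not a claim about what
the GuC backend actually does, only one obvious way such a fence could be
built with the existing dma_fence_array helper (the function name is made up
for the sketch, error unwinding trimmed):

	#include <linux/slab.h>
	#include <linux/dma-fence-array.h>

	static struct dma_fence *
	parallel_composite_fence(struct i915_request **rq, int n)
	{
		struct dma_fence **fences;
		struct dma_fence_array *array;
		int i;

		fences = kmalloc_array(n, sizeof(*fences), GFP_KERNEL);
		if (!fences)
			return NULL;

		for (i = 0; i < n; i++)
			fences[i] = dma_fence_get(&rq[i]->fence);

		/* Signals once all N sibling requests have signalled. */
		array = dma_fence_array_create(n, fences,
					       dma_fence_context_alloc(1),
					       0, false);
		return array ? &array->base : NULL;
	}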

Once the parallel submit series gets posted you will be able to see that it is a
uAPI context setup extension, an execbuf IOCTL update to accept N batches (which
is basically a for loop), and the GuC backend being able to submit N batches at
once - the mid layers are almost completely untouched.

Lastly, if we need to support the parallel submit extension as proposed for
execlists, all we need to do is update the uAPI setup extension to configure the
contexts. If we create an 'i915_super_request' we would have a massive rework in
the execlist backend too.

> > > - Execbuf uses that parallel virtual engine to submit all N batchbuffers
> > >   in one go.
> > > 
> > 
> > If we expose 1 or N engines it doesn't really matter, does it? Either way the
> > entire GEM context is configured for N BBs in a single IOCTL.
> > 
> > > - This means we don't create virtual engines (or physical engine mappings)
> > >   for all the individual pieces in a parallel engine. That's a concept
> > >   from bonded submission, and I think that needs to go.
> > > 
> > 
> > Again this isn't strictly true - we need N internal backing structures.
> 
> I didn't check the code, but iirc you said for the GuC backend you do
> nothing until the last submit. Only then it's pushed into the GuC. That
> sounds a bit silly, and by treating parallel submission as a single thing
> (which might or might not be split in lower levels) this would go away.
>

We update internal state on each submit; the last submit is the one that
interacts with the GuC.
 
> But it also might be way too much churn, because there's a bunch of places
> where we have to do this splitting. If it's all, then maybe just keeping
> the engines around everywhere makes sense.
> 
> But also this is leaking implementation details into uapi, from umd pov
> it's really 1 virtual engine that gets 1 execbuf call to submit N batches.
> Leaking that we treat it as N engines underneath feels like a mistake.
>

To be clear, changing from N slots to 1 slot isn't that big of a deal. Changing
from N i915_requests to 1 is a *huge* deal.

N slots to 1 slot will just touch the uAPI setup extension and the execbuf
IOCTL.

N i915_requests to 1 will ripple throughout the entire stack.
 
> > > - More important not having a parallel virtual engine breaks our already
> > >   badly confusing gem ctx api. Ignoring parallel/bonded submit the gem ctx
> > >   is just a container object, which points at a bunch of engines (plus the
> > >   VM and a few other things). Having parallel context something that sits
> > >   at the gem ctx level, and not as an individual engine (of which you can
> > >   have multiple in the same gem ctx) breaks stuff. E.g. right now the perf api
> > >   sits at the gem ctx level, so that you can capture all the perf data for
> > >   an entire workload spanning across multiple engines. If a workload now
> > >   needs multiple parallel engines we'd need multiple gem ctx, which breaks
> > >   this.
> > 
> > This uAPI allows only 1 parallel context per gem context which isn't ideal. I'd
> > love to fix this and changing a context to a single slot might be able to fix
> > this.
> 
> Yeah this is essentially the main gripe I have with this. Everywhere else
> you submit to a (gem_ctx_id, engine_slot) pair. Except for parallel
> submit, where you submit to a gem_ctx_id and the engine slot doesn't
> matter. That's a rather unfortunate uapi.
> 

Yea this isn't ideal but we've kinda backed ourselves into a corner here at
least consistency wise.

As proposed we basically have 2 steps to configure a gem context:

1. Define placement rules (set_engines, set_load_balance)
2. Indicate this context is used for parallel submission (set_parallel)

What would the uAPI look like where each parallel context occupies a slot?

1. Define the number of slots (set_engines)
2. For each slot allow a virtual or parallel context (set_load_balance,
set_parallel)

The set_parallel would have to contain all the placement information for 2 to N
contexts, right? So each set_parallel is a chained extension too. Now we have a
two-level chain in our IOCTL.

e.g.

set_engines (3 slots) -> set_load_balance (slot 0) -> set_parallel (slot 1) -> set_load_balance (slot 2)
                                                                  |
                                                                  +-> placement for context 0 -> placement for context 1, etc...

IMO this seems like a bigger mess but I suppose it would work.
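
If we did collapse it into a single slot, the extension could plausibly end up
looking something like this - purely a sketch, every field below is invented
for illustration and nothing like it exists in i915_drm.h at this point:

	struct i915_context_engines_parallel_slot {
		struct i915_user_extension base;
		__u16 engine_index;	/* slot this parallel engine occupies */
		__u16 width;		/* N: BBs submitted per execbuf */
		__u16 num_siblings;	/* M: possible placements per context */
		__u16 mbz16;
		__u64 flags;		/* same flags as above */
		struct i915_engine_class_instance engines[]; /* N x M, flattened */
	} __attribute__((packed));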

As you say below, gem contexts are moving towards just being containers, so does
it really matter if a UMD has to create a gem context per parallel context? They
can still share an address space, pass fences between them, etc...
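
For completeness, sharing one address space across gem contexts is already just
a couple of ioctls today - rough sketch, error handling omitted, and fd / ctx_a
/ ctx_b are assumed to already exist:

	struct drm_i915_gem_vm_control vm = { 0 };
	struct drm_i915_gem_context_param p = { .param = I915_CONTEXT_PARAM_VM };

	ioctl(fd, DRM_IOCTL_I915_GEM_VM_CREATE, &vm);

	p.value = vm.vm_id;
	p.ctx_id = ctx_a;	/* e.g. the parallel gem context */
	ioctl(fd, DRM_IOCTL_I915_GEM_CONTEXT_SETPARAM, &p);
	p.ctx_id = ctx_b;	/* e.g. a second gem context for other engines */
	ioctl(fd, DRM_IOCTL_I915_GEM_CONTEXT_SETPARAM, &p);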

If we really want to go the direction of 1 slot per parallel context I can hack
up a PoC branch when I have time. 

Matt

> Now with bonded submit this made some sense (not that bonded submit itself
> made much sense), since you did indeed submit N batchbuffers to N
> (gem_ctx_id, engine_slot) pairs. But with parallel submit it's really just
> one execbuf call.
> 
> > > So what I'd expect we'd have here is roughly:
> > > 
> > > struct i915_context_engines_parallel_submit {
> > > 	struct i915_user_extension base;
> > > 	__u64 flags;
> > > 	__u32 num_engines; /* N, must match what we submit in the execbuf */
> > > 	__u32 num_siblings; /* M, I'm assuming it's ok we require that siblings must match across the entire set of parallel engines */
> > > 	struct engine_info[]; /* NxM array of engine infos, pls fill in the right struct name :-) */
> > > };
> > > 
> > > If we then also require that you always submit the full width of N
> > > batchbuffers then even the execbuf extension doesn't need to exist
> > > anymore, because the virtual parallel engine already contains all the
> > > needed information.
> > > 
> > > And sure for some backends at least (definitely execlist) we'd need to
> > > create a bunch of additional virtual engines behind that virtual engine.
> > > But they'd be entirely hidden, and not visible to userspace nor the higher
> > > levels.
> > >
> > > What am I missing?
> > 
> > Not really, I think you got it. I think at the end of the day this really comes
> > down to: do we want to allow more than 1 parallel virtual engine per gem context?
> > If the answer is yes we collapse a parallel virtual engine into a single slot, if
> > not we leave it as is.
> 
> Yup. So right now media uses one gem context per engine they need. Since
> media doesn't care about perf/OA they could get shared VM by sharing the
> VM across gem ctx, which they already do. So probably we could get away with
> leaving parallel engines as a gem ctx level thing.
> 
> Also on the media-driver code the impact is nil since it's just a
> different chain of context extensions in the same ioctl call.
> 
> Bigger picture is that Jason is quite unhappy with our gem ctx based
> uapi, and his long term idea is to make gem ctx into a pure container
> object with pointers to engines and a vm. And not something that has
> relevance itself. Currently that's not the case for perf/OA, which works
> on the gem ctx, and Jason's already unhappy about that one. So adding more
> stuff on the gem ctx level feels a bit like a mistake.
> 
> Cheers, Daniel
> 
> > 
> > Matt
> > 
> > > -Daniel
> > > 
> > > >  #define I915_DEFINE_CONTEXT_PARAM_ENGINES(name__, N__) struct { \
> > > >  	__u64 extensions; \
> > > >  	struct i915_engine_class_instance engines[N__]; \
> > > > -- 
> > > > 2.28.0
> > > > 
> > > > _______________________________________________
> > > > Intel-gfx mailing list
> > > > Intel-gfx@lists.freedesktop.org
> > > > https://lists.freedesktop.org/mailman/listinfo/intel-gfx
> > > 
> > > -- 
> > > Daniel Vetter
> > > Software Engineer, Intel Corporation
> > > http://blog.ffwll.ch
> 
> -- 
> Daniel Vetter
> Software Engineer, Intel Corporation
> http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [Intel-gfx] [RFC PATCH 4/5] drm/i915: Introduce 'set parallel submit' extension
@ 2021-05-14 20:05           ` Matthew Brost
  0 siblings, 0 replies; 41+ messages in thread
From: Matthew Brost @ 2021-05-14 20:05 UTC (permalink / raw)
  To: Daniel Vetter
  Cc: jason.ekstrand, daniel.vetter, intel-gfx, dri-devel, carl.zhang

On Wed, May 12, 2021 at 10:34:59AM +0200, Daniel Vetter wrote:
> On Tue, May 11, 2021 at 11:44:28AM -0700, Matthew Brost wrote:
> > On Tue, May 11, 2021 at 05:11:44PM +0200, Daniel Vetter wrote:
> > > On Thu, May 06, 2021 at 10:30:48AM -0700, Matthew Brost wrote:
> > > > i915_drm.h updates for 'set parallel submit' extension.
> > > > 
> > > > Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> > > > Cc: Tony Ye <tony.ye@intel.com>
> > > > CC: Carl Zhang <carl.zhang@intel.com>
> > > > Cc: Daniel Vetter <daniel.vetter@intel.com>
> > > > Cc: Jason Ekstrand <jason@jlekstrand.net>
> > > > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > > > ---
> > > >  include/uapi/drm/i915_drm.h | 126 ++++++++++++++++++++++++++++++++++++
> > > >  1 file changed, 126 insertions(+)
> > > > 
> > > > diff --git a/include/uapi/drm/i915_drm.h b/include/uapi/drm/i915_drm.h
> > > > index 26d2e135aa31..0175b12b33b8 100644
> > > > --- a/include/uapi/drm/i915_drm.h
> > > > +++ b/include/uapi/drm/i915_drm.h
> > > > @@ -1712,6 +1712,7 @@ struct drm_i915_gem_context_param {
> > > >   * Extensions:
> > > >   *   i915_context_engines_load_balance (I915_CONTEXT_ENGINES_EXT_LOAD_BALANCE)
> > > >   *   i915_context_engines_bond (I915_CONTEXT_ENGINES_EXT_BOND)
> > > > + *   i915_context_engines_parallel_submit (I915_CONTEXT_ENGINES_EXT_PARALLEL_SUBMIT)
> > > 
> > > Hm just realized, but I don't think this hyperlinks correctly, and I'm
> > > also not sure this formats very well as a nice list. Using item lists
> > > should look pretty nice like we're doing for the various kms properties,
> > > e.g.
> > > 
> > > FOO:
> > >   Explain what FOO does
> > > 
> > > BAR:
> > >   Explain what BAR does. struct bar also automatically generates a link
> > > 
> > > Please check with make htmldocs and polish this a bit (might need a small
> > > prep patch).
> > > 
> > 
> > I agree the doc should look nice. To get there I might need to chat with you on
> > IRC as I'm new to this. 
> > 
> > > >   */
> > > >  #define I915_CONTEXT_PARAM_ENGINES	0xa
> > > >  
> > > > @@ -1894,9 +1895,134 @@ struct i915_context_param_engines {
> > > >  	__u64 extensions; /* linked chain of extension blocks, 0 terminates */
> > > >  #define I915_CONTEXT_ENGINES_EXT_LOAD_BALANCE 0 /* see i915_context_engines_load_balance */
> > > >  #define I915_CONTEXT_ENGINES_EXT_BOND 1 /* see i915_context_engines_bond */
> > > > +#define I915_CONTEXT_ENGINES_EXT_PARALLEL_SUBMIT 2 /* see i915_context_engines_parallel_submit */
> > > >  	struct i915_engine_class_instance engines[0];
> > > >  } __attribute__((packed));
> > > >  
> > > > +/*
> > > > + * i915_context_engines_parallel_submit:
> > > > + *
> > > > + * Setup a gem context to allow multiple BBs to be submitted in a single execbuf
> > > > + * IOCTL. Those BBs will then be scheduled to run on the GPU in parallel.
> > > > + *
> > > > + * All hardware contexts in the engine set are configured for parallel
> > > > + * submission (i.e. once this gem context is configured for parallel submission,
> > > > + * all the hardware contexts, regardless if a BB is available on each individual
> > > > + * context, will be submitted to the GPU in parallel). A user can submit BBs to
> > > > + * subset of the hardware contexts, in a single execbuf IOCTL, but it is not
> > > > + * recommended as it may reserve physical engines with nothing to run on them.
> > > > + * Highly recommended to configure the gem context with N hardware contexts then
> > > > + * always submit N BBs in a single IOCTL.
> > > > + *
> > > > + * There are two currently defined ways to control the placement of the
> > > > + * hardware contexts on physical engines: default behavior (no flags) and
> > > > + * I915_PARALLEL_IMPLICT_BONDS (a flag). More flags may be added in the
> > > > + * future as new hardware / use cases arise. Details of how to use this
> > > > + * interface are documented above the flags below.
> > > > + *
> > > > + * Returns -EINVAL if the hardware context placement configuration is invalid
> > > > + * or if the placement configuration isn't supported on the platform /
> > > > + * submission interface.
> > > > + * Returns -ENODEV if the extension isn't supported on the platform / submission
> > > > + * interface.
> > > > + */
> > > > +struct i915_context_engines_parallel_submit {
> > > > +	struct i915_user_extension base;
> > > 
> > > Ok this is good, since it makes sure we can't possibly use this in
> > > CTX_SETPARAM.
> > > 
> > 
> > Yep, this is at context creation time. Technically you still can call this over
> > and over on the same gem context but Jason is taking that ability away I
> > believe. I've also told the media team to setup the context once and don't touch
> > it again.
> 
> Only if you base your context param on drm_i915_gem_context_param, which
> can be used both at create time with
> drm_i915_gem_context_create_ext_setparam and with the CTX_SETPARAM ioctl.
> But you don't, so this issue is fixed at the uapi design level and doesn't need
> to interact with Jason's proto-ctx rework much.
> 
> There's still going to be some conflicts, so maybe ask Jason for a branch
> and rebase GuC on top of that for the next round.
> 

Certainly this new uAPI is going to conflict. The basic GuC submission code
shouldn't though, as it doesn't touch the uAPI code at all. By the time the new
uAPI is posted I'd hope Jason's proto-ctx rework has landed, and I will rebase
onto the tip of DRM then.

> > 
> > > > +
> > > > +/*
> > > > + * Default placement behavior (currently unsupported):
> > > > + *
> > > > + * Rather than restricting parallel submission to a single class with a
> > > > + * logically contiguous placement (I915_PARALLEL_IMPLICT_BONDS), add a mode that
> > > > + * enables parallel submission across multiple engine classes. In this case each
> > > > + * context's logical engine mask indicates where that context can be placed. It
> > > > + * is implied in this mode that all contexts have mutually exclusive placement
> > > > + * (e.g. if one context is running on CS0 no other contexts can run on CS0).
> > > > + *
> > > > + * Example 1 pseudo code:
> > > > + * CSX[Y] = engine class X, logical instance Y
> > > > + * INVALID = I915_ENGINE_CLASS_INVALID, I915_ENGINE_CLASS_INVALID_NONE
> > > > + * set_engines(INVALID, INVALID)
> > > > + * set_load_balance(engine_index=0, num_siblings=2, engines=CS0[0],CS0[1])
> > > > + * set_load_balance(engine_index=1, num_siblings=2, engines=CS1[0],CS1[1])
> > > > + * set_parallel()
> > > > + *
> > > > + * Results in the following valid placements:
> > > > + * CS0[0], CS1[0]
> > > > + * CS0[0], CS1[1]
> > > > + * CS0[1], CS1[0]
> > > > + * CS0[1], CS1[1]
> > > > + *
> > > > + * Example 2 pseudo code:
> > > > + * CS[X] = generic engine of same class, logical instance X
> > > > + * INVALID = I915_ENGINE_CLASS_INVALID, I915_ENGINE_CLASS_INVALID_NONE
> > > > + * set_engines(INVALID, INVALID)
> > > > + * set_load_balance(engine_index=0, num_siblings=3, engines=CS[0],CS[1],CS[2])
> > > > + * set_load_balance(engine_index=1, num_siblings=3, engines=CS[0],CS[1],CS[2])
> > > > + * set_parallel()
> > > > + *
> > > > + * Results in the following valid placements:
> > > > + * CS[0], CS[1]
> > > > + * CS[0], CS[2]
> > > > + * CS[1], CS[0]
> > > > + * CS[1], CS[2]
> > > > + * CS[2], CS[0]
> > > > + * CS[2], CS[1]
> > > > + *
> > > > + * This enables a use case where all engines are created equally, we don't care
> > > > + * where they are scheduled, we just want a certain number of resources, for
> > > > + * those resources to be scheduled in parallel, and possibly across multiple
> > > > + * engine classes.
> > > > + */
> > > > +
> > > > +/*
> > > > + * I915_PARALLEL_IMPLICT_BONDS - Create implicit bonds between each context.
> > > > + * Each context must have the same number of siblings and bonds are implicitly
> > > > + * created between the siblings.
> > > > + *
> > > > + * All of the below examples are in logical space.
> > > > + *
> > > > + * Example 1 pseudo code:
> > > > + * CS[X] = generic engine of same class, logical instance X
> > > > + * set_engines(CS[0], CS[1])
> > > > + * set_parallel(flags=I915_PARALLEL_IMPLICT_BONDS)
> > > > + *
> > > > + * Results in the following valid placements:
> > > > + * CS[0], CS[1]
> > > > + *
> > > > + * Example 2 pseudo code:
> > > > + * CS[X] = generic engine of same class, logical instance X
> > > > + * INVALID = I915_ENGINE_CLASS_INVALID, I915_ENGINE_CLASS_INVALID_NONE
> > > > + * set_engines(INVALID, INVALID)
> > > > + * set_load_balance(engine_index=0, num_siblings=2, engines=CS[0],CS[2])
> > > > + * set_load_balance(engine_index=1, num_siblings=2, engines=CS[1],CS[3])
> > > > + * set_parallel(flags=I915_PARALLEL_IMPLICT_BONDS)
> > > > + *
> > > > + * Results in the following valid placements:
> > > > + * CS[0], CS[1]
> > > > + * CS[2], CS[3]
> > > > + *
> > > > + * This enables a use case where all engines are not equal and certain placement
> > > > + * rules are required (i.e. split-frame requires all contexts to be placed in a
> > > > + * logically contiguous order on the VCS engines on gen11+ platforms). This use
> > > > + * case (logically contiguous placement, within a single engine class) is
> > > > + * supported when using GuC submission. Execlist mode could support all possible
> > > > + * bonding configurations but currently doesn't support this extension.
> > > > + */
> > > > +#define I915_PARALLEL_IMPLICT_BONDS		(1<<0)
> > > > +/*
> > > > + * Do not allow BBs to be preempted mid BB rather insert coordinated preemption
> > > > + * points on all hardware contexts between each set of BBs. An example use case
> > > > + * of this feature is split-frame on gen11+ hardware. When using this feature a
> > > > + * BB must be submitted on each hardware context in the parallel gem context.
> > > > + * The execbuf2 IOCTL enforces the user adheres to policy.
> > > > + */
> > > > +#define I915_PARALLEL_NO_PREEMPT_MID_BATCH	(1<<1)
> > > > +#define I915_PARALLEL_UNKNOWN_FLAGS  (-(I915_PARALLEL_NO_PREEMPT_MID_BATCH << 1))
> > > > +	__u64 flags; /* all undefined flags must be zero */
> > > > +	__u64 mbz64[4]; /* reserved for future use; must be zero */
> > > > +} __attribute__ ((packed));
> > > 
> > > Ok I'm having some serious questions. This looks way too much like it's
> > > inspired by bonded submission, and given we're tossing bonded submission
> > > we need to make sure we're doing this for good independent reasons and not
> > > just for inertia.
> > > 
> > 
> > You are not wrong here, the bonding submission interface was a factor in
> > designing this interface.
> > 
> > > What I expected looking at how media-driver uses bonded submit currently
> > > is:
> > > 
> > > - We create a parallel submit engine, which occupies a virtual engine
> > >   slot. This parallel virtual engine contains all the information we need,
> > >   i.e. the flags you have above, but also how many engines run in parallel
> > >   and how each of those can be load-balanced. So probably a full NxM
> > >   matrix of physical engines needed.
> > > 
> > 
> > Internally we need all this information broken out into individual structures,
> > at least with the current implementation. We need N ring buffers, N timelines, N
> > LRCs, N HWSPs, etc... All of this is encapsulated by a 'struct intel_context'
> > which occupies a slot. Could we create a super object with N 'struct
> > intel_context', sure. I'm just not sure what that buys us and IMO creates an
> > inconsistent uAPI.
> 
> So if the implementation is too much work to adapt, here's a really nasty
> trick: Currently we limit the engine slots to 64 in a gem context, because
> that's the limit of the execbuf field. We could use the engine slots above
> that for all these additional intel_context that we need underneath, at
> least for execlist. Does GuC need them all too?
> 
> But a clean approach would be to have an intel_parallel_engine struct which
> has all these pointers internally I think.
> 
> Same on the high-level execbuf flow, doing all that N times is silly. So
> again I'd assume there's one overall i915_request that tracks the parallel
> submission, and then maybe N subordinate i915_request for each piece
> (execlist backend definitely needs those for scheduling, I didn't check
> about GuC).
> 
> Also drm/scheduler only deals with a single thing too, so that way the
> high level code would never need to know that there's actually N things
> underneath doing the job.
>

Again, each i915_request points to a single (and different) intel_context,
timeline, LRC, ring, seqno, etc... The whole stack really treats these as
individual things aside from the excl slot where we form a composite fence. Not
saying we couldn't change this over time, but initially creating an
'i915_super_request' would be quite the undertaking, very invasive to the mid
layers of the stack, and I'm not sure in the end what it buys us.
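
For reference, the composite fence mentioned here is roughly the following
kernel-side sketch, built on the generic dma_fence_array helpers; the function
name and the parameters are made up for illustration and this is not the actual
i915 code.

#include <linux/dma-fence.h>
#include <linux/dma-fence-array.h>
#include <linux/err.h>
#include <linux/slab.h>

/*
 * Illustrative only: collapse the fences of the N requests that make up one
 * parallel submission into a single composite fence, e.g. for the exclusive
 * slot of a shared object.  The caller passes &rq->fence of each request.
 */
static struct dma_fence *
compose_parallel_fence(struct dma_fence **per_ctx_fences, unsigned int width)
{
	struct dma_fence **fences;
	struct dma_fence_array *array;
	unsigned int i;

	fences = kmalloc_array(width, sizeof(*fences), GFP_KERNEL);
	if (!fences)
		return ERR_PTR(-ENOMEM);

	/* The array takes over one reference per fence on success. */
	for (i = 0; i < width; i++)
		fences[i] = dma_fence_get(per_ctx_fences[i]);

	/* Signals only once every per-context request has completed. */
	array = dma_fence_array_create(width, fences,
				       dma_fence_context_alloc(1), 0, false);
	if (!array) {
		for (i = 0; i < width; i++)
			dma_fence_put(fences[i]);
		kfree(fences);
		return ERR_PTR(-ENOMEM);
	}

	return &array->base;
}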

Once the parallel submit series gets posted you will be able to see that it is
a uAPI context setup extension, an update to the execbuf IOCTL to accept N
batches (basically a for loop), and a GuC backend able to submit N batches at
once - the mid layers are almost completely untouched.

Lastly, if we need to support the parallel submit extension as proposed for
execlists, all we need to do is update the uAPI setup extension to configure
the contexts. If we create an 'i915_super_request' we would have a massive
rework in the execlist backend too.

> > > - Execbuf uses that parallel virtual engine to submit all N batchbuffers
> > >   in one go.
> > > 
> > 
> > If we expose 1 or N engines it doesn't really matter, does it? Either way the
> > entire GEM context is configured for N BBs in a single IOCTL.
> > 
> > > - This means we don't create virtual engines (or physical engine mappings)
> > >   for all the individual pieces in a parallel engine. That's a concept
> > >   from bonded submission, and I think that needs to go.
> > > 
> > 
> > Again this isn't strictly true - we need N internal backing structures.
> 
> I didn't check the code, but iirc you said for the GuC backend you do
> nothing until the last submit. Only then it's pushed into the GuC. That
> sounds a bit silly, and by treating parallel submission as a single thing
> (which might or might not be split in lower levels) this would go away.
>

We update internal state on each submit, the last submit is the one to interact
with the GuC.
 
> But it also might be way too much churn, because there's a bunch of places
> where we have to do this splitting. If it's all, then maybe just keeping
> the engines around everywhere makes sense.
> 
> But also this is leaking implementation details into uapi, from umd pov
> it's really 1 virtual engine that gets 1 execbuf call to submit N batches.
> Leaking that we treat it as N engines underneath feels like a mistake.
>

To be clear, changing from N slots to 1 slot isn't that big of a deal. Changing
from N i915_requests to 1 is a *huge* deal.

Going from N slots to 1 slot will just touch the uAPI setup extension and the
execbuf IOCTL.

Going from N i915_requests to 1 will ripple throughout the entire stack.
 
> > > - More important not having a parallel virtual engine breaks our already
> > >   badly confusing gem ctx api. Ignoring parallel/bonded submit the gem ctx
> > >   is just a container object, which points at a bunch of engines (plus the
> > >   VM and a few other things). Having parallel context something that sits
> > >   at the gem ctx level, and not as an individual engine (of which you can
> >   have multiple in the same gem ctx) breaks stuff. E.g. right now the perf api
> > >   sits at the gem ctx level, so that you can capture all the perf data for
> > >   an entire workload spawning across multiple engines. If a workload now
> > >   needs multiple parallel engines we'd need multiple gem ctx, which breaks
> > >   this.
> > 
> > This uAPI allows only 1 parallel context per gem context which isn't ideal. I'd
> > love to fix this and changing a context to a single slot might be able to fix
> > this.
> 
> Yeah this is essentially the main gripe I have with this. Everywhere else
> you submit to a (gem_ctx_id, engine_slot) pair. Except for parallel
> submit, where you submit to a gem_ctx_id and the engine slot doesn't
> matter. That's a rather unfortunate uapi.
> 

Yeah, this isn't ideal, but we've kinda backed ourselves into a corner here,
at least consistency wise.

As proposed we basically have 2 steps to configure a gem context:

1. Define placement rules (set_engines, set_load_balance)
2. Indicate this context is used for parallel submission (set_parallel)

What would the uAPI look like where each parallel context occupies a slot?

1. Define the number of slots (set_engines)
2. For each slot allow a virtual or parallel context (set_load_balance,
set_parallel)

The set_parallel would have to contain all the placement information for 2 to N
contexts, right? So each set_parallel is a chained extension too. Now we have a
two-level chain in our IOCTL.

e.g.

set_engines (3 slots)
  -> set_load_balance (slot 0)
  -> set_parallel (slot 1)
       -> placement for context 0 -> placement for context 1 -> ...
  -> set_load_balance (slot 2)

IMO this seems like a bigger mess but I suppose it would work.
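
For reference, the as-proposed flow above (steps 1 and 2) would look roughly
like this from userspace. This is only a sketch: the parallel-submit struct and
its extension/flag values are the ones proposed in this series rather than
final uAPI, the VCS instances are arbitrary, and error handling is omitted.

#include <stdint.h>
#include <sys/ioctl.h>
#include <drm/i915_drm.h>

/* As proposed in this series; not yet in the installed i915_drm.h. */
struct i915_context_engines_parallel_submit {
	struct i915_user_extension base;
	__u64 flags;
	__u64 mbz64[4];
} __attribute__((packed));
#define I915_CONTEXT_ENGINES_EXT_PARALLEL_SUBMIT 2
#define I915_PARALLEL_IMPLICT_BONDS (1 << 0)

static int create_parallel_ctx(int fd, uint32_t *ctx_id)
{
	/* Step 2: mark the whole engine set as one parallel submission. */
	struct i915_context_engines_parallel_submit parallel = {
		.base.name = I915_CONTEXT_ENGINES_EXT_PARALLEL_SUBMIT,
		.flags = I915_PARALLEL_IMPLICT_BONDS,
	};
	/* Step 1: placement rules, one load-balance extension per slot. */
	I915_DEFINE_CONTEXT_ENGINES_LOAD_BALANCE(balance1, 2) = {
		.base.name = I915_CONTEXT_ENGINES_EXT_LOAD_BALANCE,
		.base.next_extension = (uintptr_t)&parallel.base,
		.engine_index = 1,
		.num_siblings = 2,
		.engines = { { I915_ENGINE_CLASS_VIDEO, 1 },
			     { I915_ENGINE_CLASS_VIDEO, 3 } },
	};
	I915_DEFINE_CONTEXT_ENGINES_LOAD_BALANCE(balance0, 2) = {
		.base.name = I915_CONTEXT_ENGINES_EXT_LOAD_BALANCE,
		.base.next_extension = (uintptr_t)&balance1.base,
		.engine_index = 0,
		.num_siblings = 2,
		.engines = { { I915_ENGINE_CLASS_VIDEO, 0 },
			     { I915_ENGINE_CLASS_VIDEO, 2 } },
	};
	/* Two engine slots, both placeholders filled in by load-balance. */
	I915_DEFINE_CONTEXT_PARAM_ENGINES(engines, 2) = {
		.extensions = (uintptr_t)&balance0.base,
		.engines = {
			{ I915_ENGINE_CLASS_INVALID, I915_ENGINE_CLASS_INVALID_NONE },
			{ I915_ENGINE_CLASS_INVALID, I915_ENGINE_CLASS_INVALID_NONE },
		},
	};
	struct drm_i915_gem_context_create_ext_setparam set_engines = {
		.base.name = I915_CONTEXT_CREATE_EXT_SETPARAM,
		.param = {
			.param = I915_CONTEXT_PARAM_ENGINES,
			.value = (uintptr_t)&engines,
			.size = sizeof(engines),
		},
	};
	struct drm_i915_gem_context_create_ext create = {
		.flags = I915_CONTEXT_CREATE_FLAGS_USE_EXTENSIONS,
		.extensions = (uintptr_t)&set_engines.base,
	};

	if (ioctl(fd, DRM_IOCTL_I915_GEM_CONTEXT_CREATE_EXT, &create))
		return -1;
	*ctx_id = create.ctx_id;
	return 0;
}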

As you say below gem contexts are moving towards just being containers so does
it really matter if a UMD has to create a gem context per parallel context?
They still can share address space, pass fences between them, etc...

If we really want to go the direction of 1 slot per parallel context I can hack
up a PoC branch when I have time. 

Matt

> Now with bonded submit this made some sense (not that bonded submit itself
> made much sense), since you did indeed submit N batchbuffers to N
> (gem_ctx_id, engine_slot) pairs. But with parallel submit it's really just
> one execbuf call.
> 
> > > So what I'd expect we'd have here is roughly:
> > > 
> > > struct i915_context_engines_parallel_submit {
> > > 	struct i915_user_extension base;
> > > 	__u64 flags;
> > > 	__u32 num_engines; /* N, must match what we submit in the execbuf */
> > > 	__u32 num_siblings; /* M, I'm assuming it's ok we require that siblings must match across the entire set of parallel engines */
> > > 	struct engine_info[]; /* NxM array of engine infos, pls fill in the right struct name :-) */
> > > };
> > > 
> > > If we then also require that you always submit the full width of N
> > > batchbuffers then even the execbuf extension doesn't need to exist
> > > anymore, because the virtual parallel engine already contains all the
> > > needed information.
> > > 
> > > And sure for some backends at least (definitely execlist) we'd need to
> > > create a bunch of additional virtual engines behind that virtual engine.
> > > But they'd be entirely hidden, and not visible to userspace nor the higher
> > > levels.
> > >
> > > What am I missing?
> > 
> > Not really, I think you got it. I think at the end of the day this really comes
> > down to whether we want to allow more than 1 parallel virtual engine per gem context. If
> > the answer is yes we collapse a parallel virtual engine into a single slot, if
> > not we leave as is.
> 
> Yup. So right now media uses one gem context per engine they need. Since
> media doesn't care about perf/OA they could get shared VM by sharing the
> VM across gem ctx, which they already do. So probably we could get away if
> we leave parallel engines as a gem ctx level thing.
> 
> Also on the media-driver code the impact is nil since it's just a
> different chain of context extensions in the same ioctl call.
> 
> Bigger picture is that Jason is quite unhappy with our gem ctx based
> uapi, and his long term idea is to make gem ctx into a pure container
> object with pointers to engines and a vm. And not something that has
> relevance itself. Currently that's not the case for perf/OA, which works
> on the gem ctx, and Jason's already unhappy about that one. So adding more
> stuff on the gem ctx level feels a bit like a mistake.
> 
> Cheers, Daniel
> 
> > 
> > Matt
> > 
> > > -Daniel
> > > 
> > > >  #define I915_DEFINE_CONTEXT_PARAM_ENGINES(name__, N__) struct { \
> > > >  	__u64 extensions; \
> > > >  	struct i915_engine_class_instance engines[N__]; \
> > > > -- 
> > > > 2.28.0
> > > > 
> > > > _______________________________________________
> > > > Intel-gfx mailing list
> > > > Intel-gfx@lists.freedesktop.org
> > > > https://lists.freedesktop.org/mailman/listinfo/intel-gfx
> > > 
> > > -- 
> > > Daniel Vetter
> > > Software Engineer, Intel Corporation
> > > http://blog.ffwll.ch
> 
> -- 
> Daniel Vetter
> Software Engineer, Intel Corporation
> http://blog.ffwll.ch
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [Intel-gfx] [RFC PATCH 4/5] drm/i915: Introduce 'set parallel submit' extension
  2021-05-14 20:05           ` Matthew Brost
@ 2021-05-17 13:55             ` Daniel Vetter
  -1 siblings, 0 replies; 41+ messages in thread
From: Daniel Vetter @ 2021-05-17 13:55 UTC (permalink / raw)
  To: Matthew Brost
  Cc: intel-gfx, dri-devel, carl.zhang, jason.ekstrand, daniel.vetter

On Fri, May 14, 2021 at 01:05:33PM -0700, Matthew Brost wrote:
> On Wed, May 12, 2021 at 10:34:59AM +0200, Daniel Vetter wrote:
> > On Tue, May 11, 2021 at 11:44:28AM -0700, Matthew Brost wrote:
> > > On Tue, May 11, 2021 at 05:11:44PM +0200, Daniel Vetter wrote:
> > > > On Thu, May 06, 2021 at 10:30:48AM -0700, Matthew Brost wrote:
> > > > > i915_drm.h updates for 'set parallel submit' extension.
> > > > > 
> > > > > Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> > > > > Cc: Tony Ye <tony.ye@intel.com>
> > > > > CC: Carl Zhang <carl.zhang@intel.com>
> > > > > Cc: Daniel Vetter <daniel.vetter@intel.com>
> > > > > Cc: Jason Ekstrand <jason@jlekstrand.net>
> > > > > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > > > > ---
> > > > >  include/uapi/drm/i915_drm.h | 126 ++++++++++++++++++++++++++++++++++++
> > > > >  1 file changed, 126 insertions(+)
> > > > > 
> > > > > diff --git a/include/uapi/drm/i915_drm.h b/include/uapi/drm/i915_drm.h
> > > > > index 26d2e135aa31..0175b12b33b8 100644
> > > > > --- a/include/uapi/drm/i915_drm.h
> > > > > +++ b/include/uapi/drm/i915_drm.h
> > > > > @@ -1712,6 +1712,7 @@ struct drm_i915_gem_context_param {
> > > > >   * Extensions:
> > > > >   *   i915_context_engines_load_balance (I915_CONTEXT_ENGINES_EXT_LOAD_BALANCE)
> > > > >   *   i915_context_engines_bond (I915_CONTEXT_ENGINES_EXT_BOND)
> > > > > + *   i915_context_engines_parallel_submit (I915_CONTEXT_ENGINES_EXT_PARALLEL_SUBMIT)
> > > > 
> > > > Hm just realized, but I don't think this hyperlinks correctly, and I'm
> > > > also not sure this formats very well as a nice list. Using item lists
> > > > should look pretty nice like we're doing for the various kms properties,
> > > > e.g.
> > > > 
> > > > FOO:
> > > >   Explain what FOO does
> > > > 
> > > > BAR:
> > > >   Explain what BAR does. struct bar also automatically generates a link
> > > > 
> > > > Please check with make htmldocs and polish this a bit (might need a small
> > > > prep patch).
> > > > 
> > > 
> > > I agree the doc should look nice. To get there I might need to chat with you on
> > > IRC as I'm new to this. 
> > > 
> > > > >   */
> > > > >  #define I915_CONTEXT_PARAM_ENGINES	0xa
> > > > >  
> > > > > @@ -1894,9 +1895,134 @@ struct i915_context_param_engines {
> > > > >  	__u64 extensions; /* linked chain of extension blocks, 0 terminates */
> > > > >  #define I915_CONTEXT_ENGINES_EXT_LOAD_BALANCE 0 /* see i915_context_engines_load_balance */
> > > > >  #define I915_CONTEXT_ENGINES_EXT_BOND 1 /* see i915_context_engines_bond */
> > > > > +#define I915_CONTEXT_ENGINES_EXT_PARALLEL_SUBMIT 2 /* see i915_context_engines_parallel_submit */
> > > > >  	struct i915_engine_class_instance engines[0];
> > > > >  } __attribute__((packed));
> > > > >  
> > > > > +/*
> > > > > + * i915_context_engines_parallel_submit:
> > > > > + *
> > > > > + * Setup a gem context to allow multiple BBs to be submitted in a single execbuf
> > > > > + * IOCTL. Those BBs will then be scheduled to run on the GPU in parallel.
> > > > > + *
> > > > > + * All hardware contexts in the engine set are configured for parallel
> > > > > + * submission (i.e. once this gem context is configured for parallel submission,
> > > > > + * all the hardware contexts, regardless if a BB is available on each individual
> > > > > + * context, will be submitted to the GPU in parallel). A user can submit BBs to
> > > > > + * subset of the hardware contexts, in a single execbuf IOCTL, but it is not
> > > > > + * recommended as it may reserve physical engines with nothing to run on them.
> > > > > + * Highly recommended to configure the gem context with N hardware contexts then
> > > > > + * always submit N BBs in a single IOCTL.
> > > > > + *
> > > > > + * There are two currently defined ways to control the placement of the
> > > > > + * hardware contexts on physical engines: default behavior (no flags) and
> > > > > + * I915_PARALLEL_IMPLICT_BONDS (a flag). More flags may be added in the
> > > > > + * future as new hardware / use cases arise. Details of how to use this
> > > > > + * interface are described below, above the flags.
> > > > > + *
> > > > > + * Returns -EINVAL if hardware context placement configuration invalid or if the
> > > > > + * placement configuration isn't supported on the platform / submission
> > > > > + * interface.
> > > > > + * Returns -ENODEV if extension isn't supported on the platform / submission
> > > > > + * interface.
> > > > > + */
> > > > > +struct i915_context_engines_parallel_submit {
> > > > > +	struct i915_user_extension base;
> > > > 
> > > > Ok this is good, since it makes sure we can't possibly use this in
> > > > CTX_SETPARAM.
> > > > 
> > > 
> > > Yep, this is at context creation time. Technically you still can call this over
> > > and over on the same gem context but Jason is taking that ability away I
> > > believe. I've also told the media team to setup the context once and don't touch
> > > it again.
> > 
> > Only if you base your context param on drm_i915_gem_context_param, which
> > can be used both at create time with
> > drm_i915_gem_context_create_ext_setparam and with the CTX_SETPARAM ioctl.
> > But you don't, so this issue is fixed at the uapi design and doesn't need
> > to interface with Jason's proto-ctx rework much.
> > 
> > There's still going to be some conflicts, so maybe ask Jason for a branch
> > and rebase GuC on top of that for the next round.
> > 
> 
> Certainly this new uAPI is going to conflict. The basic GuC submission code
> shouldn't though, as it doesn't touch the uAPI code at all. By the time the new
> uAPI is posted I'd hope Jason's proto-ctx rework has landed, and we will rebase
> onto the tip of DRM then.

Ah yes. Another good reason to split that up into two parts, like we've
already planned to.

> > > > > +
> > > > > +/*
> > > > > + * Default placement behavior (currently unsupported):
> > > > > + *
> > > > > + * Rather than restricting parallel submission to a single class with a
> > > > > + * logically contiguous placement (I915_PARALLEL_IMPLICT_BONDS), add a mode that
> > > > > + * enables parallel submission across multiple engine classes. In this case each
> > > > > + * context's logical engine mask indicates where that context can be placed. It
> > > > > + * is implied in this mode that all contexts have mutually exclusive placement
> > > > > + * (e.g. if one context is running on CS0 no other contexts can run on CS0).
> > > > > + *
> > > > > + * Example 1 pseudo code:
> > > > > + * CSX[Y] = engine class X, logical instance Y
> > > > > + * INVALID = I915_ENGINE_CLASS_INVALID, I915_ENGINE_CLASS_INVALID_NONE
> > > > > + * set_engines(INVALID, INVALID)
> > > > > + * set_load_balance(engine_index=0, num_siblings=2, engines=CS0[0],CS0[1])
> > > > > + * set_load_balance(engine_index=1, num_siblings=2, engines=CS1[0],CS1[1])
> > > > > + * set_parallel()
> > > > > + *
> > > > > + * Results in the following valid placements:
> > > > > + * CS0[0], CS1[0]
> > > > > + * CS0[0], CS1[1]
> > > > > + * CS0[1], CS1[0]
> > > > > + * CS0[1], CS1[1]
> > > > > + *
> > > > > + * Example 2 pseudo code:
> > > > > + * CS[X] = generic engine of same class, logical instance X
> > > > > + * INVALID = I915_ENGINE_CLASS_INVALID, I915_ENGINE_CLASS_INVALID_NONE
> > > > > + * set_engines(INVALID, INVALID)
> > > > > + * set_load_balance(engine_index=0, num_siblings=3, engines=CS[0],CS[1],CS[2])
> > > > > + * set_load_balance(engine_index=1, num_siblings=3, engines=CS[0],CS[1],CS[2])
> > > > > + * set_parallel()
> > > > > + *
> > > > > + * Results in the following valid placements:
> > > > > + * CS[0], CS[1]
> > > > > + * CS[0], CS[2]
> > > > > + * CS[1], CS[0]
> > > > > + * CS[1], CS[2]
> > > > > + * CS[2], CS[0]
> > > > > + * CS[2], CS[1]
> > > > > + *
> > > > > + * This enables a use case where all engines are created equally, we don't care
> > > > > + * where they are scheduled, we just want a certain number of resources, for
> > > > > + * those resources to be scheduled in parallel, and possibly across multiple
> > > > > + * engine classes.
> > > > > + */
> > > > > +
> > > > > +/*
> > > > > + * I915_PARALLEL_IMPLICT_BONDS - Create implicit bonds between each context.
> > > > > + * Each context must have the same number of siblings, and bonds are implicitly
> > > > > + * created between the siblings.
> > > > > + *
> > > > > + * All of the below examples are in logical space.
> > > > > + *
> > > > > + * Example 1 pseudo code:
> > > > > + * CS[X] = generic engine of same class, logical instance X
> > > > > + * set_engines(CS[0], CS[1])
> > > > > + * set_parallel(flags=I915_PARALLEL_IMPLICT_BONDS)
> > > > > + *
> > > > > + * Results in the following valid placements:
> > > > > + * CS[0], CS[1]
> > > > > + *

> > > > > + * CS[X] = generic engine of same class, logical instance X
> > > > > + * INVALID = I915_ENGINE_CLASS_INVALID, I915_ENGINE_CLASS_INVALID_NONE
> > > > > + * set_engines(INVALID, INVALID)
> > > > > + * set_load_balance(engine_index=0, num_siblings=2, engines=CS[0],CS[2])
> > > > > + * set_load_balance(engine_index=1, num_siblings=2, engines=CS[1],CS[3])
> > > > > + * set_parallel(flags=I915_PARALLEL_IMPLICT_BONDS)
> > > > > + *
> > > > > + * Results in the following valid placements:
> > > > > + * CS[0], CS[1]
> > > > > + * CS[2], CS[3]
> > > > > + *
> > > > > + * This enables a use case where all engines are not equal and certain placement
> > > > > + * rules are required (i.e. split-frame requires all contexts to be placed in a
> > > > > + * logically contiguous order on the VCS engines on gen11+ platforms). This use
> > > > > + * case (logically contiguous placement, within a single engine class) is
> > > > > + * supported when using GuC submission. Execlist mode could support all possible
> > > > > + * bonding configurations but currently doesn't support this extension.
> > > > > + */
> > > > > +#define I915_PARALLEL_IMPLICT_BONDS		(1<<0)
> > > > > +/*
> > > > > + * Do not allow BBs to be preempted mid BB rather insert coordinated preemption
> > > > > + * points on all hardware contexts between each set of BBs. An example use case
> > > > > + * of this feature is split-frame on gen11+ hardware. When using this feature a
> > > > > + * BB must be submitted on each hardware context in the parallel gem context.
> > > > > + * The execbuf2 IOCTL enforces the user adheres to policy.
> > > > > + */
> > > > > +#define I915_PARALLEL_NO_PREEMPT_MID_BATCH	(1<<1)
> > > > > +#define I915_PARALLEL_UNKNOWN_FLAGS  (-(I915_PARALLEL_NO_PREEMPT_MID_BATCH << 1))
> > > > > +	__u64 flags; /* all undefined flags must be zero */
> > > > > +	__u64 mbz64[4]; /* reserved for future use; must be zero */
> > > > > +} __attribute__ ((packed));
> > > > 
> > > > Ok I'm having some serious questions. This looks way too much like it's
> > > > inspired by bonded submission, and given we're tossing bonded submission
> > > > we need to make sure we're doing this for good independent reasons and not
> > > > just for inertia.
> > > > 
> > > 
> > > You are not wrong here, the bonding submission interface was a factor in
> > > designing this interface.
> > > 
> > > > What I expected looking at how media-driver uses bonded submit currently
> > > > is:
> > > > 
> > > > - We create a parallel submit engine, which occupies a virtual engine
> > > >   slot. This parallel virtual engine contains all the information we need,
> > > >   i.e. the flags you have above, but also how many engines run in parallel
> > > >   and how each of those can be load-balanced. So probably a full NxM
> > > >   matrix of physical engines needed.
> > > > 
> > > 
> > > Internally we need all this information broken out into individual structures,
> > > at least with the current implementation. We need N ring buffers, N timelines, N
> > > LRCs, N HWSPs, etc... All of this is encapsulated by a 'struct intel_context'
> > > which occupies a slot. Could we create a super object with N 'struct
> > > intel_context', sure. I'm just not sure what that buys us and IMO creates an
> > > inconsistent uAPI.
> > 
> > So if the implementation is too much work to adapt, here's a really nasty
> > trick: Currently we limit the engine slots to 64 in a gem context, because
> > that's the limit of the execbuf field. We could use the engine slots above
> > that for all these additional intel_context that we need underneath, at
> > least for execlist. Does GuC need them all too?
> > 
> > But a clean approach would be to have an intel_parallel_engine struct which
> > has all these pointers internally I think.
> > 
> > Same on the high-level execbuf flow, doing all that N times is silly. So
> > again I'd assume there's one overall i915_request that tracks the parallel
> > submission, and then maybe N subordinate i915_request for each piece
> > (execlist backend definitely needs those for scheduling, I didn't check
> > about GuC).
> > 
> > Also drm/scheduler only deals with a single thing too, so that way the
> > high level code would never need to know that there's actually N things
> > underneath doing the job.
> >
> 
> Again, each i915_request points to a single (and different) intel_context,
> timeline, LRC, ring, seqno, etc... The whole stack really treats these as
> individual things aside from the excl slot where we form a composite fence. Not
> saying we couldn't change this over time, but initially creating an
> 'i915_super_request' would be quite the undertaking, very invasive to the mid
> layers of the stack, and I'm not sure in the end what it buys us.
> 
> Once the parallel submit series gets posted you will be able to see that it is
> a uAPI context setup extension, an update to the execbuf IOCTL to accept N
> batches (basically a for loop), and a GuC backend able to submit N batches at
> once - the mid layers are almost completely untouched.
> 
> Lastly, if we need to support the parallel submit extension as proposed for
> execlists, all we need to do is update the uAPI setup extension to configure
> the contexts. If we create an 'i915_super_request' we would have a massive
> rework in the execlist backend too.

Yeah I'm fully aware that the current codebase puts us in a very awkward
corner. But designing uapi by exposing whatever we have internally
right now is also not a good idea.

That's why I've suggested the idea to make the uapi use a single uapi
engine on the gem context, and (for now at least) internally fake it all.
Including the glorious for() loop over everything in execbuf.
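
As a toy illustration of that (every type and function below is made up purely
for illustration, this is not the actual i915 code): one user-visible parallel
engine, one execbuf call, and an internal loop fanning the N batches out to
hidden per-batch contexts.

/* Toy illustration only: none of these types exist in i915. */
struct child_engine { int hw_id; };

struct parallel_engine {
	unsigned int width;			/* N batches per execbuf */
	struct child_engine *children[8];	/* hidden from userspace */
};

struct batch { unsigned long long gpu_addr; };

static void queue_on_child(struct child_engine *ce, struct batch *b)
{
	/* stand-in for building and queueing a request on one HW context */
	(void)ce;
	(void)b;
}

/* One user-visible engine, one execbuf call, N batches fanned out. */
static int submit_parallel(struct parallel_engine *pe,
			   struct batch *batches, unsigned int nbatch)
{
	unsigned int i;

	if (nbatch != pe->width)
		return -1;	/* require the full width, as discussed */

	for (i = 0; i < nbatch; i++)
		queue_on_child(pe->children[i], &batches[i]);

	return 0;
}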

> > > > - Execbuf uses that parallel virtual engine to submit all N batchbuffers
> > > >   in one go.
> > > > 
> > > 
> > > If we expose 1 or N engines it doesn't really matter, does it? Either way the
> > > entire GEM context is configured for N BBs in a single IOCTL.
> > > 
> > > > - This means we don't create virtual engines (or physical engine mappings)
> > > >   for all the individual pieces in a parallel engine. That's a concept
> > > >   from bonded submission, and I think that needs to go.
> > > > 
> > > 
> > > Again this isn't strictly true - we need N internal backing structures.
> > 
> > I didn't check the code, but iirc you said for the GuC backend you do
> > nothing until the last submit. Only then it's pushed into the GuC. That
> > sounds a bit silly, and by treating parallel submission as a single thing
> > (which might or might not be split in lower levels) this would go away.
> >
> 
> We update internal state on each submit, the last submit is the one to interact
> with the GuC.

Sounds very much like sunk cost fallacy driven implementation design, but
oh well.

> > But it also might be way too much churn, because there's a bunch of places
> > where we have to do this splitting. If it's all, then maybe just keeping
> > the engines around everywhere makes sense.
> > 
> > But also this is leaking implementation details into uapi, from umd pov
> > it's really 1 virtual engine that gets 1 execbuf call to submit N batches.
> > Leaking that we treat it as N engines underneath feels like a mistake.
> >
> 
> To be clear, changing from N slots to 1 slot isn't that big of a deal. Changing
> from N i915_requests to 1 is a *huge* deal.
> 
> Going from N slots to 1 slot will just touch the uAPI setup extension and the
> execbuf IOCTL.
> 
> Going from N i915_requests to 1 will ripple throughout the entire stack.

Yeah I think going to 1 i915_request is something we need to postpone and
decide later on whether it makes sense or not.

But making sure the uapi isn't putting roadblocks in the way is something
we need to fix now. And I do think in a clean slate world, ignoring all
the code we have and especially the current midlayer and execlist backend
code, a single ctx/request/execbuf is the right design here. Or well,
would have been.

Except if you now tell me that GuC actually wants N submissions - but my
understanding is it really just wants 1, and the only fan-out we have to
do is plug the right N batchbuffers into the right N LRCs of the overall
GuC multi-LRC context. But that doesn't seem to be the case.

> > > > - More important not having a parallel virtual engine breaks our already
> > > >   badly confusing gem ctx api. Ignoring parallel/bonded submit the gem ctx
> > > >   is just a container object, which points at a bunch of engines (plus the
> > > >   VM and a few other things). Having parallel context something that sits
> > > >   at the gem ctx level, and not as an individual engine (of which you can
> > > >   have multiple in the same gem ctx) breaks stuff. E.g. right now the perf api
> > > >   sits at the gem ctx level, so that you can capture all the perf data for
> > > >   an entire workload spawning across multiple engines. If a workload now
> > > >   needs multiple parallel engines we'd need multiple gem ctx, which breaks
> > > >   this.
> > > 
> > > This uAPI allows only 1 parallel context per gem context which isn't ideal. I'd
> > > love to fix this and changing a context to a single slot might be able to fix
> > > this.
> > 
> > Yeah this is essentially the main gripe I have with this. Everywhere else
> > you submit to a (gem_ctx_id, engine_slot) pair. Except for parallel
> > submit, where you submit to a gem_ctx_id and the engine slot doesn't
> > matter. That's a rather unfortunate uapi.
> > 
> 
> Yeah, this isn't ideal, but we've kinda backed ourselves into a corner here,
> at least consistency wise.
> 
> As proposed we basically have 2 steps to configure a gem context:
> 
> 1. Define placement rules (set_engines, set_load_balance)
> 2. Indicate this context is used for parallel submission (set_parallel)
> 
> What would the uAPI look like where each parallel context occupies a slot?
> 
> 1. Define the number of slots (set_engines)
> 2. For each slot allow a virtual or parallel context (set_load_balance,
> set_parallel)
> 
> The set_parallel would have to contain all the placement information for 2 to N
> contexts, right? So each set_parallel is a chained extension too. Now we have a
> two-level chain in our IOCTL.
> 
> e.g.
> 
> set_engines (3 slots)
>   -> set_load_balance (slot 0)
>   -> set_parallel (slot 1)
>        -> placement for context 0 -> placement for context 1 -> ...
>   -> set_load_balance (slot 2)
> 
> IMO this seems like a bigger mess but I suppose it would work.

This sounds a bit like overengineering. All we need for the parallel
virtual engines are a bunch of parameters (engine slot, num_slots,
num_siblings) and then a num_slots X num_siblings array at the end with
the placements for all combinations.

i915_user_extensions already allow for that array at the end (that's why
extensions are chained with pointers, not by being one-after-the-other in
an array), so this already works as-is. See e.g. how set_load_balance or
set_engines work, they already do that.

So purely from an uapi pov I'm not seeing the trouble?
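
Filling in such an extension from userspace could then look roughly like the
sketch below. None of this is existing uAPI: the struct mirrors the layout
sketched further down in this mail, the placements reuse the split-frame
example from the patch, and the extension name would still need to be
allocated.

#include <drm/i915_drm.h>

/*
 * Sketch of the proposed single-slot parallel extension: one engine slot
 * describes the whole parallel submission, with a num_engines x num_siblings
 * placement array at the end.  Not existing uAPI.
 */
struct i915_context_engines_parallel_submit_sketch {
	struct i915_user_extension base;
	__u64 flags;
	__u32 num_engines;	/* N: batches per execbuf */
	__u32 num_siblings;	/* M: possible placements per batch */
	struct i915_engine_class_instance engines[2 * 2]; /* N x M, row-major */
};

/* N = 2 batches, each with M = 2 possible placements. */
static const struct i915_context_engines_parallel_submit_sketch example = {
	/* .base.name would be a newly allocated I915_CONTEXT_ENGINES_EXT_* */
	.num_engines = 2,
	.num_siblings = 2,
	.engines = {
		/* batch 0 may run on VCS0 or VCS2 ... */
		{ I915_ENGINE_CLASS_VIDEO, 0 }, { I915_ENGINE_CLASS_VIDEO, 2 },
		/* ... while batch 1 runs on VCS1 or VCS3 */
		{ I915_ENGINE_CLASS_VIDEO, 1 }, { I915_ENGINE_CLASS_VIDEO, 3 },
	},
};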

> As you say below gem contexts are moving towards just being containers so does
> it really matter if a UMD has to create a gem context per parallel context?
> They still can share address space, pass fences between them, etc...

Mostly my concern is that for everything else we execute stuff on an
intel_context and track it with an i915_request. Except for parallel
submission, where we have N-1 fake intel_contexts and fake i915_requests,
and only the last one of each is actually triggering submission.

That's awkward code design at best, and I'd like us to at least keep the
option to fix it in the future.

Of course we can fix it all without changing the uapi, like Jason is doing
with the proto ctx rework. But that always comes at the cost of code and
complexity, which isn't strictly needed here I think.

> If we really want to go the direction of 1 slot per parallel context I can hack
> up a PoC branch when I have time. 

I think the only impact of the minimal plan would be:

- slight change to the media-driver setup function, but that's already
  fully encapsulated (at least from a cursory look)

- creating a pile of fake contexts, at engine offsets userspace can't see,
  so that the current execbuf code and scheduler midlayers don't panic

- some changes in the execbuf code to get at all the fake contexts and
  figure out how many batchbuffers we need

None of this should be gigantic, but it keeps the door open for us to fix
the internals properly, without having to carry a compat layer around
forever.

Thoughts? If you think "not worth at all" and Jason concurs then I'm happy
to let this one slide.

Cheers, Daniel

> 
> Matt
> 
> > Now with bonded submit this made some sense (not that bonded submit itself
> > made much sense), since you did indeed submit N batchbuffers to N
> > (gem_ctx_id, engine_slot) pairs. But with parallel submit it's really just
> > one execbuf call.
> > 
> > > > So what I'd expect we'd have here is roughly:
> > > > 
> > > > struct i915_context_engines_parallel_submit {
> > > > 	struct i915_user_extension base;
> > > > 	__u64 flags;
> > > > 	__u32 num_engines; /* N, must match what we submit in the execbuf */
> > > > 	__u32 num_siblings; /* M, I'm assuming it's ok we require that siblings must match across the entire set of parallel engines */
> > > > 	struct engine_info[]; /* NxM array of engine infos, pls fill in the right struct name :-) */
> > > > };
> > > > 
> > > > If we then also require that you always submit the full width of N
> > > > batchbuffers then even the execbuf extension doesn't need to exist
> > > > anymore, because the virtual parallel engine already contains all the
> > > > needed information.
> > > > 
> > > > And sure for some backends at least (definitely execlist) we'd need to
> > > > create a bunch of additional virtual engines behind that virtual engine.
> > > > But they'd be entirely hidden, and not visible to userspace nor the higher
> > > > levels.
> > > >
> > > > What am I missing?
> > > 
> > > Not really, I think you got it. I think at the end of the day this really comes
> > > down to whether we want to allow more than 1 parallel virtual engine per gem context. If
> > > the answer is yes we collapse a parallel virtual engine into a single slot, if
> > > not we leave as is.
> > 
> > Yup. So right now media uses one gem context per engine they need. Since
> > media doesn't care about perf/OA they could get shared VM by sharing the
> > VM across gem ctx, which they already do. So probably we could get away if
> > we leave parallel engines as a gem ctx level thing.
> > 
> > Also on the media-driver code the impact is nil since it's just a
> > different chain of context extensions in the same ioctl call.
> > 
> > Bigger picture is that Jason is quite unhappy with our gem ctx based
> > uapi, and his long term idea is to make gem ctx into a pure container
> > object with pointers to engines and a vm. And not something that has
> > relevance itself. Currently that's not the case for perf/OA, which works
> > on the gem ctx, and Jason's already unhappy about that one. So adding more
> > stuff on the gem ctx level feels a bit like a mistake.
> > 
> > Cheers, Daniel
> > 
> > > 
> > > Matt
> > > 
> > > > -Daniel
> > > > 
> > > > >  #define I915_DEFINE_CONTEXT_PARAM_ENGINES(name__, N__) struct { \
> > > > >  	__u64 extensions; \
> > > > >  	struct i915_engine_class_instance engines[N__]; \
> > > > > -- 
> > > > > 2.28.0
> > > > > 
> > > > > _______________________________________________
> > > > > Intel-gfx mailing list
> > > > > Intel-gfx@lists.freedesktop.org
> > > > > https://lists.freedesktop.org/mailman/listinfo/intel-gfx
> > > > 
> > > > -- 
> > > > Daniel Vetter
> > > > Software Engineer, Intel Corporation
> > > > http://blog.ffwll.ch
> > 
> > -- 
> > Daniel Vetter
> > Software Engineer, Intel Corporation
> > http://blog.ffwll.ch

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [Intel-gfx] [RFC PATCH 4/5] drm/i915: Introduce 'set parallel submit' extension
@ 2021-05-17 13:55             ` Daniel Vetter
  0 siblings, 0 replies; 41+ messages in thread
From: Daniel Vetter @ 2021-05-17 13:55 UTC (permalink / raw)
  To: Matthew Brost
  Cc: intel-gfx, dri-devel, carl.zhang, jason.ekstrand, daniel.vetter

On Fri, May 14, 2021 at 01:05:33PM -0700, Matthew Brost wrote:
> On Wed, May 12, 2021 at 10:34:59AM +0200, Daniel Vetter wrote:
> > On Tue, May 11, 2021 at 11:44:28AM -0700, Matthew Brost wrote:
> > > On Tue, May 11, 2021 at 05:11:44PM +0200, Daniel Vetter wrote:
> > > > On Thu, May 06, 2021 at 10:30:48AM -0700, Matthew Brost wrote:
> > > > > i915_drm.h updates for 'set parallel submit' extension.
> > > > > 
> > > > > Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> > > > > Cc: Tony Ye <tony.ye@intel.com>
> > > > > CC: Carl Zhang <carl.zhang@intel.com>
> > > > > Cc: Daniel Vetter <daniel.vetter@intel.com>
> > > > > Cc: Jason Ekstrand <jason@jlekstrand.net>
> > > > > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > > > > ---
> > > > >  include/uapi/drm/i915_drm.h | 126 ++++++++++++++++++++++++++++++++++++
> > > > >  1 file changed, 126 insertions(+)
> > > > > 
> > > > > diff --git a/include/uapi/drm/i915_drm.h b/include/uapi/drm/i915_drm.h
> > > > > index 26d2e135aa31..0175b12b33b8 100644
> > > > > --- a/include/uapi/drm/i915_drm.h
> > > > > +++ b/include/uapi/drm/i915_drm.h
> > > > > @@ -1712,6 +1712,7 @@ struct drm_i915_gem_context_param {
> > > > >   * Extensions:
> > > > >   *   i915_context_engines_load_balance (I915_CONTEXT_ENGINES_EXT_LOAD_BALANCE)
> > > > >   *   i915_context_engines_bond (I915_CONTEXT_ENGINES_EXT_BOND)
> > > > > + *   i915_context_engines_parallel_submit (I915_CONTEXT_ENGINES_EXT_PARALLEL_SUBMIT)
> > > > 
> > > > Hm just relalized, but I don't think this hyperlinsk correctly, and I'm
> > > > also not sure this formats very well as a nice list. Using item lists
> > > > should look pretty nice like we're doing for the various kms properties,
> > > > e.g.
> > > > 
> > > > FOO:
> > > >   Explain what FOO does
> > > > 
> > > > BAR:
> > > >   Explain what BAR does. struct bar also automatically generates a link
> > > > 
> > > > Please check with make htmldocs and polish this a bit (might need a small
> > > > prep patch).
> > > > 
> > > 
> > > I agree the doc should look nice. To get there I might need to chat with you on
> > > IRC as I'm new to this. 
> > > 
> > > > >   */
> > > > >  #define I915_CONTEXT_PARAM_ENGINES	0xa
> > > > >  
> > > > > @@ -1894,9 +1895,134 @@ struct i915_context_param_engines {
> > > > >  	__u64 extensions; /* linked chain of extension blocks, 0 terminates */
> > > > >  #define I915_CONTEXT_ENGINES_EXT_LOAD_BALANCE 0 /* see i915_context_engines_load_balance */
> > > > >  #define I915_CONTEXT_ENGINES_EXT_BOND 1 /* see i915_context_engines_bond */
> > > > > +#define I915_CONTEXT_ENGINES_EXT_PARALLEL_SUBMIT 2 /* see i915_context_engines_parallel_submit */
> > > > >  	struct i915_engine_class_instance engines[0];
> > > > >  } __attribute__((packed));
> > > > >  
> > > > > +/*
> > > > > + * i915_context_engines_parallel_submit:
> > > > > + *
> > > > > + * Setup a gem context to allow multiple BBs to be submitted in a single execbuf
> > > > > + * IOCTL. Those BBs will then be scheduled to run on the GPU in parallel.
> > > > > + *
> > > > > + * All hardware contexts in the engine set are configured for parallel
> > > > > + * submission (i.e. once this gem context is configured for parallel submission,
> > > > > + * all the hardware contexts, regardless if a BB is available on each individual
> > > > > + * context, will be submitted to the GPU in parallel). A user can submit BBs to
> > > > > + * subset of the hardware contexts, in a single execbuf IOCTL, but it is not
> > > > > + * recommended as it may reserve physical engines with nothing to run on them.
> > > > > + * Highly recommended to configure the gem context with N hardware contexts then
> > > > > + * always submit N BBs in a single IOCTL.
> > > > > + *
> > > > > + * Their are two currently defined ways to control the placement of the
> > > > > + * hardware contexts on physical engines: default behavior (no flags) and
> > > > > + * I915_PARALLEL_IMPLICT_BONDS (a flag). More flags may be added the in the
> > > > > + * future as new hardware / use cases arise. Details of how to use this
> > > > > + * interface below above the flags.
> > > > > + *
> > > > > + * Returns -EINVAL if hardware context placement configuration invalid or if the
> > > > > + * placement configuration isn't supported on the platform / submission
> > > > > + * interface.
> > > > > + * Returns -ENODEV if extension isn't supported on the platform / submission
> > > > > + * inteface.
> > > > > + */
> > > > > +struct i915_context_engines_parallel_submit {
> > > > > +	struct i915_user_extension base;
> > > > 
> > > > Ok this is good, since it makes sure we can't possible use this in
> > > > CTX_SETPARAM.
> > > > 
> > > 
> > > Yep, this is at context creation time. Technically you still can call this over
> > > and over on the same gem context but Jason is taking that ability away I
> > > believe. I've also told the media team to setup the context once and don't touch
> > > it again.
> > 
> > Only if you base your context param on drm_i915_gem_context_param, which
> > can be used both at create time with
> > drm_i915_gem_context_create_ext_setparam and with the CTX_SETPARAM ioctl.
> > But you don't, so this issue is fixed at the uapi design and doesn't need
> > to interface with Jason's prot-ctx rework much.
> > 
> > There's still going to be some conflicts, so maybe ask Jason for a branch
> > and rebase GuC on top of that for the next round.
> > 
> 
> Certainly this new uAPI is going conflict. The basic GuC submission code
> shouldn't though as it doesn't touch the uAPI code at all. By the time the new
> uAPI is posted I'd hope Jason's proto-ctx rework has landed and will rebase
> then on to the tip of DRM.

Ah yes. Another good reasons to split that up into two parts, like we've
already planned to.

> > > > > +
> > > > > +/*
> > > > > + * Default placement behvavior (currently unsupported):
> > > > > + *
> > > > > + * Rather than restricting parallel submission to a single class with a
> > > > > + * logically contiguous placement (I915_PARALLEL_IMPLICT_BONDS), add a mode that
> > > > > + * enables parallel submission across multiple engine classes. In this case each
> > > > > + * context's logical engine mask indicates where that context can placed. It is
> > > > > + * implied in this mode that all contexts have mutual exclusive placement (e.g.
> > > > > + * if one context is running CS0 no other contexts can run on CS0).
> > > > > + *
> > > > > + * Example 1 pseudo code:
> > > > > + * CSX[Y] = engine class X, logical instance Y
> > > > > + * INVALID = I915_ENGINE_CLASS_INVALID, I915_ENGINE_CLASS_INVALID_NONE
> > > > > + * set_engines(INVALID, INVALID)
> > > > > + * set_load_balance(engine_index=0, num_siblings=2, engines=CS0[0],CS0[1])
> > > > > + * set_load_balance(engine_index=1, num_siblings=2, engines=CS1[0],CS1[1])
> > > > > + * set_parallel()
> > > > > + *
> > > > > + * Results in the following valid placements:
> > > > > + * CS0[0], CS1[0]
> > > > > + * CS0[0], CS1[1]
> > > > > + * CS0[1], CS1[0]
> > > > > + * CS0[1], CS1[1]
> > > > > + *
> > > > > + * Example 2 pseudo code:
> > > > > + * CS[X] = generic engine of same class, logical instance X
> > > > > + * INVALID = I915_ENGINE_CLASS_INVALID, I915_ENGINE_CLASS_INVALID_NONE
> > > > > + * set_engines(INVALID, INVALID)
> > > > > + * set_load_balance(engine_index=0, num_siblings=3, engines=CS[0],CS[1],CS[2])
> > > > > + * set_load_balance(engine_index=1, num_siblings=3, engines=CS[0],CS[1],CS[2])
> > > > > + * set_parallel()
> > > > > + *
> > > > > + * Results in the following valid placements:
> > > > > + * CS[0], CS[1]
> > > > > + * CS[0], CS[2]
> > > > > + * CS[1], CS[0]
> > > > > + * CS[1], CS[2]
> > > > > + * CS[2], CS[0]
> > > > > + * CS[2], CS[1]
> > > > > + *
> > > > > + * This enables a use case where all engines are created equally, we don't care
> > > > > + * where they are scheduled, we just want a certain number of resources, for
> > > > > + * those resources to be scheduled in parallel, and possibly across multiple
> > > > > + * engine classes.
> > > > > + */
> > > > > +
> > > > > +/*
> > > > > + * I915_PARALLEL_IMPLICT_BONDS - Create implict bonds between each context.
> > > > > + * Each context must have the same number sibling and bonds are implictly create
> > > > > + * of the siblings.
> > > > > + *
> > > > > + * All of the below examples are in logical space.
> > > > > + *
> > > > > + * Example 1 pseudo code:
> > > > > + * CS[X] = generic engine of same class, logical instance X
> > > > > + * set_engines(CS[0], CS[1])
> > > > > + * set_parallel(flags=I915_PARALLEL_IMPLICT_BONDS)
> > > > > + *
> > > > > + * Results in the following valid placements:
> > > > > + * CS[0], CS[1]
> > > > > + *

> > > > > + * CS[X] = generic engine of same class, logical instance X
> > > > > + * INVALID = I915_ENGINE_CLASS_INVALID, I915_ENGINE_CLASS_INVALID_NONE
> > > > > + * set_engines(INVALID, INVALID)
> > > > > + * set_load_balance(engine_index=0, num_siblings=2, engines=CS[0],CS[2])
> > > > > + * set_load_balance(engine_index=1, num_siblings=2, engines=CS[1],CS[3])
> > > > > + * set_parallel(flags=I915_PARALLEL_IMPLICT_BONDS)
> > > > > + *
> > > > > + * Results in the following valid placements:
> > > > > + * CS[0], CS[1]
> > > > > + * CS[2], CS[3]
> > > > > + *
> > > > > + * This enables a use case where all engines are not equal and certain placement
> > > > > + * rules are required (i.e. split-frame requires all contexts to be placed in a
> > > > > + * logically contiguous order on the VCS engines on gen11+ platforms). This use
> > > > > + * case (logically contiguous placement, within a single engine class) is
> > > > > + * supported when using GuC submission. Execlist mode could support all possible
> > > > > + * bonding configurations but currently doesn't support this extension.
> > > > > + */
> > > > > +#define I915_PARALLEL_IMPLICT_BONDS		(1<<0)
> > > > > +/*
> > > > > + * Do not allow BBs to be preempted mid BB rather insert coordinated preemption
> > > > > + * points on all hardware contexts between each set of BBs. An example use case
> > > > > + * of this feature is split-frame on gen11+ hardware. When using this feature a
> > > > > + * BB must be submitted on each hardware context in the parallel gem context.
> > > > > + * The execbuf2 IOCTL enforces the user adheres to policy.
> > > > > + */
> > > > > +#define I915_PARALLEL_NO_PREEMPT_MID_BATCH	(1<<1)
> > > > > +#define I915_PARALLEL_UNKNOWN_FLAGS  (-(I915_PARALLEL_NO_PREEMPT_MID_BATCH << 1))
> > > > > +	__u64 flags; /* all undefined flags must be zero */
> > > > > +	__u64 mbz64[4]; /* reserved for future use; must be zero */
> > > > > +} __attribute__ ((packed));
> > > > 
> > > > Ok I'm having some serious questions. This looks way too much like it's
> > > > inspired by bonded submission, and given we're tossing bonded submission
> > > > we need to make sure we're doing this for good independent reasons and not
> > > > just for intertia.
> > > > 
> > > 
> > > You are not wrong here, the bonding submission interface was a factor in
> > > designing this interface.
> > > 
> > > > What I expected looking at how media-driver uses bonded submit currently
> > > > is:
> > > > 
> > > > - We create a parallel submit engine, which occupies a virtual engine
> > > >   slot. This parallel virtual engine contains all the information we need,
> > > >   i.e. the flags you have above, but also how many engines run in parallel
> > > >   and how each of those can be load-balanced. So probably a full NxM
> > > >   matrix of physical engines needed.
> > > > 
> > > 
> > > Internally we need all this information broken out into individual structures,
> > > at least with the current implementation. We need N ring buffers, N timelines, N
> > > LRCs, N HWSPs, etc... All of this is encapsulated by a 'struct intel_context'
> > > which occupies a slot. Could we create a super object with N 'struct
> > > intel_context', sure. I'm just not sure what that buys us and IMO creates an
> > > inconsistent uAPI.
> > 
> > So if the implementation is too much work to adapt, here's a really nasty
> > trick: Currently we limit the engine slots to 64 in a gem context, because
> > that's the limit of the execbuf field. We could use the engine slots above
> > that for all these additional intel_context that we need underneath, at
> > least for execlist. Does GuC need them all too?
> > 
> > But clean approach would be to have an intel_parallal_engine struct which
> > has all these pointers internally I think.
> > 
> > Same on the high-level execbuf flow, doing all that N times is silly. So
> > again I'd assume there's one overall i915_request that tracks the parallel
> > submission, and then maybe N subordinate i915_request for each piece
> > (execlist backend definitely needs those for scheduling, I didn't check
> > about GuC).
> > 
> > Also drm/scheduler only deals with a single thing too, so that way the
> > high level code would never need to know that there's actually N things
> > underneath doing the job.
> >
> 
> Again each i915_request points to a single (and different) intel_context,
> timeline, lrc, ring, seqno, etc... The whole stack really treats these as
> individual things aside from the excl slot where we form a composite fence. Not
> saying we couldn't change this over time but initially creating a
> 'i915_super_request' would be quite the undertaking, very invasive to the mid
> layers of the stack, and not sure in the end what it buys us.
> 
> Once the parallel submit gets posted you will be able to see that it is a uAPI
> context setup extension, updates the execbuf IOCTL to accept N batches which is
> basically a for loop, and GuC backend being able to submit N batches at once -
> the mid layers are almost completely untouched.
> 
> Lastly, if we need to support the parallel submit extension as purposed for
> execlists, all we need to do is update the uAPI setup extension to configure the
> contexts. If we create a 'i915_super_request' we would have a massive rework in
> execlist backend too.

Yeah I'm fully aware that the current codebase puts us in a very awkward
corner. But also designing uapi by exposing whatever we have internally
right now is also not a good idea.

That's why I've suggested the idea to make the uapi use a single uapi
engine on the gem context, and (for now at least) internally fake it all.
Including the glorious for() loop over everything in execbuf.

> > > > - Execbuf uses that parallel virtual engine to submit all N batchbuffers
> > > >   in one go.
> > > > 
> > > 
> > > If we expose 1 or N engines it doesn't really matter, does it? Either way the
> > > entire GEM context is configured for N BBs in a single IOCTL.
> > > 
> > > > - This means we don't create virtual engines (or physical engine mappings)
> > > >   for all the individual pieces in a parallel engine. That's a concept
> > > >   from bonded submission, and I think that needs to go.
> > > > 
> > > 
> > > Again this isn't strickly true - we need N internal backing structures.
> > 
> > I didn't check the code, but iirc you said for the GuC backend you do
> > nothing until the last submit. Only then it's pushed into the GuC. That
> > sounds a bit silly, and by treating parallel submission as a single thing
> > (which might or mightnot be split in lower levels) this would go away.
> >
> 
> We update internal state on each submit, the last submit is the one to interact
> with the GuC.

Sounds very much like sunk cost fallacy driven implementation design, but
oh well.

> > But it also might be way too much churn, because there's a bunch of places
> > where we have to do this splitting. If it's all, then maybe just keeping
> > the engines around everywhere makes sense.
> > 
> > But also this is leaking implementation details into uapi, from umd pov
> > it's really 1 virtual engine that gets 1 execbuf call to submit N batches.
> > Leaking that we treat it as N engines underneath feels like a mistake.
> >
> 
> Too be clear, changing from N slots to 1 slot isn't that big of a deal. Changing
> from N i915_requests to 1 is a *huge* deal.
> 
> N slots to 1 slots will just touch the uAPI setup extension and the execbuf
> IOCTL.
> 
> N i915_requests to 1 will ripple thoughout the entire stack. 

Yeah I think going to 1 i915_request is something we need to postpone and
decide later on whether it makes sense or not.

But making sure the uapi isn't putting roadblocks in that way is something
we need to fix now. And I do think in a clean slate world, ignoring all
the code we have and especially the current midlayer and execlist backend
code, a single ctx/request/execbuf is the right design here. Or well,
would have been.

Except if you now tell me that GuC actually wants N submissions, but my
understanding is it really just wants 1. And the only fan-out we have to
do is plug the right N batchbuffers into the right N LRC of the overall
GuC mutli-LRC context. But that seems not the case.

> > > > - More important not having a parallel virtual engine breaks our already
> > > >   badly confusing gem ctx api. Ignoring parallel/bonded submit the gem ctx
> > > >   is just a container object, which points at a bunch of engines (plus the
> > > >   VM and a few other things). Having parallel context something that sits
> > > >   at the gem ctx level, and not as an individual engine (of which you can
> > > >   have multiple in the same gem ctx) breaks stuff. E.g. right the perf api
> > > >   sits at the gem ctx level, so that you can capture all the perf data for
> > > >   an entire workload spawning across multiple engines. If a workload now
> > > >   needs multiple parallel engines we'd need multiple gem ctx, which breaks
> > > >   this.
> > > 
> > > This uAPI allows only 1 parallel context per gem context which isn't ideal. I'd
> > > love to fix this and changing a context to a single slot might be able to fix
> > > this.
> > 
> > Yeah this is essentially the main gripe I have with this. Everywhere else
> > you submit to a (gem_ctx_id, engine_slot) pair. Except for parallel
> > submit, where you submit to a gem_ctx_id and the engine slot doesn't
> > matter. That's a rather unfortunate uapi.
> > 
> 
> Yea this isn't ideal but we've kinda backed ourselves into a corner here at
> least consistency wise.
> 
> As purposed we basically have 2 steps to configure a gem context:
> 
> 1. Define placement rules (set_engines, set_load_balance)
> 2. Indicate this context is used for parallel submission (set_parallel)
> 
> What would the uAPI look like where each parallel context occupies a slot?
> 
> 1. Define the number of slots (set_engines)
> 2. For each slot allow a virtual or parallel context (set_load_balance,
> set_parallel)
> 
> The set_parallel would have to contain all the placement information for 2 to N
> contexts, right? So each set_parallel is a chained extension too. Now we have a
> two-level chain in our IOCTL.
> 
> e.g.
> 
> set_engines (3 slots) -> set_load_balance (slot 0) -> set_parallel (slot 1) -> set_load_balance (slot 2)
>                                                            |
>                                                            +-> placement for context 0 -> placement for context 1, etc...
> 
> IMO this seems like a bigger mess but I suppose it would work.

This sounds a bit like overengineering. All we need for the parallel
virtual engines are a bunch of parameters (engine slot, num_slots,
num_siblings) and then a num_slots X num_siblings array at the end with
the placements for all combinations.

i915_user_extensions already allow for that array at the end (that's why
extensions are chained with pointers, not by being one-after-the-other in
an array), so this already works as-is. See e.g. how set_load_balance or
set_engines work, they already do that.

So purely from an uapi pov I'm not seeing the trouble?
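
Concretely I'd expect something in this direction - rough sketch only, the
field names and exact layout here are made up for illustration and not meant
as the final uAPI:

struct i915_context_engines_parallel_submit {
	struct i915_user_extension base;
	__u16 engine_index;	/* slot in the gem ctx this parallel engine occupies */
	__u16 width;		/* N, number of BBs per execbuf */
	__u16 num_siblings;	/* M, number of possible placements per BB */
	__u16 mbz16;		/* reserved, must be zero */
	__u64 flags;		/* all undefined flags must be zero */
	__u64 mbz64[3];		/* reserved for future use, must be zero */
	/*
	 * N x M array of possible placements, same variable-sized tail trick
	 * that set_engines and set_load_balance already use.
	 */
	struct i915_engine_class_instance engines[0];
} __attribute__((packed));

That describes the entire parallel engine in one extension, chained off the
engines array like set_load_balance already is, so nothing else in the gem ctx
setup needs to change.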

> As you say below gem contexts are moving towards just being containers so does
> it really matter if a UMD has to create a gem context per parallel context?
> They still can share address space, pass fences between them, etc...

Mostly my concern is that for everything else we execute stuff on an
intel_context and track it with an i915_request. Except for parallel
submission, where we have N-1 fake intel_context and fake i915_request,
and only the last one of each is actually triggering submission.

That's awkward code design at best, and I'd like us to at least have the
option to fix it in the future.

Of course we can fix it all without changing the uapi, like Jason is doing
with the proto ctx rework. But that always comes at the cost of code and
complexity, which isn't strictly needed here I think.

> If we really want to go the direction of 1 slot per parallel context I can hack
> up a PoC branch when I have time. 

I think the only impact of the minimal plan would be:

- slight change to the media-driver setup function, but that's already
  fully encapsulated (at least from a cursory look)

- creating a pile of fake contexts so that current execbuf code and
  scheduler midlayers don't panic, at engine offsets userspace can't see
  them (rough illustration of that idea below)

- some changes in the execbuf code to get at all the fake contexts and
  figure out how many batchbuffers we need

None of this should be gigantic, but it keeps the door open so that we can fix
the internals properly, without having to carry a compat layer around
forever.
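
To spell out the "fake contexts at hidden engine offsets" point above, here is
a purely illustrative snippet - not actual i915 code or naming, just the shape
of the idea:

#define USER_SLOTS 64	/* I915_EXEC_RING_MASK + 1, the execbuf addressing limit */

struct engine;		/* stand-in for intel_context */

struct engine_map {
	struct engine *slots[USER_SLOTS + 8];	/* user slots + hidden children */
	unsigned int num_hidden;
};

/* execbuf / uapi path: only the user-visible range is reachable */
static struct engine *lookup_user_engine(const struct engine_map *m,
					 unsigned int idx)
{
	return idx < USER_SLOTS ? m->slots[idx] : NULL;
}

/* internal path: the n-th child backing the parallel engine, hidden from users */
static struct engine *lookup_parallel_child(const struct engine_map *m,
					    unsigned int n)
{
	return n < m->num_hidden ? m->slots[USER_SLOTS + n] : NULL;
}

The existing lookup helpers keep working unchanged, while userspace has no way
to address the extra entries from an execbuf.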

Thoughts? If you think "not worth at all" and Jason concurs then I'm happy
to let this one slide.

Cheers, Daniel

> 
> Matt
> 
> > Now with bonded submit this made some sense (not that bonded submit itself
> > made much sense), since you did indeed submit N batchbuffers to N
> > (gem_ctx_id, engine_slot) pairs. But with parallel submit it's really just
> > one execbuf call.
> > 
> > > > So what I'd expect we'd have here is roughly:
> > > > 
> > > > struct i915_context_engines_parallel_submit {
> > > > 	struct i915_user_extension base;
> > > > 	__u64 flags;
> > > > 	__u32 num_engines; /* N, must match what we submit in the execbuf */
> > > > 	__u32 num_siblings; /* M, I'm assuming it's ok we require that siblings must match across the entire set of parallel engines */
> > > > 	struct engine_info[]; /* NxM array of engine infos, pls fill in the right struct name :-) */
> > > > };
> > > > 
> > > > If we then also require that you always submit the full width of N
> > > > batchbuffers then even the execbuf extension doesn't need to exist
> > > > anymore, because the virtual parallel engine already contains all the
> > > > needed information.
> > > > 
> > > > And sure for some backends at least (definitely execlist) we'd need to
> > > > create a bunch of additional virtual engines behind that virtual engine.
> > > > But they'd be entirely hidden, and not visible to userspace nor the higher
> > > > levels.
> > > >
> > > > What am I missing?
> > > 
> > > Not really, I think you got it. I think at the end of the day this really comes
> > > down to whether we want to allow more than 1 parallel virtual engine per gem context. If
> > > the answer is yes we collapse a parallel virtual engine into a single slot, if
> > > not we leave as is.
> > 
> > Yup. So right now media uses one gem context per engine they need. Since
> > media doesn't care about perf/OA they could get shared VM by sharing the
> > VM across gem ctx, which they already do. So probably we could get away with
> > leaving parallel engines as a gem ctx level thing.
> > 
> > Also on the media-driver code the impact is nil since it's just a
> > different chain of context extensions in the same ioctl call.
> > 
> > Bigger picture is that Jason is quite unhappy with our gem ctx based
> > uapi, and his long term idea is to make gem ctx into a pure container
> > object with pointers to engines and a vm. And not something that has
> > relevance itself. Currently that's not the case for perf/OA, which works
> > on the gem ctx, and Jason's already unhappy about that one. So adding more
> > stuff on the gem ctx level feels a bit like a mistake.
> > 
> > Cheers, Daniel
> > 
> > > 
> > > Matt
> > > 
> > > > -Daniel
> > > > 
> > > > >  #define I915_DEFINE_CONTEXT_PARAM_ENGINES(name__, N__) struct { \
> > > > >  	__u64 extensions; \
> > > > >  	struct i915_engine_class_instance engines[N__]; \
> > > > > -- 
> > > > > 2.28.0
> > > > > 
> > > > > _______________________________________________
> > > > > Intel-gfx mailing list
> > > > > Intel-gfx@lists.freedesktop.org
> > > > > https://lists.freedesktop.org/mailman/listinfo/intel-gfx
> > > > 
> > > > -- 
> > > > Daniel Vetter
> > > > Software Engineer, Intel Corporation
> > > > http://blog.ffwll.ch
> > 
> > -- 
> > Daniel Vetter
> > Software Engineer, Intel Corporation
> > http://blog.ffwll.ch

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [Intel-gfx] [RFC PATCH 4/5] drm/i915: Introduce 'set parallel submit' extension
  2021-05-17 13:55             ` Daniel Vetter
@ 2021-05-17 17:46               ` Matthew Brost
  -1 siblings, 0 replies; 41+ messages in thread
From: Matthew Brost @ 2021-05-17 17:46 UTC (permalink / raw)
  To: Daniel Vetter
  Cc: jason.ekstrand, daniel.vetter, intel-gfx, dri-devel, carl.zhang

On Mon, May 17, 2021 at 03:55:59PM +0200, Daniel Vetter wrote:
> On Fri, May 14, 2021 at 01:05:33PM -0700, Matthew Brost wrote:
> > On Wed, May 12, 2021 at 10:34:59AM +0200, Daniel Vetter wrote:
> > > On Tue, May 11, 2021 at 11:44:28AM -0700, Matthew Brost wrote:
> > > > On Tue, May 11, 2021 at 05:11:44PM +0200, Daniel Vetter wrote:
> > > > > On Thu, May 06, 2021 at 10:30:48AM -0700, Matthew Brost wrote:
> > > > > > i915_drm.h updates for 'set parallel submit' extension.
> > > > > > 
> > > > > > Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> > > > > > Cc: Tony Ye <tony.ye@intel.com>
> > > > > > CC: Carl Zhang <carl.zhang@intel.com>
> > > > > > Cc: Daniel Vetter <daniel.vetter@intel.com>
> > > > > > Cc: Jason Ekstrand <jason@jlekstrand.net>
> > > > > > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > > > > > ---
> > > > > >  include/uapi/drm/i915_drm.h | 126 ++++++++++++++++++++++++++++++++++++
> > > > > >  1 file changed, 126 insertions(+)
> > > > > > 
> > > > > > diff --git a/include/uapi/drm/i915_drm.h b/include/uapi/drm/i915_drm.h
> > > > > > index 26d2e135aa31..0175b12b33b8 100644
> > > > > > --- a/include/uapi/drm/i915_drm.h
> > > > > > +++ b/include/uapi/drm/i915_drm.h
> > > > > > @@ -1712,6 +1712,7 @@ struct drm_i915_gem_context_param {
> > > > > >   * Extensions:
> > > > > >   *   i915_context_engines_load_balance (I915_CONTEXT_ENGINES_EXT_LOAD_BALANCE)
> > > > > >   *   i915_context_engines_bond (I915_CONTEXT_ENGINES_EXT_BOND)
> > > > > > + *   i915_context_engines_parallel_submit (I915_CONTEXT_ENGINES_EXT_PARALLEL_SUBMIT)
> > > > > 
> > > > > Hm, just realized, but I don't think this hyperlinks correctly, and I'm
> > > > > also not sure this formats very well as a nice list. Using item lists
> > > > > should look pretty nice like we're doing for the various kms properties,
> > > > > e.g.
> > > > > 
> > > > > FOO:
> > > > >   Explain what FOO does
> > > > > 
> > > > > BAR:
> > > > >   Explain what BAR does. struct bar also automatically generates a link
> > > > > 
> > > > > Please check with make htmldocs and polish this a bit (might need a small
> > > > > prep patch).
> > > > > 
> > > > 
> > > > I agree the doc should look nice. To get there I might need to chat with you on
> > > > IRC as I'm new to this. 
> > > > 
> > > > > >   */
> > > > > >  #define I915_CONTEXT_PARAM_ENGINES	0xa
> > > > > >  
> > > > > > @@ -1894,9 +1895,134 @@ struct i915_context_param_engines {
> > > > > >  	__u64 extensions; /* linked chain of extension blocks, 0 terminates */
> > > > > >  #define I915_CONTEXT_ENGINES_EXT_LOAD_BALANCE 0 /* see i915_context_engines_load_balance */
> > > > > >  #define I915_CONTEXT_ENGINES_EXT_BOND 1 /* see i915_context_engines_bond */
> > > > > > +#define I915_CONTEXT_ENGINES_EXT_PARALLEL_SUBMIT 2 /* see i915_context_engines_parallel_submit */
> > > > > >  	struct i915_engine_class_instance engines[0];
> > > > > >  } __attribute__((packed));
> > > > > >  
> > > > > > +/*
> > > > > > + * i915_context_engines_parallel_submit:
> > > > > > + *
> > > > > > + * Set up a gem context to allow multiple BBs to be submitted in a single execbuf
> > > > > > + * IOCTL. Those BBs will then be scheduled to run on the GPU in parallel.
> > > > > > + *
> > > > > > + * All hardware contexts in the engine set are configured for parallel
> > > > > > + * submission (i.e. once this gem context is configured for parallel submission,
> > > > > > + * all the hardware contexts, regardless of whether a BB is available on each
> > > > > > + * individual context, will be submitted to the GPU in parallel). A user can
> > > > > > + * submit BBs to a subset of the hardware contexts, in a single execbuf IOCTL,
> > > > > > + * but it is not recommended as it may reserve physical engines with nothing to
> > > > > > + * run on them. It is highly recommended to configure the gem context with N
> > > > > > + * hardware contexts and then always submit N BBs in a single IOCTL.
> > > > > > + *
> > > > > > + * There are two currently defined ways to control the placement of the
> > > > > > + * hardware contexts on physical engines: default behavior (no flags) and
> > > > > > + * I915_PARALLEL_IMPLICT_BONDS (a flag). More flags may be added in the
> > > > > > + * future as new hardware / use cases arise. Details of how to use this
> > > > > > + * interface are given below, above each flag.
> > > > > > + *
> > > > > > + * Returns -EINVAL if the hardware context placement configuration is invalid
> > > > > > + * or if the placement configuration isn't supported on the platform /
> > > > > > + * submission interface.
> > > > > > + * Returns -ENODEV if the extension isn't supported on the platform / submission
> > > > > > + * interface.
> > > > > > + */
> > > > > > +struct i915_context_engines_parallel_submit {
> > > > > > +	struct i915_user_extension base;
> > > > > 
> > > > > Ok this is good, since it makes sure we can't possible use this in
> > > > > CTX_SETPARAM.
> > > > > 
> > > > 
> > > > Yep, this is at context creation time. Technically you still can call this over
> > > > and over on the same gem context but Jason is taking that ability away I
> > > > believe. I've also told the media team to set up the context once and not touch
> > > > it again.
> > > 
> > > Only if you base your context param on drm_i915_gem_context_param, which
> > > can be used both at create time with
> > > drm_i915_gem_context_create_ext_setparam and with the CTX_SETPARAM ioctl.
> > > But you don't, so this issue is fixed at the uapi design and doesn't need
> > > to interface with Jason's proto-ctx rework much.
> > > 
> > > There's still going to be some conflicts, so maybe ask Jason for a branch
> > > and rebase GuC on top of that for the next round.
> > > 
> > 
> > Certainly this new uAPI is going to conflict. The basic GuC submission code
> > shouldn't though as it doesn't touch the uAPI code at all. By the time the new
> > uAPI is posted I'd hope Jason's proto-ctx rework has landed, and I will rebase
> > then onto the tip of DRM.
> 
> Ah yes. Another good reasons to split that up into two parts, like we've
> already planned to.
> 

Yep.

> > > > > > +
> > > > > > +/*
> > > > > > + * Default placement behavior (currently unsupported):
> > > > > > + *
> > > > > > + * Rather than restricting parallel submission to a single class with a
> > > > > > + * logically contiguous placement (I915_PARALLEL_IMPLICT_BONDS), add a mode that
> > > > > > + * enables parallel submission across multiple engine classes. In this case each
> > > > > > + * context's logical engine mask indicates where that context can be placed. It is
> > > > > > + * implied in this mode that all contexts have mutually exclusive placement (e.g.
> > > > > > + * if one context is running on CS0 no other context can run on CS0).
> > > > > > + *
> > > > > > + * Example 1 pseudo code:
> > > > > > + * CSX[Y] = engine class X, logical instance Y
> > > > > > + * INVALID = I915_ENGINE_CLASS_INVALID, I915_ENGINE_CLASS_INVALID_NONE
> > > > > > + * set_engines(INVALID, INVALID)
> > > > > > + * set_load_balance(engine_index=0, num_siblings=2, engines=CS0[0],CS0[1])
> > > > > > + * set_load_balance(engine_index=1, num_siblings=2, engines=CS1[0],CS1[1])
> > > > > > + * set_parallel()
> > > > > > + *
> > > > > > + * Results in the following valid placements:
> > > > > > + * CS0[0], CS1[0]
> > > > > > + * CS0[0], CS1[1]
> > > > > > + * CS0[1], CS1[0]
> > > > > > + * CS0[1], CS1[1]
> > > > > > + *
> > > > > > + * Example 2 pseudo code:
> > > > > > + * CS[X] = generic engine of same class, logical instance X
> > > > > > + * INVALID = I915_ENGINE_CLASS_INVALID, I915_ENGINE_CLASS_INVALID_NONE
> > > > > > + * set_engines(INVALID, INVALID)
> > > > > > + * set_load_balance(engine_index=0, num_siblings=3, engines=CS[0],CS[1],CS[2])
> > > > > > + * set_load_balance(engine_index=1, num_siblings=3, engines=CS[0],CS[1],CS[2])
> > > > > > + * set_parallel()
> > > > > > + *
> > > > > > + * Results in the following valid placements:
> > > > > > + * CS[0], CS[1]
> > > > > > + * CS[0], CS[2]
> > > > > > + * CS[1], CS[0]
> > > > > > + * CS[1], CS[2]
> > > > > > + * CS[2], CS[0]
> > > > > > + * CS[2], CS[1]
> > > > > > + *
> > > > > > + * This enables a use case where all engines are created equally, we don't care
> > > > > > + * where they are scheduled, we just want a certain number of resources, for
> > > > > > + * those resources to be scheduled in parallel, and possibly across multiple
> > > > > > + * engine classes.
> > > > > > + */
> > > > > > +
> > > > > > +/*
> > > > > > + * I915_PARALLEL_IMPLICT_BONDS - Create implicit bonds between each context.
> > > > > > + * Each context must have the same number of siblings and bonds are implicitly
> > > > > > + * created between the siblings.
> > > > > > + *
> > > > > > + * All of the below examples are in logical space.
> > > > > > + *
> > > > > > + * Example 1 pseudo code:
> > > > > > + * CS[X] = generic engine of same class, logical instance X
> > > > > > + * set_engines(CS[0], CS[1])
> > > > > > + * set_parallel(flags=I915_PARALLEL_IMPLICT_BONDS)
> > > > > > + *
> > > > > > + * Results in the following valid placements:
> > > > > > + * CS[0], CS[1]
> > > > > > + *
> 
> > > > > > + * CS[X] = generic engine of same class, logical instance X
> > > > > > + * INVALID = I915_ENGINE_CLASS_INVALID, I915_ENGINE_CLASS_INVALID_NONE
> > > > > > + * set_engines(INVALID, INVALID)
> > > > > > + * set_load_balance(engine_index=0, num_siblings=2, engines=CS[0],CS[2])
> > > > > > + * set_load_balance(engine_index=1, num_siblings=2, engines=CS[1],CS[3])
> > > > > > + * set_parallel(flags=I915_PARALLEL_IMPLICT_BONDS)
> > > > > > + *
> > > > > > + * Results in the following valid placements:
> > > > > > + * CS[0], CS[1]
> > > > > > + * CS[2], CS[3]
> > > > > > + *
> > > > > > + * This enables a use case where all engines are not equal and certain placement
> > > > > > + * rules are required (i.e. split-frame requires all contexts to be placed in a
> > > > > > + * logically contiguous order on the VCS engines on gen11+ platforms). This use
> > > > > > + * case (logically contiguous placement, within a single engine class) is
> > > > > > + * supported when using GuC submission. Execlist mode could support all possible
> > > > > > + * bonding configurations but currently doesn't support this extension.
> > > > > > + */
> > > > > > +#define I915_PARALLEL_IMPLICT_BONDS		(1<<0)
> > > > > > +/*
> > > > > > + * Do not allow BBs to be preempted mid-BB; rather, insert coordinated preemption
> > > > > > + * points on all hardware contexts between each set of BBs. An example use case
> > > > > > + * of this feature is split-frame on gen11+ hardware. When using this feature a
> > > > > > + * BB must be submitted on each hardware context in the parallel gem context.
> > > > > > + * The execbuf2 IOCTL enforces that the user adheres to this policy.
> > > > > > + */
> > > > > > +#define I915_PARALLEL_NO_PREEMPT_MID_BATCH	(1<<1)
> > > > > > +#define I915_PARALLEL_UNKNOWN_FLAGS  (-(I915_PARALLEL_NO_PREEMPT_MID_BATCH << 1))
> > > > > > +	__u64 flags; /* all undefined flags must be zero */
> > > > > > +	__u64 mbz64[4]; /* reserved for future use; must be zero */
> > > > > > +} __attribute__ ((packed));
> > > > > 
> > > > > Ok I'm having some serious questions. This looks way too much like it's
> > > > > inspired by bonded submission, and given we're tossing bonded submission
> > > > > we need to make sure we're doing this for good independent reasons and not
> > > > > just for inertia.
> > > > > 
> > > > 
> > > > You are not wrong here, the bonding submission interface was a factor in
> > > > designing this interface.
> > > > 
> > > > > What I expected looking at how media-driver uses bonded submit currently
> > > > > is:
> > > > > 
> > > > > - We create a parallel submit engine, which occupies a virtual engine
> > > > >   slot. This parallel virtual engine contains all the information we need,
> > > > >   i.e. the flags you have above, but also how many engines run in parallel
> > > > >   and how each of those can be load-balanced. So probably a full NxM
> > > > >   matrix of physical engines needed.
> > > > > 
> > > > 
> > > > Internally we need all this information broken out into individual structures,
> > > > at least with the current implementation. We need N ring buffers, N timelines, N
> > > > LRCs, N HWSPs, etc... All of this is encapsulated by a 'struct intel_context'
> > > > which occupies a slot. Could we create a super object with N 'struct
> > > > intel_context', sure. I'm just not sure what that buys us and IMO creates an
> > > > inconsistent uAPI.
> > > 
> > > So if the implementation is too much work to adapt, here's a really nasty
> > > trick: Currently we limit the engine slots to 64 in a gem context, because
> > > that's the limit of the execbuf field. We could use the engine slots above
> > > that for all these additional intel_context that we need underneath, at
> > > least for execlist. Does GuC need them all too?
> > > 
> > > But clean approach would be to have an intel_parallal_engine struct which
> > > has all these pointers internally I think.
> > > 
> > > Same on the high-level execbuf flow, doing all that N times is silly. So
> > > again I'd assume there's one overall i915_request that tracks the parallel
> > > submission, and then maybe N subordinate i915_request for each piece
> > > (execlist backend definitely needs those for scheduling, I didn't check
> > > about GuC).
> > > 
> > > Also drm/scheduler only deals with a single thing too, so that way the
> > > high level code would never need to know that there's actually N things
> > > underneath doing the job.
> > >
> > 
> > Again each i915_request points to a single (and different) intel_context,
> > timeline, lrc, ring, seqno, etc... The whole stack really treats these as
> > individual things aside from the excl slot where we form a composite fence. Not
> > saying we couldn't change this over time but initially creating an
> > 'i915_super_request' would be quite the undertaking, very invasive to the mid
> > layers of the stack, and not sure in the end what it buys us.
> > 
> > Once the parallel submit gets posted you will be able to see that it is a uAPI
> > context setup extension, updates the execbuf IOCTL to accept N batches which is
> > basically a for loop, and GuC backend being able to submit N batches at once -
> > the mid layers are almost completely untouched.
> > 
> > Lastly, if we need to support the parallel submit extension as proposed for
> > execlists, all we need to do is update the uAPI setup extension to configure the
> > contexts. If we create an 'i915_super_request' we would have a massive rework in
> > the execlist backend too.
> 
> Yeah I'm fully aware that the current codebase puts us in a very awkward
> corner. But also designing uapi by exposing whatever we have internally
> right now is also not a good idea.
> 

Agree our internals shouldn't dictate our uAPI.

> That's why I've suggested the idea to make the uapi use a single uapi
> engine on the gem context, and (for now at least) internally fake it all.
> Including the glorious for() loop over everything in execbuf.
> 
> > > > > - Execbuf uses that parallel virtual engine to submit all N batchbuffers
> > > > >   in one go.
> > > > > 
> > > > 
> > > > If we expose 1 or N engines it doesn't really matter, does it? Either way the
> > > > entire GEM context is configured for N BBs in a single IOCTL.
> > > > 
> > > > > - This means we don't create virtual engines (or physical engine mappings)
> > > > >   for all the individual pieces in a parallel engine. That's a concept
> > > > >   from bonded submission, and I think that needs to go.
> > > > > 
> > > > 
> > > > Again this isn't strictly true - we need N internal backing structures.
> > > 
> > > I didn't check the code, but iirc you said for the GuC backend you do
> > > nothing until the last submit. Only then it's pushed into the GuC. That
> > > sounds a bit silly, and by treating parallel submission as a single thing
> > > (which might or might not be split in lower levels) this would go away.
> > >
> > 
> > We update internal state on each submit, the last submit is the one to interact
> > with the GuC.
> 
> Sounds very much like sunk cost fallacy driven implementation design, but
> oh well.
> 
> > > But it also might be way too much churn, because there's a bunch of places
> > > where we have to do this splitting. If it's all, then maybe just keeping
> > > the engines around everywhere makes sense.
> > > 
> > > But also this is leaking implementation details into uapi, from umd pov
> > > it's really 1 virtual engine that gets 1 execbuf call to submit N batches.
> > > Leaking that we treat it as N engines underneath feels like a mistake.
> > >
> > 
> > To be clear, changing from N slots to 1 slot isn't that big of a deal. Changing
> > from N i915_requests to 1 is a *huge* deal.
> > 
> > N slots to 1 slots will just touch the uAPI setup extension and the execbuf
> > IOCTL.
> > 
> > N i915_requests to 1 will ripple throughout the entire stack.
> 
> Yeah I think going to 1 i915_request is something we need to postpone and
> decide later on whether it makes sense or not.
> 

Glad we are on the same page about this - we can revisit this later.

> But making sure the uapi isn't putting roadblocks in that way is something
> we need to fix now. And I do think in a clean slate world, ignoring all
> the code we have and especially the current midlayer and execlist backend
> code, a single ctx/request/execbuf is the right design here. Or well,
> would have been.
> 
> Except if you now tell me that GuC actually wants N submissions, but my
> understanding is it really just wants 1. And the only fan-out we have to
> do is plug the right N batchbuffers into the right N LRC of the overall
> GuC multi-LRC context. But that seems not to be the case.
> 
> > > > > - More importantly, not having a parallel virtual engine breaks our already
> > > > >   badly confusing gem ctx api. Ignoring parallel/bonded submit, the gem ctx
> > > > >   is just a container object, which points at a bunch of engines (plus the
> > > > >   VM and a few other things). Having the parallel context be something that
> > > > >   sits at the gem ctx level, and not as an individual engine (of which you can
> > > > >   have multiple in the same gem ctx) breaks stuff. E.g. right now the perf api
> > > > >   sits at the gem ctx level, so that you can capture all the perf data for
> > > > >   an entire workload spanning across multiple engines. If a workload now
> > > > >   needs multiple parallel engines we'd need multiple gem ctx, which breaks
> > > > >   this.
> > > > 
> > > > This uAPI allows only 1 parallel context per gem context which isn't ideal. I'd
> > > > love to fix this and changing a context to a single slot might be able to fix
> > > > this.
> > > 
> > > Yeah this is essentially the main gripe I have with this. Everywhere else
> > > you submit to a (gem_ctx_id, engine_slot) pair. Except for parallel
> > > submit, where you submit to a gem_ctx_id and the engine slot doesn't
> > > matter. That's a rather unfortunate uapi.
> > > 
> > 
> > Yea this isn't ideal but we've kinda backed ourselves into a corner here at
> > least consistency wise.
> > 
> > As proposed we basically have 2 steps to configure a gem context:
> > 
> > 1. Define placement rules (set_engines, set_load_balance)
> > 2. Indicate this context is used for parallel submission (set_parallel)
> > 
> > What would the uAPI look like where each parallel context occupies a slot?
> > 
> > 1. Define the number of slots (set_engines)
> > 2. For each slot allow a virtual or parallel context (set_load_balance,
> > set_parallel)
> > 
> > The set_parallel would have to contain all the placement information for 2 to N
> > contexts, right? So each set_parallel is a chained extension too. Now we have a
> > two-level chain in our IOCTL.
> > 
> > e.g.
> > 
> > set_engines (3 slots) -> set_load_balance (slot 0) -> set_parallel (slot 1) -> set_load_balance (slot 2)
> >                                                            |
> >                                                            +-> placement for context 0 -> placement for context 1, etc...
> > 
> > IMO this seems like a bigger mess but I suppose it would work.
> 
> This sounds a bit like overengineering. All we need for the parallel
> virtual engines are a bunch of parameters (engine slot, num_slots,
> num_siblings) and then a num_slots X num_siblings array at the end with
> the placements for all combinations.
> 
> i915_user_extensions already allow for that array at the end (that's why
> extensions are chained with pointers, not by being one-after-the-other in
> an array), so this already works as-is. See e.g. how set_load_balance or
> set_engines work, they already do that.
> 
> So purely from an uapi pov I'm not seeing the trouble?
>

Ah, I forgot that each user extension can be variable sized. Making each
parallel engine into 1 slot with a single setup extension does indeed make sense
and should be easy enough to do. I'll get a quick PoC of this internally to
flush out any issues before my next RFC post with all the uAPI details.
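
For reference, the userspace side of that PoC would look roughly like the
below. This is a sketch only: the parallel extension layout is the hypothetical
single-slot version discussed above (struct name, fields and array layout are
not final uAPI), while the surrounding context-create plumbing is the existing
upstream path.

#include <stdint.h>
#include <xf86drm.h>
#include <drm/i915_drm.h>

/* hypothetical single-slot parallel extension, sized for width=2 x siblings=2 */
struct parallel_submit_sketch {
	struct i915_user_extension base;
	__u16 engine_index;	/* which user-visible slot this parallel engine occupies */
	__u16 width;		/* N, BBs per execbuf */
	__u16 num_siblings;	/* M, possible placements per BB */
	__u16 mbz16;
	__u64 flags;
	__u64 mbz64[3];
	struct i915_engine_class_instance engines[2 * 2];	/* N x M placements */
} __attribute__((packed));

static int create_parallel_ctx(int fd)
{
	struct parallel_submit_sketch parallel = {
		.base.name = I915_CONTEXT_ENGINES_EXT_PARALLEL_SUBMIT, /* value from this RFC */
		.engine_index = 0,
		.width = 2,
		.num_siblings = 2,
		.engines = {	/* assumed slot-major: placements for BB0, then BB1 */
			{ I915_ENGINE_CLASS_VIDEO, 0 }, { I915_ENGINE_CLASS_VIDEO, 2 },
			{ I915_ENGINE_CLASS_VIDEO, 1 }, { I915_ENGINE_CLASS_VIDEO, 3 },
		},
	};
	/* one user-visible engine slot; the extension defines what backs it */
	I915_DEFINE_CONTEXT_PARAM_ENGINES(engines, 1) = {
		.extensions = (uintptr_t)&parallel,
		.engines = { { I915_ENGINE_CLASS_INVALID,
			       I915_ENGINE_CLASS_INVALID_NONE } },
	};
	struct drm_i915_gem_context_create_ext_setparam p_engines = {
		.base.name = I915_CONTEXT_CREATE_EXT_SETPARAM,
		.param = {
			.param = I915_CONTEXT_PARAM_ENGINES,
			.value = (uintptr_t)&engines,
			.size = sizeof(engines),
		},
	};
	struct drm_i915_gem_context_create_ext create = {
		.flags = I915_CONTEXT_CREATE_FLAGS_USE_EXTENSIONS,
		.extensions = (uintptr_t)&p_engines,
	};

	if (drmIoctl(fd, DRM_IOCTL_I915_GEM_CONTEXT_CREATE_EXT, &create))
		return -1;
	return create.ctx_id;	/* execbuf then always passes 2 BBs against slot 0 */
}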
 
> > As you say below gem contexts are moving towards just being containers so does
> > it really matter if a UMD has to create a gem context per parallel context?
> > They still can share address space, pass fences between them, etc...
> 
> Mostly my concern is that for everything else we execute stuff on an
> intel_context and track it with an i915_request. Except for parallel
> submission, where we have N-1 fake intel_context and fake i915_request,
> and only the last one of each is actually triggering submission.
> 
> That's awkward code design at best, and I'd like us to at least have the
> option to fix it in the future.
> 
> Of course we can fix it all without changing the uapi, like Jason is doing
> with the proto ctx rework. But that always comes at the cost of code and
> complexity, which isn't strictly needed here I think.
> 
> > If we really want to go the direction of 1 slot per parallel context I can hack
> > up a PoC branch when I have time. 
> 
> I think the only impact of the minimal plan would be:
> 
> - slight change to the media-driver setup function, but that's already
>   fully encapsulated (at least from a cursory look)
> 
> - creating a pile of fake contexts so that current execbuf code and
>   scheduler midlayers don't panic, at engine offsets userspace can't see
>   them
> 
> - some changes in the execbuf code to get at all the fake contexts and
>   figure out how many batchbuffers we need
> 
> None of this should be gigantic, but it keeps the door open so that we can fix
> the internals properly, without having to carry a compat layer around
> forever.
> 
> Thoughts? If you think "not worth at all" and Jason concurs then I'm happy
> to let this one slide.

I agree with this plan, let's execute it.

Matt

> 
> Cheers, Daniel
> 
> > 
> > Matt
> > 
> > > Now with bonded submit this made some sense (not that bonded submit itself
> > > made much sense), since you did indeed submit N batchbuffers to N
> > > (gem_ctx_id, engine_slot) pairs. But with parallel submit it's really just
> > > one execbuf call.
> > > 
> > > > > So what I'd expect we'd have here is roughly:
> > > > > 
> > > > > struct i915_context_engines_parallel_submit {
> > > > > 	struct i915_user_extension base;
> > > > > 	__u64 flags;
> > > > > 	__u32 num_engines; /* N, must match what we submit in the execbuf */
> > > > > 	__u32 num_siblings; /* M, I'm assuming it's ok we require that siblings must match across the entire set of parallel engines */
> > > > > 	struct engine_info[]; /* NxM array of engine infos, pls fill in the right struct name :-) */
> > > > > };
> > > > > 
> > > > > If we then also require that you always submit the full width of N
> > > > > batchbuffers then even the execbuf extension doesn't need to exist
> > > > > anymore, because the virtual parallel engine already contains all the
> > > > > needed information.
> > > > > 
> > > > > And sure for some backends at least (definitely execlist) we'd need to
> > > > > create a bunch of additional virtual engines behind that virtual engine.
> > > > > But they'd be entirely hidden, and not visible to userspace nor the higher
> > > > > levels.
> > > > >
> > > > > What am I missing?
> > > > 
> > > > Not really, I think you got it. I think at the end of the day this really comes
> > > > down to whether we want to allow more than 1 parallel virtual engine per gem context. If
> > > > the answer is yes we collapse a parallel virtual engine into a single slot, if
> > > > not we leave as is.
> > > 
> > > Yup. So right now media uses one gem context per engine they need. Since
> > > media doesn't care about perf/OA they could get shared VM by sharing the
> > > VM across gem ctx, which they already do. So probably we could get away with
> > > leaving parallel engines as a gem ctx level thing.
> > > 
> > > Also on the media-driver code the impact is nil since it's just a
> > > different chain of context extensions in the same ioctl call.
> > > 
> > > Bigger picture is that Jason is quite unhappy with our gem ctx based
> > > uapi, and his long term idea is to make gem ctx into a pure container
> > > object with pointers to engines and a vm. And not something that has
> > > relevance itself. Currently that's not the case for perf/OA, which works
> > > on the gem ctx, and Jason's already unhappy about that one. So adding more
> > > stuff on the gem ctx level feels a bit like a mistake.
> > > 
> > > Cheers, Daniel
> > > 
> > > > 
> > > > Matt
> > > > 
> > > > > -Daniel
> > > > > 
> > > > > >  #define I915_DEFINE_CONTEXT_PARAM_ENGINES(name__, N__) struct { \
> > > > > >  	__u64 extensions; \
> > > > > >  	struct i915_engine_class_instance engines[N__]; \
> > > > > > -- 
> > > > > > 2.28.0
> > > > > > 
> > > > > > _______________________________________________
> > > > > > Intel-gfx mailing list
> > > > > > Intel-gfx@lists.freedesktop.org
> > > > > > https://lists.freedesktop.org/mailman/listinfo/intel-gfx
> > > > > 
> > > > > -- 
> > > > > Daniel Vetter
> > > > > Software Engineer, Intel Corporation
> > > > > http://blog.ffwll.ch
> > > 
> > > -- 
> > > Daniel Vetter
> > > Software Engineer, Intel Corporation
> > > http://blog.ffwll.ch
> 
> -- 
> Daniel Vetter
> Software Engineer, Intel Corporation
> http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [Intel-gfx] [RFC PATCH 4/5] drm/i915: Introduce 'set parallel submit' extension
@ 2021-05-17 17:46               ` Matthew Brost
  0 siblings, 0 replies; 41+ messages in thread
From: Matthew Brost @ 2021-05-17 17:46 UTC (permalink / raw)
  To: Daniel Vetter
  Cc: jason.ekstrand, daniel.vetter, intel-gfx, dri-devel, carl.zhang

On Mon, May 17, 2021 at 03:55:59PM +0200, Daniel Vetter wrote:
> On Fri, May 14, 2021 at 01:05:33PM -0700, Matthew Brost wrote:
> > On Wed, May 12, 2021 at 10:34:59AM +0200, Daniel Vetter wrote:
> > > On Tue, May 11, 2021 at 11:44:28AM -0700, Matthew Brost wrote:
> > > > On Tue, May 11, 2021 at 05:11:44PM +0200, Daniel Vetter wrote:
> > > > > On Thu, May 06, 2021 at 10:30:48AM -0700, Matthew Brost wrote:
> > > > > > i915_drm.h updates for 'set parallel submit' extension.
> > > > > > 
> > > > > > Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> > > > > > Cc: Tony Ye <tony.ye@intel.com>
> > > > > > CC: Carl Zhang <carl.zhang@intel.com>
> > > > > > Cc: Daniel Vetter <daniel.vetter@intel.com>
> > > > > > Cc: Jason Ekstrand <jason@jlekstrand.net>
> > > > > > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > > > > > ---
> > > > > >  include/uapi/drm/i915_drm.h | 126 ++++++++++++++++++++++++++++++++++++
> > > > > >  1 file changed, 126 insertions(+)
> > > > > > 
> > > > > > diff --git a/include/uapi/drm/i915_drm.h b/include/uapi/drm/i915_drm.h
> > > > > > index 26d2e135aa31..0175b12b33b8 100644
> > > > > > --- a/include/uapi/drm/i915_drm.h
> > > > > > +++ b/include/uapi/drm/i915_drm.h
> > > > > > @@ -1712,6 +1712,7 @@ struct drm_i915_gem_context_param {
> > > > > >   * Extensions:
> > > > > >   *   i915_context_engines_load_balance (I915_CONTEXT_ENGINES_EXT_LOAD_BALANCE)
> > > > > >   *   i915_context_engines_bond (I915_CONTEXT_ENGINES_EXT_BOND)
> > > > > > + *   i915_context_engines_parallel_submit (I915_CONTEXT_ENGINES_EXT_PARALLEL_SUBMIT)
> > > > > 
> > > > > Hm just relalized, but I don't think this hyperlinsk correctly, and I'm
> > > > > also not sure this formats very well as a nice list. Using item lists
> > > > > should look pretty nice like we're doing for the various kms properties,
> > > > > e.g.
> > > > > 
> > > > > FOO:
> > > > >   Explain what FOO does
> > > > > 
> > > > > BAR:
> > > > >   Explain what BAR does. struct bar also automatically generates a link
> > > > > 
> > > > > Please check with make htmldocs and polish this a bit (might need a small
> > > > > prep patch).
> > > > > 
> > > > 
> > > > I agree the doc should look nice. To get there I might need to chat with you on
> > > > IRC as I'm new to this. 
> > > > 
> > > > > >   */
> > > > > >  #define I915_CONTEXT_PARAM_ENGINES	0xa
> > > > > >  
> > > > > > @@ -1894,9 +1895,134 @@ struct i915_context_param_engines {
> > > > > >  	__u64 extensions; /* linked chain of extension blocks, 0 terminates */
> > > > > >  #define I915_CONTEXT_ENGINES_EXT_LOAD_BALANCE 0 /* see i915_context_engines_load_balance */
> > > > > >  #define I915_CONTEXT_ENGINES_EXT_BOND 1 /* see i915_context_engines_bond */
> > > > > > +#define I915_CONTEXT_ENGINES_EXT_PARALLEL_SUBMIT 2 /* see i915_context_engines_parallel_submit */
> > > > > >  	struct i915_engine_class_instance engines[0];
> > > > > >  } __attribute__((packed));
> > > > > >  
> > > > > > +/*
> > > > > > + * i915_context_engines_parallel_submit:
> > > > > > + *
> > > > > > + * Setup a gem context to allow multiple BBs to be submitted in a single execbuf
> > > > > > + * IOCTL. Those BBs will then be scheduled to run on the GPU in parallel.
> > > > > > + *
> > > > > > + * All hardware contexts in the engine set are configured for parallel
> > > > > > + * submission (i.e. once this gem context is configured for parallel submission,
> > > > > > + * all the hardware contexts, regardless if a BB is available on each individual
> > > > > > + * context, will be submitted to the GPU in parallel). A user can submit BBs to
> > > > > > + * subset of the hardware contexts, in a single execbuf IOCTL, but it is not
> > > > > > + * recommended as it may reserve physical engines with nothing to run on them.
> > > > > > + * Highly recommended to configure the gem context with N hardware contexts then
> > > > > > + * always submit N BBs in a single IOCTL.
> > > > > > + *
> > > > > > + * Their are two currently defined ways to control the placement of the
> > > > > > + * hardware contexts on physical engines: default behavior (no flags) and
> > > > > > + * I915_PARALLEL_IMPLICT_BONDS (a flag). More flags may be added the in the
> > > > > > + * future as new hardware / use cases arise. Details of how to use this
> > > > > > + * interface below above the flags.
> > > > > > + *
> > > > > > + * Returns -EINVAL if hardware context placement configuration invalid or if the
> > > > > > + * placement configuration isn't supported on the platform / submission
> > > > > > + * interface.
> > > > > > + * Returns -ENODEV if extension isn't supported on the platform / submission
> > > > > > + * inteface.
> > > > > > + */
> > > > > > +struct i915_context_engines_parallel_submit {
> > > > > > +	struct i915_user_extension base;
> > > > > 
> > > > > Ok this is good, since it makes sure we can't possible use this in
> > > > > CTX_SETPARAM.
> > > > > 
> > > > 
> > > > Yep, this is at context creation time. Technically you still can call this over
> > > > and over on the same gem context but Jason is taking that ability away I
> > > > believe. I've also told the media team to setup the context once and don't touch
> > > > it again.
> > > 
> > > Only if you base your context param on drm_i915_gem_context_param, which
> > > can be used both at create time with
> > > drm_i915_gem_context_create_ext_setparam and with the CTX_SETPARAM ioctl.
> > > But you don't, so this issue is fixed at the uapi design and doesn't need
> > > to interface with Jason's prot-ctx rework much.
> > > 
> > > There's still going to be some conflicts, so maybe ask Jason for a branch
> > > and rebase GuC on top of that for the next round.
> > > 
> > 
> > Certainly this new uAPI is going conflict. The basic GuC submission code
> > shouldn't though as it doesn't touch the uAPI code at all. By the time the new
> > uAPI is posted I'd hope Jason's proto-ctx rework has landed and will rebase
> > then on to the tip of DRM.
> 
> Ah yes. Another good reasons to split that up into two parts, like we've
> already planned to.
> 

Yep.

> > > > > > +
> > > > > > +/*
> > > > > > + * Default placement behvavior (currently unsupported):
> > > > > > + *
> > > > > > + * Rather than restricting parallel submission to a single class with a
> > > > > > + * logically contiguous placement (I915_PARALLEL_IMPLICT_BONDS), add a mode that
> > > > > > + * enables parallel submission across multiple engine classes. In this case each
> > > > > > + * context's logical engine mask indicates where that context can placed. It is
> > > > > > + * implied in this mode that all contexts have mutual exclusive placement (e.g.
> > > > > > + * if one context is running CS0 no other contexts can run on CS0).
> > > > > > + *
> > > > > > + * Example 1 pseudo code:
> > > > > > + * CSX[Y] = engine class X, logical instance Y
> > > > > > + * INVALID = I915_ENGINE_CLASS_INVALID, I915_ENGINE_CLASS_INVALID_NONE
> > > > > > + * set_engines(INVALID, INVALID)
> > > > > > + * set_load_balance(engine_index=0, num_siblings=2, engines=CS0[0],CS0[1])
> > > > > > + * set_load_balance(engine_index=1, num_siblings=2, engines=CS1[0],CS1[1])
> > > > > > + * set_parallel()
> > > > > > + *
> > > > > > + * Results in the following valid placements:
> > > > > > + * CS0[0], CS1[0]
> > > > > > + * CS0[0], CS1[1]
> > > > > > + * CS0[1], CS1[0]
> > > > > > + * CS0[1], CS1[1]
> > > > > > + *
> > > > > > + * Example 2 pseudo code:
> > > > > > + * CS[X] = generic engine of same class, logical instance X
> > > > > > + * INVALID = I915_ENGINE_CLASS_INVALID, I915_ENGINE_CLASS_INVALID_NONE
> > > > > > + * set_engines(INVALID, INVALID)
> > > > > > + * set_load_balance(engine_index=0, num_siblings=3, engines=CS[0],CS[1],CS[2])
> > > > > > + * set_load_balance(engine_index=1, num_siblings=3, engines=CS[0],CS[1],CS[2])
> > > > > > + * set_parallel()
> > > > > > + *
> > > > > > + * Results in the following valid placements:
> > > > > > + * CS[0], CS[1]
> > > > > > + * CS[0], CS[2]
> > > > > > + * CS[1], CS[0]
> > > > > > + * CS[1], CS[2]
> > > > > > + * CS[2], CS[0]
> > > > > > + * CS[2], CS[1]
> > > > > > + *
> > > > > > + * This enables a use case where all engines are created equally, we don't care
> > > > > > + * where they are scheduled, we just want a certain number of resources, for
> > > > > > + * those resources to be scheduled in parallel, and possibly across multiple
> > > > > > + * engine classes.
> > > > > > + */
> > > > > > +
> > > > > > +/*
> > > > > > + * I915_PARALLEL_IMPLICT_BONDS - Create implict bonds between each context.
> > > > > > + * Each context must have the same number sibling and bonds are implictly create
> > > > > > + * of the siblings.
> > > > > > + *
> > > > > > + * All of the below examples are in logical space.
> > > > > > + *
> > > > > > + * Example 1 pseudo code:
> > > > > > + * CS[X] = generic engine of same class, logical instance X
> > > > > > + * set_engines(CS[0], CS[1])
> > > > > > + * set_parallel(flags=I915_PARALLEL_IMPLICT_BONDS)
> > > > > > + *
> > > > > > + * Results in the following valid placements:
> > > > > > + * CS[0], CS[1]
> > > > > > + *
> 
> > > > > > + * CS[X] = generic engine of same class, logical instance X
> > > > > > + * INVALID = I915_ENGINE_CLASS_INVALID, I915_ENGINE_CLASS_INVALID_NONE
> > > > > > + * set_engines(INVALID, INVALID)
> > > > > > + * set_load_balance(engine_index=0, num_siblings=2, engines=CS[0],CS[2])
> > > > > > + * set_load_balance(engine_index=1, num_siblings=2, engines=CS[1],CS[3])
> > > > > > + * set_parallel(flags=I915_PARALLEL_IMPLICT_BONDS)
> > > > > > + *
> > > > > > + * Results in the following valid placements:
> > > > > > + * CS[0], CS[1]
> > > > > > + * CS[2], CS[3]
> > > > > > + *
> > > > > > + * This enables a use case where all engines are not equal and certain placement
> > > > > > + * rules are required (i.e. split-frame requires all contexts to be placed in a
> > > > > > + * logically contiguous order on the VCS engines on gen11+ platforms). This use
> > > > > > + * case (logically contiguous placement, within a single engine class) is
> > > > > > + * supported when using GuC submission. Execlist mode could support all possible
> > > > > > + * bonding configurations but currently doesn't support this extension.
> > > > > > + */
> > > > > > +#define I915_PARALLEL_IMPLICT_BONDS		(1<<0)
> > > > > > +/*
> > > > > > + * Do not allow BBs to be preempted mid BB rather insert coordinated preemption
> > > > > > + * points on all hardware contexts between each set of BBs. An example use case
> > > > > > + * of this feature is split-frame on gen11+ hardware. When using this feature a
> > > > > > + * BB must be submitted on each hardware context in the parallel gem context.
> > > > > > + * The execbuf2 IOCTL enforces the user adheres to policy.
> > > > > > + */
> > > > > > +#define I915_PARALLEL_NO_PREEMPT_MID_BATCH	(1<<1)
> > > > > > +#define I915_PARALLEL_UNKNOWN_FLAGS  (-(I915_PARALLEL_NO_PREEMPT_MID_BATCH << 1))
> > > > > > +	__u64 flags; /* all undefined flags must be zero */
> > > > > > +	__u64 mbz64[4]; /* reserved for future use; must be zero */
> > > > > > +} __attribute__ ((packed));
> > > > > 
> > > > > Ok I'm having some serious questions. This looks way too much like it's
> > > > > inspired by bonded submission, and given we're tossing bonded submission
> > > > > we need to make sure we're doing this for good independent reasons and not
> > > > > just for intertia.
> > > > > 
> > > > 
> > > > You are not wrong here, the bonding submission interface was a factor in
> > > > designing this interface.
> > > > 
> > > > > What I expected looking at how media-driver uses bonded submit currently
> > > > > is:
> > > > > 
> > > > > - We create a parallel submit engine, which occupies a virtual engine
> > > > >   slot. This parallel virtual engine contains all the information we need,
> > > > >   i.e. the flags you have above, but also how many engines run in parallel
> > > > >   and how each of those can be load-balanced. So probably a full NxM
> > > > >   matrix of physical engines needed.
> > > > > 
> > > > 
> > > > Internally we need all this information broken out into individual structures,
> > > > at least with the current implementation. We need N ring buffers, N timelines, N
> > > > LRCs, N HWSPs, etc... All of this is encapsulated by a 'struct intel_context'
> > > > which occupies a slot. Could we create a super object with N 'struct
> > > > intel_context', sure. I'm just not sure what that buys us and IMO creates an
> > > > inconsistent uAPI.
> > > 
> > > So if the implementation is too much work to adapt, here's a really nasty
> > > trick: Currently we limit the engine slots to 64 in a gem context, because
> > > that's the limit of the execbuf field. We could use the engine slots above
> > > that for all these additional intel_context that we need underneath, at
> > > least for execlist. Does GuC need them all too?
> > > 
> > > But clean approach would be to have an intel_parallal_engine struct which
> > > has all these pointers internally I think.
> > > 
> > > Same on the high-level execbuf flow, doing all that N times is silly. So
> > > again I'd assume there's one overall i915_request that tracks the parallel
> > > submission, and then maybe N subordinate i915_request for each piece
> > > (execlist backend definitely needs those for scheduling, I didn't check
> > > about GuC).
> > > 
> > > Also drm/scheduler only deals with a single thing too, so that way the
> > > high level code would never need to know that there's actually N things
> > > underneath doing the job.
> > >
> > 
> > Again each i915_request points to a single (and different) intel_context,
> > timeline, lrc, ring, seqno, etc... The whole stack really treats these as
> > individual things aside from the excl slot where we form a composite fence. Not
> > saying we couldn't change this over time but initially creating a
> > 'i915_super_request' would be quite the undertaking, very invasive to the mid
> > layers of the stack, and not sure in the end what it buys us.
> > 
> > Once the parallel submit gets posted you will be able to see that it is a uAPI
> > context setup extension, updates the execbuf IOCTL to accept N batches which is
> > basically a for loop, and GuC backend being able to submit N batches at once -
> > the mid layers are almost completely untouched.
> > 
> > Lastly, if we need to support the parallel submit extension as purposed for
> > execlists, all we need to do is update the uAPI setup extension to configure the
> > contexts. If we create a 'i915_super_request' we would have a massive rework in
> > execlist backend too.
> 
> Yeah I'm fully aware that the current codebase puts us in a very awkward
> corner. But also designing uapi by exposing whatever we have internally
> right now is also not a good idea.
> 

Agree our internals shouldn't dictate our uAPI.

> That's why I've suggested the idea to make the uapi use a single uapi
> engine on the gem context, and (for now at least) internally fake it all.
> Including the glorious for() loop over everything in execbuf.
> 
> > > > > - Execbuf uses that parallel virtual engine to submit all N batchbuffers
> > > > >   in one go.
> > > > > 
> > > > 
> > > > If we expose 1 or N engines it doesn't really matter, does it? Either way the
> > > > entire GEM context is configured for N BBs in a single IOCTL.
> > > > 
> > > > > - This means we don't create virtual engines (or physical engine mappings)
> > > > >   for all the individual pieces in a parallel engine. That's a concept
> > > > >   from bonded submission, and I think that needs to go.
> > > > > 
> > > > 
> > > > Again this isn't strickly true - we need N internal backing structures.
> > > 
> > > I didn't check the code, but iirc you said for the GuC backend you do
> > > nothing until the last submit. Only then it's pushed into the GuC. That
> > > sounds a bit silly, and by treating parallel submission as a single thing
> > > (which might or mightnot be split in lower levels) this would go away.
> > >
> > 
> > We update internal state on each submit, the last submit is the one to interact
> > with the GuC.
> 
> Sounds very much like sunk cost fallacy driven implementation design, but
> oh well.
> 
> > > But it also might be way too much churn, because there's a bunch of places
> > > where we have to do this splitting. If it's all, then maybe just keeping
> > > the engines around everywhere makes sense.
> > > 
> > > But also this is leaking implementation details into uapi, from umd pov
> > > it's really 1 virtual engine that gets 1 execbuf call to submit N batches.
> > > Leaking that we treat it as N engines underneath feels like a mistake.
> > >
> > 
> > Too be clear, changing from N slots to 1 slot isn't that big of a deal. Changing
> > from N i915_requests to 1 is a *huge* deal.
> > 
> > N slots to 1 slots will just touch the uAPI setup extension and the execbuf
> > IOCTL.
> > 
> > N i915_requests to 1 will ripple thoughout the entire stack. 
> 
> Yeah I think going to 1 i915_request is something we need to postpone and
> decide later on whether it makes sense or not.
> 

Glad we are on the same page about this - we can revisit this later.

> But making sure the uapi isn't putting roadblocks in that way is something
> we need to fix now. And I do think in a clean slate world, ignoring all
> the code we have and especially the current midlayer and execlist backend
> code, a single ctx/request/execbuf is the right design here. Or well,
> would have been.
> 
> Except if you now tell me that GuC actually wants N submissions, but my
> understanding is it really just wants 1. And the only fan-out we have to
> do is plug the right N batchbuffers into the right N LRC of the overall
> GuC mutli-LRC context. But that seems not the case.
> 
> > > > > - More important not having a parallel virtual engine breaks our already
> > > > >   badly confusing gem ctx api. Ignoring parallel/bonded submit the gem ctx
> > > > >   is just a container object, which points at a bunch of engines (plus the
> > > > >   VM and a few other things). Having parallel context something that sits
> > > > >   at the gem ctx level, and not as an individual engine (of which you can
> > > > >   have multiple in the same gem ctx) breaks stuff. E.g. right the perf api
> > > > >   sits at the gem ctx level, so that you can capture all the perf data for
> > > > >   an entire workload spawning across multiple engines. If a workload now
> > > > >   needs multiple parallel engines we'd need multiple gem ctx, which breaks
> > > > >   this.
> > > > 
> > > > This uAPI allows only 1 parallel context per gem context which isn't ideal. I'd
> > > > love to fix this and changing a context to a single slot might be able to fix
> > > > this.
> > > 
> > > Yeah this is essentially the main gripe I have with this. Everywhere else
> > > you submit to a (gem_ctx_id, engine_slot) pair. Except for parallel
> > > submit, where you submit to a gem_ctx_id and the engine slot doesn't
> > > matter. That's a rather unfortunate uapi.
> > > 
> > 
> > Yea this isn't ideal but we've kinda backed ourselves into a corner here at
> > least consistency wise.
> > 
> > As purposed we basically have 2 steps to configure a gem context:
> > 
> > 1. Define placement rules (set_engines, set_load_balance)
> > 2. Indicate this context is used for parallel submission (set_parallel)
> > 
> > What would the the uAPI look like where a each parallel context occupies a slot?
> > 
> > 1. Define a the number of slots (set_engines)
> > 2. For each slot allow a virtual or parallel context (set_load_balance,
> > set_parallel)
> > 
> > The set_parallel would have to contain all the placement information for 2 to N
> > contexts, right? So each set_parallel is chained extension too. Now we have a
> > two level chain in our IOCTL.
> > 
> > e.g.
> > 
> > set_engines (3 slots) -> set_load_balance (slot 0) -> set_parallel (slot 1) ->                                                            -> set_load_balance (slot 2)
> > 									     | 								  |
> > 									     > placement for context 0 -> placement for context 1, etc...->	
> > 
> > IMO this seems like a bigger mess but I suppose it would work.
> 
> This sounds a bit like overengineering. All we need for the parallel
> virtual engines are a bunch of parameters (engine slot, num_slots,
> num_siblings) and then a num_slots X num_siblings array at the end with
> the placements for all combinations.
> 
> i915_user_extensions already allow for that array at the end (that's why
> extensions are chained with pointers, not by being one-after-the-other in
> an array), so this already works as-is. See e.g. how set_load_balance or
> set_engines work, they already do that.
> 
> So purely from an uapi pov I'm not seeing the trouble?
>

Ah, I forgot that each user extension can be variable sized. Making each
parallel engine into 1 slot and single setup extension does indeed make sense
and should be easy enough to do. I'll get a quick PoC of this internally to
flush out any issues before my next RFC post with all the uAPI details.
 
> > As you say below gem contexts are moving towards just being containers so does
> > it really matter if a UMD has to create a gem context per a parallel context?
> > They still can share address space, pass fences between them, etc...
> 
> Mostly my concern is that for everything else we execute stuff on an
> intel_context and track it with an i915_request. Except for parallel
> submission, where we have N-1 fake intel_context and fake i915_request,
> and only the last one of each is actually triggering submission.
> 
> That's awkward code design at best, and I'd like to have the option at
> least we can fix it in the future.
> 
> Of course we can fix it all without changing the uapi, like Jason is doing
> with the proto ctx rework. But that always comes at the cost of code and
> complexity, which isn't strictly needed here I think.
> 
> > If we really want to go the direction of 1 slot per parallel context I can hack
> > up a PoC branch when I have time. 
> 
> I think the only impact of the minimal plan would be:
> 
> - slight change to the media-driver setup function, but that's already
>   fully encapsulated (at least from a cursory look)
> 
> - creating a pile of fake contexts so that current execbuf code and
>   scheduler midlayers don't panic, at engine offsets userspace can't see
>   them
> 
> - some changes in the execbuf code to get at all the fake contexts and
>   figure out how many batchbuffers we need
> 
> None of this should be giantic, but it keeps the door open that we can fix
> the internals properly, without having to carry a compat layer around
> forever.
> 
> Thoughts? If you think "not worth at all" and Jason concurs then I'm happy
> to let this one slide.

I agree with this plan, let's execute it.

Matt

> 
> Cheers, Daniel
> 
> > 
> > Matt
> > 
> > > Now with bonded submit this made some sense (not that bonded submit itself
> > > made much sense), since you did indeed submit N batchbuffers to N
> > > (gem_ctx_id, engine_slot) pairs. But with parallel submit it's really just
> > > one execbuf call.
> > > 
> > > > > So what I'd expect we'd have here is roughly:
> > > > > 
> > > > > struct i915_context_engines_parallel_submit {
> > > > > 	struct i915_user_extension base;
> > > > > 	__u64 flags;
> > > > > 	__u32 num_engines; /* N, must match what we submit in the execbuf */
> > > > > 	__u32 num_siblings; /* M, I'm assuming it's ok we require that siblings must match across the entire set of parallel engines */
> > > > > 	struct engine_info[]; /* NxM array of engine infos, pls fill in the right struct name :-) */
> > > > > };
> > > > > 
> > > > > If we then also require that you always submit the full width of N
> > > > > batchbuffers then even the execbuf extension doesn't need to exist
> > > > > anymore, because the virtual parallel engine already contains all the
> > > > > needed information.
> > > > > 
> > > > > And sure, for some backends at least (definitely execlists) we'd need to
> > > > > create a bunch of additional virtual engines behind that virtual engine.
> > > > > But they'd be entirely hidden, and not visible to userspace or to the
> > > > > higher levels.
> > > > >
> > > > > What am I missing?
> > > > 
> > > > Not really, I think you got it. I think at the end of the day this really
> > > > comes down to whether we want to allow more than 1 parallel virtual engine
> > > > per gem context. If the answer is yes, we collapse a parallel virtual engine
> > > > into a single slot; if not, we leave it as is.
> > > 
> > > Yup. So right now media uses one gem context per engine they need. Since
> > > media doesn't care about perf/OA, they can get a shared VM by sharing the
> > > VM across gem ctxs, which they already do. So we could probably get away
> > > with leaving parallel engines as a gem-ctx-level thing.
> > > 
> > > Also on the media-driver code the impact is nil since it's just a
> > > different chain of context extensions in the same ioctl call.
> > > 
> > > Bigger picture is that Jason is quite unhappy with our gem-ctx-based
> > > uapi, and his long-term idea is to make gem ctx into a pure container
> > > object with pointers to engines and a vm, and not something that has
> > > relevance itself. Currently that's not the case for perf/OA, which works
> > > on the gem ctx, and Jason's already unhappy about that one. So adding more
> > > stuff at the gem ctx level feels a bit like a mistake.
> > > 
> > > Cheers, Daniel
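
To make the "different chain of context extensions in the same ioctl call"
above a bit more concrete, here is a minimal userspace sketch. The parallel
extension id, name and layout are hypothetical (matching the rough sketch
further up); the context-create / set-engines plumbing is the existing uAPI:

#include <stdint.h>
#include <sys/ioctl.h>
#include <drm/i915_drm.h>

/* Placeholder id for the hypothetical parallel-submit engines extension. */
#define I915_CONTEXT_ENGINES_EXT_PARALLEL_SUBMIT 2

static int create_parallel_ctx(int fd, uint32_t *ctx_id)
{
        /* Sized copy of the hypothetical extension: width = 2, num_siblings = 1. */
        struct {
                struct i915_user_extension base;
                __u16 engine_index, width, num_siblings, mbz16;
                __u64 flags, mbz64[3];
                struct i915_engine_class_instance engines[2];
        } parallel = {
                .base.name = I915_CONTEXT_ENGINES_EXT_PARALLEL_SUBMIT,
                .engine_index = 0,      /* takes over slot 0 */
                .width = 2,
                .num_siblings = 1,
                .engines = {
                        { I915_ENGINE_CLASS_VIDEO, 0 }, /* placement for context 0 */
                        { I915_ENGINE_CLASS_VIDEO, 1 }, /* placement for context 1 */
                },
        };
        /* One engine slot; it's virtual, so mark it invalid as set_load_balance does. */
        I915_DEFINE_CONTEXT_PARAM_ENGINES(engine_map, 1) = {
                .extensions = (uintptr_t)&parallel,
                .engines = { { I915_ENGINE_CLASS_INVALID,
                               I915_ENGINE_CLASS_INVALID_NONE } },
        };
        struct drm_i915_gem_context_create_ext_setparam p_engines = {
                .base = { .name = I915_CONTEXT_CREATE_EXT_SETPARAM },
                .param = {
                        .param = I915_CONTEXT_PARAM_ENGINES,
                        .size = sizeof(engine_map),
                        .value = (uintptr_t)&engine_map,
                },
        };
        struct drm_i915_gem_context_create_ext create = {
                .flags = I915_CONTEXT_CREATE_FLAGS_USE_EXTENSIONS,
                .extensions = (uintptr_t)&p_engines,
        };

        if (ioctl(fd, DRM_IOCTL_I915_GEM_CONTEXT_CREATE_EXT, &create))
                return -1;
        *ctx_id = create.ctx_id;
        return 0;
}

From the UMD's point of view nothing changes structurally versus today's
load-balance setup: it is still one gem context, one engines array and one
chained extension, just with an N x M placement array instead of 1 x M.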
> > > 
> > > > 
> > > > Matt
> > > > 
> > > > > -Daniel
> > > > > 
> > > > > >  #define I915_DEFINE_CONTEXT_PARAM_ENGINES(name__, N__) struct { \
> > > > > >  	__u64 extensions; \
> > > > > >  	struct i915_engine_class_instance engines[N__]; \
> > > > > > -- 
> > > > > > 2.28.0
> > > > > > 
> > > > > 
> > > > > -- 
> > > > > Daniel Vetter
> > > > > Software Engineer, Intel Corporation
> > > > > http://blog.ffwll.ch
> > > 
> > > -- 
> > > Daniel Vetter
> > > Software Engineer, Intel Corporation
> > > http://blog.ffwll.ch
> 
> -- 
> Daniel Vetter
> Software Engineer, Intel Corporation
> http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 41+ messages in thread

end of thread, other threads:[~2021-05-17 17:53 UTC | newest]

Thread overview: 41+ messages (download: mbox.gz / follow: Atom feed)
2021-05-06 17:30 [RFC PATCH 0/5] GuC submission / DRM scheduler integration plan + new uAPI Matthew Brost
2021-05-06 17:30 ` [Intel-gfx] " Matthew Brost
2021-05-06 17:27 ` [Intel-gfx] ✗ Fi.CI.BUILD: failure for " Patchwork
2021-05-06 17:30 ` [RFC PATCH 1/5] drm/doc/rfc: i915 GuC submission / DRM scheduler integration plan Matthew Brost
2021-05-06 17:30   ` [Intel-gfx] " Matthew Brost
2021-05-11 14:34   ` Daniel Vetter
2021-05-11 14:34     ` Daniel Vetter
2021-05-11 14:58     ` Daniel Stone
2021-05-11 14:58       ` Daniel Stone
2021-05-11 15:12       ` Daniel Vetter
2021-05-11 15:12         ` Daniel Vetter
2021-05-06 17:30 ` [RFC PATCH 2/5] drm/doc/rfc: i915 new parallel submission uAPI plan Matthew Brost
2021-05-06 17:30   ` [Intel-gfx] " Matthew Brost
2021-05-11 14:49   ` Daniel Vetter
2021-05-11 14:49     ` Daniel Vetter
2021-05-11 17:51     ` Matthew Brost
2021-05-11 17:51       ` Matthew Brost
2021-05-06 17:30 ` [RFC PATCH 3/5] drm/i915: Expose logical engine instance to user Matthew Brost
2021-05-06 17:30   ` [Intel-gfx] " Matthew Brost
2021-05-11 14:53   ` Daniel Vetter
2021-05-11 14:53     ` [Intel-gfx] " Daniel Vetter
2021-05-06 17:30 ` [RFC PATCH 4/5] drm/i915: Introduce 'set parallel submit' extension Matthew Brost
2021-05-06 17:30   ` [Intel-gfx] " Matthew Brost
2021-05-11 15:11   ` Daniel Vetter
2021-05-11 15:11     ` Daniel Vetter
2021-05-11 18:44     ` Matthew Brost
2021-05-11 18:44       ` Matthew Brost
2021-05-12  8:34       ` Daniel Vetter
2021-05-12  8:34         ` Daniel Vetter
2021-05-14 20:05         ` Matthew Brost
2021-05-14 20:05           ` Matthew Brost
2021-05-17 13:55           ` Daniel Vetter
2021-05-17 13:55             ` Daniel Vetter
2021-05-17 17:46             ` Matthew Brost
2021-05-17 17:46               ` Matthew Brost
2021-05-06 17:30 ` [RFC PATCH 5/5] drm/i915: Update execbuf IOCTL to accept N BBs Matthew Brost
2021-05-06 17:30   ` [Intel-gfx] " Matthew Brost
2021-05-11 15:13   ` Daniel Vetter
2021-05-11 15:13     ` Daniel Vetter
2021-05-11 18:01     ` Matthew Brost
2021-05-11 18:01       ` Matthew Brost
