* [Intel-gfx] [RFC v3 0/3] drm/doc/rfc: i915 VM_BIND feature design + uapi
From: Niranjana Vishwanathapura @ 2022-05-17 18:32 UTC
  To: intel-gfx, dri-devel, daniel.vetter
  Cc: matthew.brost, thomas.hellstrom, jason, chris.p.wilson, christian.koenig

This is the i915 driver VM_BIND feature design RFC patch series along
with the required uapi definition and description of intended use cases.

v2: Updated design and uapi, more documentation.
v3: Add more documentation and proper kernel-doc formatting with cross
    references (including missing i915_drm uapi kernel-docs which are
    required) as per review comments from Daniel.

Signed-off-by: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>

Niranjana Vishwanathapura (3):
  drm/doc/rfc: VM_BIND feature design document
  drm/i915: Update i915 uapi documentation
  drm/doc/rfc: VM_BIND uapi definition

 Documentation/driver-api/dma-buf.rst   |   2 +
 Documentation/gpu/rfc/i915_vm_bind.h   | 399 +++++++++++++++++++++++++
 Documentation/gpu/rfc/i915_vm_bind.rst | 304 +++++++++++++++++++
 Documentation/gpu/rfc/index.rst        |   4 +
 include/uapi/drm/i915_drm.h            | 153 +++++++---
 5 files changed, 825 insertions(+), 37 deletions(-)
 create mode 100644 Documentation/gpu/rfc/i915_vm_bind.h
 create mode 100644 Documentation/gpu/rfc/i915_vm_bind.rst

-- 
2.21.0.rc0.32.g243a4c7e27


* [Intel-gfx] [RFC v3 1/3] drm/doc/rfc: VM_BIND feature design document
From: Niranjana Vishwanathapura @ 2022-05-17 18:32 UTC
  To: intel-gfx, dri-devel, daniel.vetter
  Cc: matthew.brost, thomas.hellstrom, jason, chris.p.wilson, christian.koenig

VM_BIND design document with description of intended use cases.

v2: Add more documentation and format as per review comments
    from Daniel.

Signed-off-by: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>
---
 Documentation/driver-api/dma-buf.rst   |   2 +
 Documentation/gpu/rfc/i915_vm_bind.rst | 304 +++++++++++++++++++++++++
 Documentation/gpu/rfc/index.rst        |   4 +
 3 files changed, 310 insertions(+)
 create mode 100644 Documentation/gpu/rfc/i915_vm_bind.rst

diff --git a/Documentation/driver-api/dma-buf.rst b/Documentation/driver-api/dma-buf.rst
index 36a76cbe9095..64cb924ec5bb 100644
--- a/Documentation/driver-api/dma-buf.rst
+++ b/Documentation/driver-api/dma-buf.rst
@@ -200,6 +200,8 @@ DMA Fence uABI/Sync File
 .. kernel-doc:: include/linux/sync_file.h
    :internal:
 
+.. _indefinite_dma_fences:
+
 Indefinite DMA Fences
 ~~~~~~~~~~~~~~~~~~~~~
 
diff --git a/Documentation/gpu/rfc/i915_vm_bind.rst b/Documentation/gpu/rfc/i915_vm_bind.rst
new file mode 100644
index 000000000000..f1be560d313c
--- /dev/null
+++ b/Documentation/gpu/rfc/i915_vm_bind.rst
@@ -0,0 +1,304 @@
+==========================================
+I915 VM_BIND feature design and use cases
+==========================================
+
+VM_BIND feature
+================
+The DRM_I915_GEM_VM_BIND/UNBIND ioctls allow a UMD to bind/unbind GEM buffer
+objects (BOs) or sections of a BO at specified GPU virtual addresses on a
+specified address space (VM). These mappings (also referred to as persistent
+mappings) will be persistent across multiple GPU submissions (execbuff calls)
+issued by the UMD, without the user having to provide a list of all required
+mappings during each submission (as required by the older execbuff mode).
+
+VM_BIND/UNBIND ioctls will support 'in' and 'out' fences to allow userspace
+to specify how the binding/unbinding should sync with other operations
+like the GPU job submission. These fences will be timeline 'drm_syncobj's
+for non-Compute contexts (See struct drm_i915_vm_bind_ext_timeline_fences).
+For Compute contexts, they will be user/memory fences (See struct
+drm_i915_vm_bind_ext_user_fence).
+
+The VM_BIND feature is advertised to the user via I915_PARAM_HAS_VM_BIND.
+The user has to opt in to the VM_BIND mode of binding for an address space
+(VM) at VM creation time via the I915_VM_CREATE_FLAGS_USE_VM_BIND extension.
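+
+A minimal userspace sketch of this opt-in flow (assuming a valid DRM fd and
+eliding error handling):
+
+.. code-block:: c
+
+   int has_vm_bind = 0;
+   drm_i915_getparam_t gp = {
+           .param = I915_PARAM_HAS_VM_BIND,
+           .value = &has_vm_bind,
+   };
+   drmIoctl(fd, DRM_IOCTL_I915_GETPARAM, &gp);
+
+   if (has_vm_bind) {
+           /* Opt this address space into VM_BIND mode at creation time. */
+           struct drm_i915_gem_vm_control ctl = {
+                   .flags = I915_VM_CREATE_FLAGS_USE_VM_BIND,
+           };
+           drmIoctl(fd, DRM_IOCTL_I915_GEM_VM_CREATE, &ctl);
+           /* ctl.vm_id now identifies the VM_BIND address space. */
+   }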
+
+The VM_BIND/UNBIND ioctls will immediately start binding/unbinding the mapping
+in an async worker. The binding and unbinding will work like a special GPU
+engine. The binding and unbinding operations are serialized and will wait on
+the specified input fences before the operation and will signal the output
+fences upon completion of the operation. Due to serialization, completion of
+an operation also indicates that all previous operations are complete.
+
+VM_BIND features include:
+
+* Multiple Virtual Address (VA) mappings can map to the same physical pages
+  of an object (aliasing).
+* VA mapping can map to a partial section of the BO (partial binding).
+* Support capture of persistent mappings in the dump upon GPU error.
+* TLB is flushed upon unbind completion. Batching of TLB flushes in some
+  use cases will be helpful.
+* Asynchronous vm_bind and vm_unbind support with 'in' and 'out' fences.
+* Support for userptr gem objects (no special uapi is required for this).
+
+Execbuff ioctl in VM_BIND mode
+-------------------------------
+The execbuff ioctl handling in VM_BIND mode differs significantly from the
+older method. A VM in VM_BIND mode will not support the older execbuff mode of
+binding. In VM_BIND mode, the execbuff ioctl will not accept any execlist and,
+hence, there is no support for implicit sync. It is expected that the work
+below will be able to support the object dependency setting requirements of
+all use cases:
+
+"dma-buf: Add an API for exporting sync files"
+(https://lwn.net/Articles/859290/)
+
+This also means we need an execbuff extension to pass in the batch
+buffer addresses (See struct drm_i915_gem_execbuffer_ext_batch_addresses).
+
+If execlist support in the execbuff ioctl is deemed necessary for
+implicit sync in certain use cases, it can be added later.
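+
+A hypothetical submission sketch (the exact layout of struct
+drm_i915_gem_execbuffer_ext_batch_addresses is defined by the uapi patch;
+here it is only assumed to chain like any other i915_user_extension):
+
+.. code-block:: c
+
+   struct i915_user_extension batch_ext = {
+           .name = DRM_I915_GEM_EXECBUFFER_EXT_BATCH_ADDRESSES,
+           /* the batch buffer GPU virtual address(es) would follow */
+   };
+   struct drm_i915_gem_execbuffer2 eb = {
+           .buffer_count = 0,          /* no execlist in VM_BIND mode */
+           .batch_start_offset = 0,
+           .batch_len = 0,
+           .flags = I915_EXEC_USE_EXTENSIONS,
+           .cliprects_ptr = (uintptr_t)&batch_ext,
+   };
+   i915_execbuffer2_set_context_id(eb, ctx_id);
+   drmIoctl(fd, DRM_IOCTL_I915_GEM_EXECBUFFER2, &eb);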
+
+In VM_BIND mode, VA allocation is completely managed by the user instead of
+the i915 driver. Hence, VA assignment and eviction are not applicable in
+VM_BIND mode. Also, for determining object activeness, VM_BIND mode will not
+use the i915_vma active reference tracking; it will instead use the dma-resv
+object for that (See `VM_BIND dma_resv usage`_).
+
+So, a lot of existing code in the execbuff path, like relocations, VA
+evictions, the vma lookup table, implicit sync, vma active reference tracking,
+etc., is not applicable in VM_BIND mode. Hence, the execbuff path needs to be
+cleaned up by clearly separating out the functionalities where VM_BIND mode
+differs from the older method and moving them to separate files.
+
+VM_PRIVATE objects
+-------------------
+By default, BOs can be mapped on multiple VMs and can also be dma-buf
+exported. Hence these BOs are referred to as Shared BOs.
+During each execbuff submission, the request fence must be added to the
+dma-resv fence list of all shared BOs mapped on the VM.
+
+The VM_BIND feature introduces an optimization where the user can create a BO
+which is private to a specified VM via the I915_GEM_CREATE_EXT_VM_PRIVATE flag
+during BO creation. Unlike Shared BOs, these VM private BOs can only be mapped
+on the VM they are private to and can't be dma-buf exported.
+All private BOs of a VM share the dma-resv object, hence during each execbuff
+submission only one dma-resv fence list needs updating. Thus, the fast path
+(where required mappings are already bound) submission latency is O(1)
+w.r.t. the number of VM private BOs.
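+
+A hypothetical creation sketch (assuming the I915_GEM_CREATE_EXT_VM_PRIVATE
+flag named above; how the target VM is identified at creation time is defined
+by the uapi patch and elided here):
+
+.. code-block:: c
+
+   struct drm_i915_gem_create_ext create = {
+           .size = 2ull << 20,         /* a 2 MiB BO */
+           .flags = I915_GEM_CREATE_EXT_VM_PRIVATE,
+   };
+   drmIoctl(fd, DRM_IOCTL_I915_GEM_CREATE_EXT, &create);
+   /* create.handle can only be mapped on the VM it is private to
+    * and cannot be dma-buf exported.
+    */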
+
+VM_BIND locking hierarchy
+-------------------------
+The locking design here supports the older (execlist based) execbuff mode, the
+newer VM_BIND mode, the VM_BIND mode with GPU page faults and possible future
+system allocator support (See `Shared Virtual Memory (SVM) support`_).
+The older execbuff mode and the newer VM_BIND mode without page faults manage
+residency of backing storage using dma_fence. The VM_BIND mode with page
+faults and the system allocator support do not use any dma_fence at all.
+
+VM_BIND locking order is as below.
+
+1) Lock-A: A vm_bind mutex will protect vm_bind lists. This lock is taken in
+   vm_bind/vm_unbind ioctl calls, in the execbuff path and while releasing the
+   mapping.
+
+   In the future, when GPU page faults are supported, we can potentially use
+   a rwsem instead, so that multiple page fault handlers can take the read
+   side lock to look up the mapping and hence can run in parallel.
+   The older execbuff mode of binding does not need this lock.
+
+2) Lock-B: The object's dma-resv lock will protect i915_vma state and needs to
+   be held while binding/unbinding a vma in the async worker and while updating
+   dma-resv fence list of an object. Note that private BOs of a VM will all
+   share a dma-resv object.
+
+   The future system allocator support will use the HMM prescribed locking
+   instead.
+
+3) Lock-C: Spinlock/s to protect some of the VM's lists like the list of
+   invalidated vmas (due to eviction and userptr invalidation) etc.
+
+When GPU page faults are supported, the execbuff path does not take any of
+these locks. There we will simply smash the new batch buffer address into the
+ring and then tell the scheduler to run it. The lock taking only happens from
+the page fault handler, where we take lock-A in read mode, whichever lock-B we
+need to find the backing storage (the dma_resv lock for gem objects, and
+hmm/core mm for the system allocator) and some additional locks (lock-D) for
+taking care of page table races. Page fault mode should never need to
+manipulate the vm lists, so it won't ever need lock-C.
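+
+An illustrative sketch of the Lock-A -> Lock-B ordering in the async bind
+worker (hypothetical field names; error handling elided):
+
+.. code-block:: c
+
+   /* Lock-A: protects the vm_bind lists */
+   mutex_lock(&vm->vm_bind_lock);
+
+   /* Lock-B: dma-resv lock protects i915_vma state; all private BOs
+    * of the VM share one dma-resv object.
+    */
+   dma_resv_lock(obj->base.resv, NULL);
+
+   /* ... bind/unbind the vma, update the dma-resv fence list ... */
+
+   dma_resv_unlock(obj->base.resv);
+   mutex_unlock(&vm->vm_bind_lock);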
+
+VM_BIND LRU handling
+---------------------
+We need to ensure VM_BIND mapped objects are properly LRU tagged to avoid
+performance degradation. We will also need support for bulk LRU movement of
+VM_BIND objects to avoid additional latencies in the execbuff path.
+
+The page table pages are similar to VM_BIND mapped objects (See
+`Evictable page table allocations`_); they are maintained per VM and need to
+be pinned in memory when the VM is made active (i.e., upon an execbuff call
+with that VM). So, bulk LRU movement of page table pages is also needed.
+
+The i915 shrinker LRU has stopped being an LRU. So, it should also be moved
+over to the ttm LRU in some fashion to make sure we once again have a reasonable
+and consistent memory aging and reclaim architecture.
+
+VM_BIND dma_resv usage
+-----------------------
+Fences need to be added to all VM_BIND mapped objects. During each execbuff
+submission, they are added with DMA_RESV_USAGE_BOOKKEEP usage to prevent
+over-sync (See enum dma_resv_usage). One can override it with either
+DMA_RESV_USAGE_READ or DMA_RESV_USAGE_WRITE usage during object dependency
+setting (through either an explicit or implicit mechanism).
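+
+A sketch of that per-submission bookkeeping with the standard dma-resv APIs
+(error handling elided):
+
+.. code-block:: c
+
+   dma_resv_lock(resv, NULL);
+   if (!dma_resv_reserve_fences(resv, 1))
+           /* BOOKKEEP by default; READ/WRITE only when a dependency
+            * is explicitly or implicitly set on the object.
+            */
+           dma_resv_add_fence(resv, &rq->fence, DMA_RESV_USAGE_BOOKKEEP);
+   dma_resv_unlock(resv);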
+
+When vm_bind is called for a non-private object while the VM is already
+active, the fences need to be copied from the VM's shared dma-resv object
+(common to all private objects of the VM) to this non-private object.
+If this results in performance degradation, then some optimization will
+be needed here. This is not a problem for the VM's private objects, as they
+use the shared dma-resv object, which is always updated on each execbuff
+submission.
+
+Also, in VM_BIND mode, use the dma-resv APIs for determining object activeness
+(See dma_resv_test_signaled() and dma_resv_wait_timeout()) and do not use the
+older i915_vma active reference tracking, which is deprecated. This should be
+easier to get working with the current TTM backend. We can then remove the
+i915_vma active reference tracking fully while supporting the TTM backend for
+igfx.
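+
+For example, activeness can then be queried purely through dma-resv:
+
+.. code-block:: c
+
+   /* true once all fences, including BOOKKEEP usage, have signaled */
+   bool idle = dma_resv_test_signaled(resv, DMA_RESV_USAGE_BOOKKEEP);
+
+   /* or block (interruptibly) for up to one second */
+   long ret = dma_resv_wait_timeout(resv, DMA_RESV_USAGE_BOOKKEEP,
+                                    true, HZ);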
+
+Evictable page table allocations
+---------------------------------
+Make pagetable allocations evictable and manage them similarly to VM_BIND
+mapped objects. Page table pages are similar to persistent mappings of a
+VM (the differences are that page table pages will not have an i915_vma
+structure and that, after swapping pages back in, the parent page link needs
+to be updated).
+
+Mesa use case
+--------------
+VM_BIND can potentially reduce the CPU overhead in Mesa (both Vulkan and Iris),
+hence improving performance of CPU-bound applications. It also allows us to
+implement Vulkan's Sparse Resources. With increasing GPU hardware performance,
+reducing CPU overhead becomes more impactful.
+
+
+VM_BIND Compute support
+========================
+
+User/Memory Fence
+------------------
+The idea is to take a user specified virtual address and install an interrupt
+handler to wake up the current task when the memory location passes the user
+supplied filter. A User/Memory fence is an <address, value> pair. To signal
+the user fence, the specified value will be written at the specified virtual
+address and the waiting process will be woken up. The user can wait on a user
+fence with the gem_wait_user_fence ioctl.
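+
+Conceptually (a hypothetical polling illustration only; real waits go through
+the gem_wait_user_fence ioctl and an interrupt-driven wakeup):
+
+.. code-block:: c
+
+   struct user_fence {
+           uint64_t *addr;     /* user specified virtual address */
+           uint64_t value;     /* value written when signaling   */
+   };
+
+   static void user_fence_signal(struct user_fence *uf)
+   {
+           /* Writing the value at the address signals the fence; the
+            * installed interrupt handler then wakes the waiting task.
+            */
+           __atomic_store_n(uf->addr, uf->value, __ATOMIC_RELEASE);
+   }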
+
+It also allows the user to emit their own MI_FLUSH/PIPE_CONTROL notify
+interrupt within their batches after updating the value, to have sub-batch
+precision on the wakeup. Each batch can signal a user fence to indicate
+the completion of the next level batch. The completion of the very first
+level batch needs to be signaled by the command streamer. The user must
+provide the user/memory fence for this via the
+DRM_I915_GEM_EXECBUFFER_EXT_USER_FENCE extension of the execbuff ioctl, so
+that the KMD can set up the command streamer to signal it.
+
+User/Memory fence can also be supplied to the kernel driver to signal/wake up
+the user process after completion of an asynchronous operation.
+
+When the VM_BIND ioctl is provided with a user/memory fence via the
+I915_VM_BIND_EXT_USER_FENCE extension, it will be signaled upon the completion
+of binding of that mapping. All async binds/unbinds are serialized, hence
+signaling of the user/memory fence also indicates the completion of all
+previous binds/unbinds.
+
+This feature will be derived from the original work below:
+https://patchwork.freedesktop.org/patch/349417/
+
+Long running Compute contexts
+------------------------------
+Dma-fence usage expects fences to complete in a reasonable amount of time.
+Compute, on the other hand, can be long running. Hence it is appropriate for
+compute to use user/memory fences, and dma-fence usage will be limited to
+in-kernel consumption only. This requires an execbuff uapi extension to pass
+in the user fence (See struct drm_i915_vm_bind_ext_user_fence). Compute must
+opt in to this mechanism with the I915_CONTEXT_CREATE_FLAGS_LONG_RUNNING flag
+during context creation. The dma-fence based user interfaces like the gem_wait
+ioctl and the execbuff out fence are not allowed on long running contexts.
+Implicit sync is not valid either, and is anyway not supported in VM_BIND
+mode.
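+
+A minimal opt-in sketch (assuming the RFC's
+I915_CONTEXT_CREATE_FLAGS_LONG_RUNNING flag):
+
+.. code-block:: c
+
+   struct drm_i915_gem_context_create_ext arg = {
+           .flags = I915_CONTEXT_CREATE_FLAGS_LONG_RUNNING,
+   };
+   drmIoctl(fd, DRM_IOCTL_I915_GEM_CONTEXT_CREATE_EXT, &arg);
+   /* arg.ctx_id is a long running context; I915_EXEC_FENCE_OUT,
+    * I915_EXEC_FENCE_SIGNAL and gem_wait are not allowed on it.
+    */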
+
+Where GPU page faults are not available, the kernel driver, upon buffer
+invalidation, will initiate a suspend (preemption) of the long running context
+with a dma-fence attached to it. Upon completion of that suspend fence, it
+will finish the invalidation, revalidate the BO and then resume the compute
+context. This is done by having a per-context preempt fence (also called a
+suspend fence) proxying as the i915_request fence. This suspend fence is
+enabled when someone tries to wait on it, which then triggers the context
+preemption.
+
+As this support for context suspension using a preempt fence, and the resume
+work for compute mode contexts, can be tricky to get right, it is better to
+add this support in the drm scheduler so that multiple drivers can make use
+of it. That means it will have a dependency on the i915 drm scheduler
+conversion with the GuC scheduler backend. This should be fine, as the plan
+is to support compute mode contexts only with the GuC scheduler backend (at
+least initially). This is much easier to support with VM_BIND mode compared
+to the current heavier execbuff path resource attachment.
+
+Low Latency Submission
+-----------------------
+Allows the compute UMD to directly submit GPU jobs instead of going through
+the execbuff ioctl. This is made possible because VM_BIND is not synchronized
+against execbuff. VM_BIND allows bind/unbind of the mappings required for the
+directly submitted jobs.
+
+Other VM_BIND use cases
+========================
+
+Debugger
+---------
+With the debug event interface, a user space process (the debugger) is able
+to keep track of and act upon resources created by another process (the
+debuggee) and attached to the GPU via the vm_bind interface.
+
+GPU page faults
+----------------
+GPU page faults, when supported (in the future), will only be supported in
+VM_BIND mode. While both the older execbuff mode and the newer VM_BIND mode of
+binding will require using dma-fence to ensure residency, the GPU page fault
+mode, when supported, will not use any dma-fence, as residency is purely
+managed by installing and removing/invalidating page table entries.
+
+Page level hints settings
+--------------------------
+VM_BIND allows setting any hints per mapping instead of per BO.
+Possible hints include read-only mapping, placement and atomicity.
+Sub-BO level placement hint will be even more relevant with
+upcoming GPU on-demand page fault support.
+
+Page level Cache/CLOS settings
+-------------------------------
+VM_BIND allows cache/CLOS settings per mapping instead of per BO.
+
+Shared Virtual Memory (SVM) support
+------------------------------------
+The VM_BIND interface can be used to map system memory directly (without a
+gem BO abstraction) using the HMM interface. SVM is only supported with GPU
+page faults enabled.
+
+
+Broader i915 cleanups
+=====================
+Supporting this whole new vm_bind mode of binding, which comes with its own
+use cases and locking requirements, requires proper integration with the
+existing i915 driver. This calls for some broader i915 driver
+cleanups/simplifications for maintainability of the driver going forward.
+Here are a few things that have been identified and are being looked into:
+
+- Remove the vma lookup cache (eb->gem_context->handles_vma). The VM_BIND
+  feature does not use it, and the complexity it brings in is probably more
+  than the performance advantage we get in the legacy execbuff case.
+- Remove vma->open_count counting.
+- Remove i915_vma active reference tracking. The VM_BIND feature will not use
+  it. Instead, use the underlying BO's dma-resv fence list to determine
+  whether an i915_vma is active or not.
+
+
+VM_BIND UAPI
+=============
+
+.. kernel-doc:: Documentation/gpu/rfc/i915_vm_bind.h
diff --git a/Documentation/gpu/rfc/index.rst b/Documentation/gpu/rfc/index.rst
index 91e93a705230..7d10c36b268d 100644
--- a/Documentation/gpu/rfc/index.rst
+++ b/Documentation/gpu/rfc/index.rst
@@ -23,3 +23,7 @@ host such documentation:
 .. toctree::
 
     i915_scheduler.rst
+
+.. toctree::
+
+    i915_vm_bind.rst
-- 
2.21.0.rc0.32.g243a4c7e27


* [Intel-gfx] [RFC v3 2/3] drm/i915: Update i915 uapi documentation
From: Niranjana Vishwanathapura @ 2022-05-17 18:32 UTC
  To: intel-gfx, dri-devel, daniel.vetter
  Cc: matthew.brost, thomas.hellstrom, jason, chris.p.wilson, christian.koenig

Add some missing i915 uapi documentation which the new
i915 VM_BIND feature documentation will refer to.

Signed-off-by: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>
---
 include/uapi/drm/i915_drm.h | 153 +++++++++++++++++++++++++++---------
 1 file changed, 116 insertions(+), 37 deletions(-)

diff --git a/include/uapi/drm/i915_drm.h b/include/uapi/drm/i915_drm.h
index a2def7b27009..8c834a31b56f 100644
--- a/include/uapi/drm/i915_drm.h
+++ b/include/uapi/drm/i915_drm.h
@@ -751,9 +751,16 @@ typedef struct drm_i915_irq_wait {
 
 /* Must be kept compact -- no holes and well documented */
 
+/**
+ * typedef drm_i915_getparam_t - Driver parameter query structure.
+ */
 typedef struct drm_i915_getparam {
+	/** @param: Driver parameter to query. */
 	__s32 param;
-	/*
+
+	/**
+	 * @value: Address of memory where queried value should be put.
+	 *
 	 * WARNING: Using pointers instead of fixed-size u64 means we need to write
 	 * compat32 code. Don't repeat this mistake.
 	 */
@@ -1239,76 +1246,114 @@ struct drm_i915_gem_exec_object2 {
 	__u64 rsvd2;
 };
 
+/**
+ * struct drm_i915_gem_exec_fence - An input or output fence for the execbuff
+ * ioctl.
+ *
+ * The request will wait for the input fence to signal before submission.
+ *
+ * The returned output fence will be signaled after the completion of the
+ * request.
+ */
 struct drm_i915_gem_exec_fence {
-	/**
-	 * User's handle for a drm_syncobj to wait on or signal.
-	 */
+	/** @handle: User's handle for a drm_syncobj to wait on or signal. */
 	__u32 handle;
 
+	/**
+	 * @flags: Supported flags are:
+	 *
+	 * I915_EXEC_FENCE_WAIT:
+	 * Wait for the input fence before request submission.
+	 *
+	 * I915_EXEC_FENCE_SIGNAL:
+	 * Return the request completion fence as an output fence.
+	 */
+	__u32 flags;
 #define I915_EXEC_FENCE_WAIT            (1<<0)
 #define I915_EXEC_FENCE_SIGNAL          (1<<1)
 #define __I915_EXEC_FENCE_UNKNOWN_FLAGS (-(I915_EXEC_FENCE_SIGNAL << 1))
-	__u32 flags;
 };
 
-/*
- * See drm_i915_gem_execbuffer_ext_timeline_fences.
- */
-#define DRM_I915_GEM_EXECBUFFER_EXT_TIMELINE_FENCES 0
-
-/*
+/**
+ * struct drm_i915_gem_execbuffer_ext_timeline_fences - Timeline fences
+ * for execbuff.
+ *
  * This structure describes an array of drm_syncobj and associated points for
  * timeline variants of drm_syncobj. It is invalid to append this structure to
  * the execbuf if I915_EXEC_FENCE_ARRAY is set.
  */
 struct drm_i915_gem_execbuffer_ext_timeline_fences {
+#define DRM_I915_GEM_EXECBUFFER_EXT_TIMELINE_FENCES 0
+	/** @base: Extension link. See struct i915_user_extension. */
 	struct i915_user_extension base;
 
 	/**
-	 * Number of element in the handles_ptr & value_ptr arrays.
+	 * @fence_count: Number of elements in the @handles_ptr & @values_ptr
+	 * arrays.
 	 */
 	__u64 fence_count;
 
 	/**
-	 * Pointer to an array of struct drm_i915_gem_exec_fence of length
-	 * fence_count.
+	 * @handles_ptr: Pointer to an array of struct drm_i915_gem_exec_fence
+	 * of length @fence_count.
 	 */
 	__u64 handles_ptr;
 
 	/**
-	 * Pointer to an array of u64 values of length fence_count. Values
-	 * must be 0 for a binary drm_syncobj. A Value of 0 for a timeline
-	 * drm_syncobj is invalid as it turns a drm_syncobj into a binary one.
+	 * @values_ptr: Pointer to an array of u64 values of length
+	 * @fence_count.
+	 * Values must be 0 for a binary drm_syncobj. A value of 0 for a
+	 * timeline drm_syncobj is invalid as it turns a drm_syncobj into a
+	 * binary one.
 	 */
 	__u64 values_ptr;
 };
 
+/**
+ * struct drm_i915_gem_execbuffer2 - Structure for execbuff submission
+ */
 struct drm_i915_gem_execbuffer2 {
-	/**
-	 * List of gem_exec_object2 structs
-	 */
+	/** @buffers_ptr: Pointer to a list of gem_exec_object2 structs */
 	__u64 buffers_ptr;
+
+	/** @buffer_count: Number of elements in @buffers_ptr array */
 	__u32 buffer_count;
 
-	/** Offset in the batchbuffer to start execution from. */
+	/**
+	 * @batch_start_offset: Offset in the batchbuffer to start execution
+	 * from.
+	 */
 	__u32 batch_start_offset;
-	/** Bytes used in batchbuffer from batch_start_offset */
+
+	/** @batch_len: Bytes used in batchbuffer from batch_start_offset */
 	__u32 batch_len;
+
+	/** @DR1: deprecated */
 	__u32 DR1;
+
+	/** @DR4: deprecated */
 	__u32 DR4;
+
+	/** @num_cliprects: See @cliprects_ptr */
 	__u32 num_cliprects;
+
 	/**
-	 * This is a struct drm_clip_rect *cliprects if I915_EXEC_FENCE_ARRAY
-	 * & I915_EXEC_USE_EXTENSIONS are not set.
+	 * @cliprects_ptr: Kernel clipping was a DRI1 misfeature.
+	 *
+	 * It is invalid to use this field if I915_EXEC_FENCE_ARRAY or
+	 * I915_EXEC_USE_EXTENSIONS flags are not set.
 	 *
 	 * If I915_EXEC_FENCE_ARRAY is set, then this is a pointer to an array
-	 * of struct drm_i915_gem_exec_fence and num_cliprects is the length
-	 * of the array.
+	 * of &drm_i915_gem_exec_fence and @num_cliprects is the length of the
+	 * array.
 	 *
 	 * If I915_EXEC_USE_EXTENSIONS is set, then this is a pointer to a
-	 * single struct i915_user_extension and num_cliprects is 0.
+	 * single &i915_user_extension and num_cliprects is 0.
 	 */
 	__u64 cliprects_ptr;
+
+	/** @flags: Execbuff flags */
+	__u64 flags;
 #define I915_EXEC_RING_MASK              (0x3f)
 #define I915_EXEC_DEFAULT                (0<<0)
 #define I915_EXEC_RENDER                 (1<<0)
@@ -1326,10 +1371,6 @@ struct drm_i915_gem_execbuffer2 {
 #define I915_EXEC_CONSTANTS_REL_GENERAL (0<<6) /* default */
 #define I915_EXEC_CONSTANTS_ABSOLUTE 	(1<<6)
 #define I915_EXEC_CONSTANTS_REL_SURFACE (2<<6) /* gen4/5 only */
-	__u64 flags;
-	__u64 rsvd1; /* now used for context info */
-	__u64 rsvd2;
-};
 
 /** Resets the SO write offset registers for transform feedback on gen7. */
 #define I915_EXEC_GEN7_SOL_RESET	(1<<8)
@@ -1432,9 +1473,23 @@ struct drm_i915_gem_execbuffer2 {
  * drm_i915_gem_execbuffer_ext enum.
  */
 #define I915_EXEC_USE_EXTENSIONS	(1 << 21)
-
 #define __I915_EXEC_UNKNOWN_FLAGS (-(I915_EXEC_USE_EXTENSIONS << 1))
 
+	/** @rsvd1: Context id */
+	__u64 rsvd1;
+
+	/**
+	 * @rsvd2: in and out sync_file file descriptors.
+	 *
+	 * When I915_EXEC_FENCE_IN or I915_EXEC_FENCE_SUBMIT flag is set, the
+	 * lower 32 bits of this field will have the in sync_file fd (input).
+	 *
+	 * When I915_EXEC_FENCE_OUT flag is set, the upper 32 bits of this
+	 * field will have the out sync_file fd (output).
+	 */
+	__u64 rsvd2;
+};
+
 #define I915_EXEC_CONTEXT_ID_MASK	(0xffffffff)
 #define i915_execbuffer2_set_context_id(eb2, context) \
 	(eb2).rsvd1 = context & I915_EXEC_CONTEXT_ID_MASK
@@ -1814,13 +1869,32 @@ struct drm_i915_gem_context_create {
 	__u32 pad;
 };
 
+/**
+ * struct drm_i915_gem_context_create_ext - Structure for creating contexts.
+ */
 struct drm_i915_gem_context_create_ext {
-	__u32 ctx_id; /* output: id of new context*/
+	/** @ctx_id: Id of the created context (output) */
+	__u32 ctx_id;
+
+	/**
+	 * @flags: Supported flags are:
+	 *
+	 * I915_CONTEXT_CREATE_FLAGS_USE_EXTENSIONS:
+	 *
+	 * Extensions may be appended to this structure and the driver must
+	 * check for those.
+	 *
+	 * I915_CONTEXT_CREATE_FLAGS_SINGLE_TIMELINE:
+	 *
+	 * The created context will have a single timeline.
+	 */
 	__u32 flags;
 #define I915_CONTEXT_CREATE_FLAGS_USE_EXTENSIONS	(1u << 0)
 #define I915_CONTEXT_CREATE_FLAGS_SINGLE_TIMELINE	(1u << 1)
 #define I915_CONTEXT_CREATE_FLAGS_UNKNOWN \
 	(-(I915_CONTEXT_CREATE_FLAGS_SINGLE_TIMELINE << 1))
+
+	/** @extensions: Zero-terminated chain of extensions. */
 	__u64 extensions;
 };
 
@@ -2387,7 +2461,9 @@ struct drm_i915_gem_context_destroy {
 	__u32 pad;
 };
 
-/*
+/**
+ * struct drm_i915_gem_vm_control - Structure to create or destroy VM.
+ *
  * DRM_I915_GEM_VM_CREATE -
  *
  * Create a new virtual memory address space (ppGTT) for use within a context
@@ -2397,20 +2473,23 @@ struct drm_i915_gem_context_destroy {
  * The id of new VM (bound to the fd) for use with I915_CONTEXT_PARAM_VM is
  * returned in the outparam @id.
  *
- * No flags are defined, with all bits reserved and must be zero.
- *
  * An extension chain maybe provided, starting with @extensions, and terminated
  * by the @next_extension being 0. Currently, no extensions are defined.
  *
  * DRM_I915_GEM_VM_DESTROY -
  *
- * Destroys a previously created VM id, specified in @id.
+ * Destroys a previously created VM id, specified in @vm_id.
  *
  * No extensions or flags are allowed currently, and so must be zero.
  */
 struct drm_i915_gem_vm_control {
+	/** @extensions: Zero-terminated chain of extensions. */
 	__u64 extensions;
+
+	/** @flags: reserved for future usage, currently MBZ */
 	__u32 flags;
+
+	/** @vm_id: Id of the VM created or to be destroyed */
 	__u32 vm_id;
 };
 
-- 
2.21.0.rc0.32.g243a4c7e27


* [Intel-gfx] [RFC v3 3/3] drm/doc/rfc: VM_BIND uapi definition
From: Niranjana Vishwanathapura @ 2022-05-17 18:32 UTC
  To: intel-gfx, dri-devel, daniel.vetter
  Cc: thomas.hellstrom, chris.p.wilson, christian.koenig

VM_BIND and related uapi definitions

v2: Ensure proper kernel-doc formatting with cross references.
    Also add new uapi and documentation as per review comments
    from Daniel.

Signed-off-by: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>
---
 Documentation/gpu/rfc/i915_vm_bind.h | 399 +++++++++++++++++++++++++++
 1 file changed, 399 insertions(+)
 create mode 100644 Documentation/gpu/rfc/i915_vm_bind.h

diff --git a/Documentation/gpu/rfc/i915_vm_bind.h b/Documentation/gpu/rfc/i915_vm_bind.h
new file mode 100644
index 000000000000..589c0a009107
--- /dev/null
+++ b/Documentation/gpu/rfc/i915_vm_bind.h
@@ -0,0 +1,399 @@
+/* SPDX-License-Identifier: MIT */
+/*
+ * Copyright © 2022 Intel Corporation
+ */
+
+/**
+ * DOC: I915_PARAM_HAS_VM_BIND
+ *
+ * VM_BIND feature availability.
+ * See typedef drm_i915_getparam_t param.
+ */
+#define I915_PARAM_HAS_VM_BIND		57
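
A sketch of probing this param (assuming this RFC lands as written; fd is an
open DRM device fd and drmIoctl() is libdrm's ioctl wrapper):

  int has_vm_bind = 0;
  struct drm_i915_getparam gp = {
          .param = I915_PARAM_HAS_VM_BIND,
          .value = &has_vm_bind,
  };

  if (drmIoctl(fd, DRM_IOCTL_I915_GETPARAM, &gp) == 0 && has_vm_bind)
          /* VM_BIND mode can be used */;
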
+
+/**
+ * DOC: I915_VM_CREATE_FLAGS_USE_VM_BIND
+ *
+ * Flag to opt-in for VM_BIND mode of binding during VM creation.
+ * See struct drm_i915_gem_vm_control flags.
+ *
+ * A VM in VM_BIND mode will not support the older execbuff mode of binding.
+ * In VM_BIND mode, the execbuff ioctl will not accept any execlist (i.e., the
+ * &drm_i915_gem_execbuffer2.buffer_count must be 0).
+ * Also, &drm_i915_gem_execbuffer2.batch_start_offset and
+ * &drm_i915_gem_execbuffer2.batch_len must be 0.
+ * DRM_I915_GEM_EXECBUFFER_EXT_BATCH_ADDRESSES extension must be provided
+ * to pass in the batch buffer addresses.
+ *
+ * Additionally, I915_EXEC_NO_RELOC, I915_EXEC_HANDLE_LUT and
+ * I915_EXEC_BATCH_FIRST of &drm_i915_gem_execbuffer2.flags must be 0
+ * (not used) in VM_BIND mode. I915_EXEC_USE_EXTENSIONS flag must always be
+ * set (See struct drm_i915_gem_execbuffer_ext_batch_addresses).
+ * The buffers_ptr, buffer_count, batch_start_offset and batch_len fields
+ * of struct drm_i915_gem_execbuffer2 are also not used and must be 0.
+ */
+#define I915_VM_CREATE_FLAGS_USE_VM_BIND	(1 << 0)
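
Opting in would then look like this sketch (flag value as proposed above):

  struct drm_i915_gem_vm_control ctl = {
          .flags = I915_VM_CREATE_FLAGS_USE_VM_BIND,
  };
  drmIoctl(fd, DRM_IOCTL_I915_GEM_VM_CREATE, &ctl);
  /* ctl.vm_id identifies the new VM_BIND address space */
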
+
+/**
+ * DOC: I915_CONTEXT_CREATE_FLAGS_LONG_RUNNING
+ *
+ * Flag to declare context as long running.
+ * See struct drm_i915_gem_context_create_ext flags.
+ *
+ * Usage of dma-fences expects that they complete in a reasonable amount of
+ * time. Compute workloads, on the other hand, can be long running. Hence it
+ * is not appropriate for compute contexts to export request completion
+ * dma-fences to userspace; dma-fence usage will be limited to in-kernel
+ * consumption only. Compute contexts need to use user/memory fences instead.
+ *
+ * So, long running contexts do not support output fences. Hence,
+ * I915_EXEC_FENCE_OUT (See &drm_i915_gem_execbuffer2.flags) and
+ * I915_EXEC_FENCE_SIGNAL (See &drm_i915_gem_exec_fence.flags) are not
+ * expected to be used.
+ *
+ * DRM_I915_GEM_WAIT ioctl call is also not supported for objects mapped
+ * to long running contexts.
+ */
+#define I915_CONTEXT_CREATE_FLAGS_LONG_RUNNING   (1u << 2)
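
Sketch of declaring a compute context as long running at creation time
(hypothetical until this flag is merged):

  struct drm_i915_gem_context_create_ext arg = {
          .flags = I915_CONTEXT_CREATE_FLAGS_LONG_RUNNING,
  };
  drmIoctl(fd, DRM_IOCTL_I915_GEM_CONTEXT_CREATE_EXT, &arg);
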
+
+/* VM_BIND related ioctls */
+#define DRM_I915_GEM_VM_BIND		0x3d
+#define DRM_I915_GEM_VM_UNBIND		0x3e
+#define DRM_I915_GEM_WAIT_USER_FENCE	0x3f
+
+#define DRM_IOCTL_I915_GEM_VM_BIND		DRM_IOWR(DRM_COMMAND_BASE + DRM_I915_GEM_VM_BIND, struct drm_i915_gem_vm_bind)
+#define DRM_IOCTL_I915_GEM_VM_UNBIND		DRM_IOWR(DRM_COMMAND_BASE + DRM_I915_GEM_VM_UNBIND, struct drm_i915_gem_vm_unbind)
+#define DRM_IOCTL_I915_GEM_WAIT_USER_FENCE	DRM_IOWR(DRM_COMMAND_BASE + DRM_I915_GEM_WAIT_USER_FENCE, struct drm_i915_gem_wait_user_fence)
+
+/**
+ * struct drm_i915_gem_vm_bind - VA to object mapping to bind.
+ *
+ * This structure is passed to VM_BIND ioctl and specifies the mapping of GPU
+ * virtual address (VA) range to the section of an object that should be bound
+ * in the device page table of the specified address space (VM).
+ * The VA range specified must be unique (i.e., not currently bound) and can
+ * be mapped to the whole object or to a section of the object (partial binding).
+ * Multiple VA mappings can be created to the same section of the object
+ * (aliasing).
+ */
+struct drm_i915_gem_vm_bind {
+	/** @vm_id: VM (address space) id to bind */
+	__u32 vm_id;
+
+	/** @handle: Object handle */
+	__u32 handle;
+
+	/** @start: Virtual Address start to bind */
+	__u64 start;
+
+	/** @offset: Offset in object to bind */
+	__u64 offset;
+
+	/** @length: Length of mapping to bind */
+	__u64 length;
+
+	/**
+	 * @flags: Supported flags are,
+	 *
+	 * I915_GEM_VM_BIND_READONLY:
+	 * Mapping is read-only.
+	 *
+	 * I915_GEM_VM_BIND_CAPTURE:
+	 * Capture this mapping in the dump upon GPU error.
+	 */
+	__u64 flags;
+#define I915_GEM_VM_BIND_READONLY    (1 << 0)
+#define I915_GEM_VM_BIND_CAPTURE     (1 << 1)
+
+	/** @extensions: 0-terminated chain of extensions for this mapping. */
+	__u64 extensions;
+};
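
As an illustration, binding a 64KiB section of a buffer object at a
userspace-chosen VA could look like this (sketch; vm_id and bo_handle are
assumed to come from earlier VM_CREATE and GEM_CREATE calls):

  struct drm_i915_gem_vm_bind bind = {
          .vm_id = vm_id,
          .handle = bo_handle,
          .start = 0x100000,   /* GPU VA chosen by userspace */
          .offset = 0,         /* bind from the start of the object */
          .length = 64 * 1024,
          .flags = I915_GEM_VM_BIND_READONLY,
  };
  drmIoctl(fd, DRM_IOCTL_I915_GEM_VM_BIND, &bind);
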
+
+/**
+ * struct drm_i915_gem_vm_unbind - VA to object mapping to unbind.
+ *
+ * This structure is passed to VM_UNBIND ioctl and specifies the GPU virtual
+ * address (VA) range that should be unbound from the device page table of the
+ * specified address space (VM). The specified VA range must match one of the
+ * mappings created with the VM_BIND ioctl. The TLB is flushed upon unbind
+ * completion.
+ */
+struct drm_i915_gem_vm_unbind {
+	/** @vm_id: VM (address space) id to unbind from */
+	__u32 vm_id;
+
+	/** @rsvd: Reserved for future use; must be zero. */
+	__u32 rsvd;
+
+	/** @start: Virtual Address start to unbind */
+	__u64 start;
+
+	/** @length: Length of mapping to unbind */
+	__u64 length;
+
+	/** @flags: reserved for future usage, currently MBZ */
+	__u64 flags;
+
+	/** @extensions: 0-terminated chain of extensions for this mapping. */
+	__u64 extensions;
+};
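
The matching unbind simply names the VA range (continuing the bind sketch
above):

  struct drm_i915_gem_vm_unbind unbind = {
          .vm_id = vm_id,
          .start = 0x100000,
          .length = 64 * 1024,
  };
  drmIoctl(fd, DRM_IOCTL_I915_GEM_VM_UNBIND, &unbind);
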
+
+/**
+ * struct drm_i915_vm_bind_fence - An input or output fence for the vm_bind
+ * or the vm_unbind work.
+ *
+ * The vm_bind or vm_unbind async worker will wait for the input fence to signal
+ * before starting the binding or unbinding.
+ *
+ * The vm_bind or vm_unbind async worker will signal the returned output fence
+ * after the completion of binding or unbinding.
+ */
+struct drm_i915_vm_bind_fence {
+	/** @handle: User's handle for a drm_syncobj to wait on or signal. */
+	__u32 handle;
+
+	/**
+	 * @flags: Supported flags are,
+	 *
+	 * I915_VM_BIND_FENCE_WAIT:
+	 * Wait for the input fence before binding/unbinding
+	 *
+	 * I915_VM_BIND_FENCE_SIGNAL:
+	 * Return bind/unbind completion fence as output
+	 */
+	__u32 flags;
+#define I915_VM_BIND_FENCE_WAIT            (1<<0)
+#define I915_VM_BIND_FENCE_SIGNAL          (1<<1)
+#define __I915_VM_BIND_FENCE_UNKNOWN_FLAGS (-(I915_VM_BIND_FENCE_SIGNAL << 1))
+};
+
+/**
+ * struct drm_i915_vm_bind_ext_timeline_fences - Timeline fences for vm_bind
+ * and vm_unbind.
+ *
+ * This structure describes an array of timeline drm_syncobj and associated
+ * points for timeline variants of drm_syncobj. These timeline 'drm_syncobj's
+ * can be input or output fences (See struct drm_i915_vm_bind_fence).
+ */
+struct drm_i915_vm_bind_ext_timeline_fences {
+#define I915_VM_BIND_EXT_TIMELINE_FENCES	0
+	/** @base: Extension link. See struct i915_user_extension. */
+	struct i915_user_extension base;
+
+	/**
+	 * @fence_count: Number of elements in the @handles_ptr & @values_ptr
+	 * arrays.
+	 */
+	__u64 fence_count;
+
+	/**
+	 * @handles_ptr: Pointer to an array of struct drm_i915_vm_bind_fence
+	 * of length @fence_count.
+	 */
+	__u64 handles_ptr;
+
+	/**
+	 * @values_ptr: Pointer to an array of u64 values of length
+	 * @fence_count.
+	 * Values must be 0 for a binary drm_syncobj. A value of 0 for a
+	 * timeline drm_syncobj is invalid as it turns a drm_syncobj into a
+	 * binary one.
+	 */
+	__u64 values_ptr;
+};
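
Extending the bind sketch above, requesting a completion fence on a binary
drm_syncobj would look roughly like this (syncobj_handle is assumed to come
from an earlier DRM_IOCTL_SYNCOBJ_CREATE):

  struct drm_i915_vm_bind_fence fence = {
          .handle = syncobj_handle,
          .flags = I915_VM_BIND_FENCE_SIGNAL,
  };
  __u64 point = 0; /* 0 == binary drm_syncobj */
  struct drm_i915_vm_bind_ext_timeline_fences ext = {
          .base.name = I915_VM_BIND_EXT_TIMELINE_FENCES,
          .fence_count = 1,
          .handles_ptr = (__u64)(uintptr_t)&fence,
          .values_ptr = (__u64)(uintptr_t)&point,
  };

  bind.extensions = (__u64)(uintptr_t)&ext;
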
+
+/**
+ * struct drm_i915_vm_bind_user_fence - An input or output user fence for the
+ * vm_bind or the vm_unbind work.
+ *
+ * The vm_bind or vm_unbind async worker will wait for the input fence (value
+ * at @addr to become equal to @val) before starting the binding or unbinding.
+ *
+ * The vm_bind or vm_unbind async worker will signal the output fence after
+ * the completion of binding or unbinding by writing @val to the memory
+ * location at @addr.
+ */
+struct drm_i915_vm_bind_user_fence {
+	/** @addr: User/Memory fence qword aligned process virtual address */
+	__u64 addr;
+
+	/** @val: User/Memory fence value to be written after bind completion */
+	__u64 val;
+
+	/**
+	 * @flags: Supported flags are,
+	 *
+	 * I915_VM_BIND_USER_FENCE_WAIT:
+	 * Wait for the input fence before binding/unbinding
+	 *
+	 * I915_VM_BIND_USER_FENCE_SIGNAL:
+	 * Return bind/unbind completion fence as output
+	 */
+	__u32 flags;
+#define I915_VM_BIND_USER_FENCE_WAIT            (1<<0)
+#define I915_VM_BIND_USER_FENCE_SIGNAL          (1<<1)
+#define __I915_VM_BIND_USER_FENCE_UNKNOWN_FLAGS \
+	(-(I915_VM_BIND_USER_FENCE_SIGNAL << 1))
+};
+
+/**
+ * struct drm_i915_vm_bind_ext_user_fence - User/memory fences for vm_bind
+ * and vm_unbind.
+ *
+ * These user fences can be input or output fences
+ * (See struct drm_i915_vm_bind_user_fence).
+ */
+struct drm_i915_vm_bind_ext_user_fence {
+#define I915_VM_BIND_EXT_USER_FENCES	1
+	/** @base: Extension link. See struct i915_user_extension. */
+	struct i915_user_extension base;
+
+	/** @fence_count: Number of elements in the @user_fence_ptr array. */
+	__u64 fence_count;
+
+	/**
+	 * @user_fence_ptr: Pointer to an array of
+	 * struct drm_i915_vm_bind_user_fence of length @fence_count.
+	 */
+	__u64 user_fence_ptr;
+};
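
Alternatively, a user/memory fence can signal bind completion by writing 1 to
a qword owned by the caller (sketch; fence_cpu_va is any qword-aligned
location in the process address space):

  struct drm_i915_vm_bind_user_fence ufence = {
          .addr = (__u64)(uintptr_t)fence_cpu_va,
          .val = 1,
          .flags = I915_VM_BIND_USER_FENCE_SIGNAL,
  };
  struct drm_i915_vm_bind_ext_user_fence ext = {
          .base.name = I915_VM_BIND_EXT_USER_FENCES,
          .fence_count = 1,
          .user_fence_ptr = (__u64)(uintptr_t)&ufence,
  };

  bind.extensions = (__u64)(uintptr_t)&ext;
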
+
+/**
+ * struct drm_i915_gem_execbuffer_ext_batch_addresses - Array of batch buffer
+ * gpu virtual addresses.
+ *
+ * In the execbuff ioctl (See struct drm_i915_gem_execbuffer2), this extension
+ * must always be appended in VM_BIND mode, and it is an error to append it
+ * in the older non-VM_BIND mode.
+ */
+struct drm_i915_gem_execbuffer_ext_batch_addresses {
+#define DRM_I915_GEM_EXECBUFFER_EXT_BATCH_ADDRESSES	1
+	/** @base: Extension link. See struct i915_user_extension. */
+	struct i915_user_extension base;
+
+	/** @count: Number of addresses in the addr array. */
+	__u32 count;
+
+	/** @addr: An array of batch gpu virtual addresses. */
+	__u64 addr[0];
+};
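
Putting it together, a VM_BIND mode execbuf carries no execlist and passes
the batch VA through this extension; the chain is linked through
&drm_i915_gem_execbuffer2.cliprects_ptr, which i915 reuses for extensions
when I915_EXEC_USE_EXTENSIONS is set. Sketch with a single batch (batch_va
is assumed to be a VA established earlier with VM_BIND):

  size_t sz = sizeof(struct drm_i915_gem_execbuffer_ext_batch_addresses) +
              sizeof(__u64);
  struct drm_i915_gem_execbuffer_ext_batch_addresses *ext = calloc(1, sz);

  ext->base.name = DRM_I915_GEM_EXECBUFFER_EXT_BATCH_ADDRESSES;
  ext->count = 1;
  ext->addr[0] = batch_va;

  struct drm_i915_gem_execbuffer2 eb = {
          .cliprects_ptr = (__u64)(uintptr_t)ext, /* extension chain */
          .flags = I915_EXEC_USE_EXTENSIONS,
          .rsvd1 = ctx_id, /* context bound to the VM_BIND VM */
  };
  drmIoctl(fd, DRM_IOCTL_I915_GEM_EXECBUFFER2, &eb);
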
+
+/**
+ * struct drm_i915_gem_execbuffer_ext_user_fence - First level batch completion
+ * signaling extension.
+ *
+ * This extension allows user to attach a user fence (@addr, @value pair) to an
+ * execbuf to be signaled by the command streamer after the completion of first
+ * level batch, by writing the @value at specified @addr and triggering an
+ * interrupt.
+ * The user can either poll for this user fence to signal or wait on it with
+ * the i915_gem_wait_user_fence ioctl.
+ * This is very useful for long running contexts, where waiting on a dma-fence
+ * by the user (like the i915_gem_wait ioctl) is not supported.
+ */
+struct drm_i915_gem_execbuffer_ext_user_fence {
+#define DRM_I915_GEM_EXECBUFFER_EXT_USER_FENCE		2
+	/** @base: Extension link. See struct i915_user_extension. */
+	struct i915_user_extension base;
+
+	/**
+	 * @addr: User/Memory fence qword aligned GPU virtual address.
+	 *
+	 * Address has to be a valid GPU virtual address at the time of
+	 * first level batch completion.
+	 */
+	__u64 addr;
+
+	/**
+	 * @value: User/Memory fence Value to be written to above address
+	 * after first level batch completes.
+	 */
+	__u64 value;
+
+	/** @rsvd: Reserved for future extensions, MBZ */
+	__u64 rsvd;
+};
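
Continuing the execbuf sketch above, this extension would be chained behind
the batch-addresses one (fence_gpu_va is assumed to be a VA that is still
bound at first level batch completion time):

  struct drm_i915_gem_execbuffer_ext_user_fence ufext = {
          .base.name = DRM_I915_GEM_EXECBUFFER_EXT_USER_FENCE,
          .addr = fence_gpu_va,
          .value = 1,
  };

  ext->base.next_extension = (__u64)(uintptr_t)&ufext;
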
+
+/**
+ * struct drm_i915_gem_create_ext_vm_private - Extension to make the object
+ * private to the specified VM.
+ *
+ * See struct drm_i915_gem_create_ext.
+ */
+struct drm_i915_gem_create_ext_vm_private {
+#define I915_GEM_CREATE_EXT_VM_PRIVATE		2
+	/** @base: Extension link. See struct i915_user_extension. */
+	struct i915_user_extension base;
+
+	/** @vm_id: Id of the VM to which the object is private */
+	__u32 vm_id;
+};
+
+/**
+ * struct drm_i915_gem_wait_user_fence - Wait on user/memory fence.
+ *
+ * User/Memory fence can be woken up either by:
+ *
+ * 1. GPU context indicated by @ctx_id, or,
+ * 2. Kernel driver async worker upon I915_UFENCE_WAIT_SOFT.
+ *    @ctx_id is ignored when this flag is set.
+ *
+ * Wakeup condition is,
+ * ``((*addr & mask) op (value & mask))``
+ *
+ * See :ref:`Documentation/driver-api/dma-buf.rst <indefinite_dma_fences>`
+ */
+struct drm_i915_gem_wait_user_fence {
+	/** @extensions: Zero-terminated chain of extensions. */
+	__u64 extensions;
+
+	/** @addr: User/Memory fence address */
+	__u64 addr;
+
+	/** @ctx_id: Id of the Context which will signal the fence. */
+	__u32 ctx_id;
+
+	/** @op: Wakeup condition operator */
+	__u16 op;
+#define I915_UFENCE_WAIT_EQ      0
+#define I915_UFENCE_WAIT_NEQ     1
+#define I915_UFENCE_WAIT_GT      2
+#define I915_UFENCE_WAIT_GTE     3
+#define I915_UFENCE_WAIT_LT      4
+#define I915_UFENCE_WAIT_LTE     5
+#define I915_UFENCE_WAIT_BEFORE  6
+#define I915_UFENCE_WAIT_AFTER   7
+
+	/**
+	 * @flags: Supported flags are,
+	 *
+	 * I915_UFENCE_WAIT_SOFT:
+	 *
+	 * To be woken up by i915 driver async worker (not by GPU).
+	 *
+	 * I915_UFENCE_WAIT_ABSTIME:
+	 *
+	 * Wait timeout specified as absolute time.
+	 */
+	__u16 flags;
+#define I915_UFENCE_WAIT_SOFT    0x1
+#define I915_UFENCE_WAIT_ABSTIME 0x2
+
+	/** @value: Wakeup value */
+	__u64 value;
+
+	/** @mask: Wakeup mask */
+	__u64 mask;
+#define I915_UFENCE_WAIT_U8     0xffu
+#define I915_UFENCE_WAIT_U16    0xffffu
+#define I915_UFENCE_WAIT_U32    0xfffffffful
+#define I915_UFENCE_WAIT_U64    0xffffffffffffffffull
+
+	/**
+	 * @timeout: Wait timeout in nanoseconds.
+	 *
+	 * If the I915_UFENCE_WAIT_ABSTIME flag is set, then the timeout is the
+	 * absolute time in nsec.
+	 */
+	__s64 timeout;
+};
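
A sketch of a bounded wait on the fence signaled above, using the
``((*addr & mask) op (value & mask))`` condition (fence_cpu_va being the CPU
mapping of the fence qword is an assumption about the final uapi):

  struct drm_i915_gem_wait_user_fence wait = {
          .addr = (__u64)(uintptr_t)fence_cpu_va,
          .ctx_id = ctx_id,
          .op = I915_UFENCE_WAIT_GTE,
          .value = 1,
          .mask = I915_UFENCE_WAIT_U64,
          .timeout = 1000000000, /* 1s, relative since ABSTIME is not set */
  };
  drmIoctl(fd, DRM_IOCTL_I915_GEM_WAIT_USER_FENCE, &wait);
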
-- 
2.21.0.rc0.32.g243a4c7e27


^ permalink raw reply related	[flat|nested] 121+ messages in thread

* [Intel-gfx] ✗ Fi.CI.CHECKPATCH: warning for drm/doc/rfc: i915 VM_BIND feature design + uapi (rev3)
  2022-05-17 18:32 ` Niranjana Vishwanathapura
                   ` (3 preceding siblings ...)
  (?)
@ 2022-05-17 20:49 ` Patchwork
  -1 siblings, 0 replies; 121+ messages in thread
From: Patchwork @ 2022-05-17 20:49 UTC (permalink / raw)
  To: Niranjana Vishwanathapura; +Cc: intel-gfx

== Series Details ==

Series: drm/doc/rfc: i915 VM_BIND feature design + uapi (rev3)
URL   : https://patchwork.freedesktop.org/series/93447/
State : warning

== Summary ==

Error: dim checkpatch failed
b4f01b5605b4 drm/doc/rfc: VM_BIND feature design document
-:27: WARNING:FILE_PATH_CHANGES: added, moved or deleted file(s), does MAINTAINERS need updating?
#27: 
new file mode 100644

-:32: WARNING:SPDX_LICENSE_TAG: Missing or malformed SPDX-License-Identifier tag in line 1
#32: FILE: Documentation/gpu/rfc/i915_vm_bind.rst:1:
+==========================================

total: 0 errors, 2 warnings, 0 checks, 319 lines checked
50b7adcbd762 drm/i915: Update i915 uapi documentation
e72771b5018c drm/doc/rfc: VM_BIND uapi definition
Traceback (most recent call last):
  File "scripts/spdxcheck.py", line 6, in <module>
    from ply import lex, yacc
ModuleNotFoundError: No module named 'ply'
-:15: WARNING:FILE_PATH_CHANGES: added, moved or deleted file(s), does MAINTAINERS need updating?
#15: 
new file mode 100644

-:83: WARNING:LONG_LINE: line length of 126 exceeds 100 columns
#83: FILE: Documentation/gpu/rfc/i915_vm_bind.h:64:
+#define DRM_IOCTL_I915_GEM_VM_BIND		DRM_IOWR(DRM_COMMAND_BASE + DRM_I915_GEM_VM_BIND, struct drm_i915_gem_vm_bind)

-:84: WARNING:LONG_LINE: line length of 128 exceeds 100 columns
#84: FILE: Documentation/gpu/rfc/i915_vm_bind.h:65:
+#define DRM_IOCTL_I915_GEM_VM_UNBIND		DRM_IOWR(DRM_COMMAND_BASE + DRM_I915_GEM_VM_UNBIND, struct drm_i915_gem_vm_bind)

-:85: WARNING:LONG_LINE: line length of 142 exceeds 100 columns
#85: FILE: Documentation/gpu/rfc/i915_vm_bind.h:66:
+#define DRM_IOCTL_I915_GEM_WAIT_USER_FENCE	DRM_IOWR(DRM_COMMAND_BASE + DRM_I915_GEM_WAIT_USER_FENCE, struct drm_i915_gem_wait_user_fence)

-:184: CHECK:SPACING: spaces preferred around that '<<' (ctx:VxV)
#184: FILE: Documentation/gpu/rfc/i915_vm_bind.h:165:
+#define I915_VM_BIND_FENCE_WAIT            (1<<0)
                                              ^

-:185: CHECK:SPACING: spaces preferred around that '<<' (ctx:VxV)
#185: FILE: Documentation/gpu/rfc/i915_vm_bind.h:166:
+#define I915_VM_BIND_FENCE_SIGNAL          (1<<1)
                                              ^

-:252: CHECK:SPACING: spaces preferred around that '<<' (ctx:VxV)
#252: FILE: Documentation/gpu/rfc/i915_vm_bind.h:233:
+#define I915_VM_BIND_USER_FENCE_WAIT            (1<<0)
                                                   ^

-:253: CHECK:SPACING: spaces preferred around that '<<' (ctx:VxV)
#253: FILE: Documentation/gpu/rfc/i915_vm_bind.h:234:
+#define I915_VM_BIND_USER_FENCE_SIGNAL          (1<<1)
                                                   ^

total: 0 errors, 4 warnings, 4 checks, 399 lines checked



^ permalink raw reply	[flat|nested] 121+ messages in thread

* [Intel-gfx] ✗ Fi.CI.SPARSE: warning for drm/doc/rfc: i915 VM_BIND feature design + uapi (rev3)
  2022-05-17 18:32 ` Niranjana Vishwanathapura
                   ` (4 preceding siblings ...)
  (?)
@ 2022-05-17 20:49 ` Patchwork
  -1 siblings, 0 replies; 121+ messages in thread
From: Patchwork @ 2022-05-17 20:49 UTC (permalink / raw)
  To: Niranjana Vishwanathapura; +Cc: intel-gfx

== Series Details ==

Series: drm/doc/rfc: i915 VM_BIND feature design + uapi (rev3)
URL   : https://patchwork.freedesktop.org/series/93447/
State : warning

== Summary ==

Error: dim sparse failed
Sparse version: v0.6.2
Fast mode used, each commit won't be checked separately.



^ permalink raw reply	[flat|nested] 121+ messages in thread

* [Intel-gfx] ✓ Fi.CI.BAT: success for drm/doc/rfc: i915 VM_BIND feature design + uapi (rev3)
  2022-05-17 18:32 ` Niranjana Vishwanathapura
                   ` (5 preceding siblings ...)
  (?)
@ 2022-05-17 21:09 ` Patchwork
  -1 siblings, 0 replies; 121+ messages in thread
From: Patchwork @ 2022-05-17 21:09 UTC (permalink / raw)
  To: Niranjana Vishwanathapura; +Cc: intel-gfx

== Series Details ==

Series: drm/doc/rfc: i915 VM_BIND feature design + uapi (rev3)
URL   : https://patchwork.freedesktop.org/series/93447/
State : success

== Summary ==

CI Bug Log - changes from CI_DRM_11668 -> Patchwork_93447v3
====================================================

Summary
-------

  **SUCCESS**

  No regressions found.

  External URL: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/index.html

Participating hosts (40 -> 40)
------------------------------

  Additional (2): fi-kbl-soraka bat-dg2-8 
  Missing    (2): fi-rkl-11600 bat-dg2-9 

Known issues
------------

  Here are the changes found in Patchwork_93447v3 that come from known issues:

### IGT changes ###

#### Issues hit ####

  * igt@gem_exec_fence@basic-busy@bcs0:
    - fi-kbl-soraka:      NOTRUN -> [SKIP][1] ([fdo#109271]) +9 similar issues
   [1]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/fi-kbl-soraka/igt@gem_exec_fence@basic-busy@bcs0.html

  * igt@gem_huc_copy@huc-copy:
    - fi-kbl-soraka:      NOTRUN -> [SKIP][2] ([fdo#109271] / [i915#2190])
   [2]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/fi-kbl-soraka/igt@gem_huc_copy@huc-copy.html

  * igt@gem_lmem_swapping@basic:
    - fi-kbl-soraka:      NOTRUN -> [SKIP][3] ([fdo#109271] / [i915#4613]) +3 similar issues
   [3]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/fi-kbl-soraka/igt@gem_lmem_swapping@basic.html

  * igt@i915_selftest@live@gem:
    - fi-blb-e6850:       NOTRUN -> [DMESG-FAIL][4] ([i915#4528])
   [4]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/fi-blb-e6850/igt@i915_selftest@live@gem.html

  * igt@i915_selftest@live@gem_migrate:
    - fi-bdw-5557u:       [PASS][5] -> [INCOMPLETE][6] ([i915#5716])
   [5]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11668/fi-bdw-5557u/igt@i915_selftest@live@gem_migrate.html
   [6]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/fi-bdw-5557u/igt@i915_selftest@live@gem_migrate.html

  * igt@i915_selftest@live@gt_pm:
    - fi-kbl-soraka:      NOTRUN -> [DMESG-FAIL][7] ([i915#1886])
   [7]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/fi-kbl-soraka/igt@i915_selftest@live@gt_pm.html

  * igt@kms_chamelium@common-hpd-after-suspend:
    - fi-hsw-4770:        NOTRUN -> [SKIP][8] ([fdo#109271] / [fdo#111827])
   [8]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/fi-hsw-4770/igt@kms_chamelium@common-hpd-after-suspend.html

  * igt@kms_chamelium@dp-edid-read:
    - fi-kbl-soraka:      NOTRUN -> [SKIP][9] ([fdo#109271] / [fdo#111827]) +7 similar issues
   [9]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/fi-kbl-soraka/igt@kms_chamelium@dp-edid-read.html

  * igt@kms_pipe_crc_basic@compare-crc-sanitycheck-pipe-d:
    - fi-kbl-soraka:      NOTRUN -> [SKIP][10] ([fdo#109271] / [i915#533])
   [10]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/fi-kbl-soraka/igt@kms_pipe_crc_basic@compare-crc-sanitycheck-pipe-d.html

  
#### Possible fixes ####

  * igt@i915_selftest@live@hangcheck:
    - bat-dg1-5:          [DMESG-FAIL][11] ([i915#4494] / [i915#4957]) -> [PASS][12]
   [11]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11668/bat-dg1-5/igt@i915_selftest@live@hangcheck.html
   [12]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/bat-dg1-5/igt@i915_selftest@live@hangcheck.html
    - fi-hsw-4770:        [INCOMPLETE][13] ([i915#4785]) -> [PASS][14]
   [13]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11668/fi-hsw-4770/igt@i915_selftest@live@hangcheck.html
   [14]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/fi-hsw-4770/igt@i915_selftest@live@hangcheck.html

  * igt@i915_selftest@live@requests:
    - fi-blb-e6850:       [DMESG-FAIL][15] ([i915#4528]) -> [PASS][16]
   [15]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11668/fi-blb-e6850/igt@i915_selftest@live@requests.html
   [16]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/fi-blb-e6850/igt@i915_selftest@live@requests.html

  * igt@kms_flip@basic-flip-vs-modeset@a-edp1:
    - {bat-adlp-6}:       [DMESG-WARN][17] ([i915#3576]) -> [PASS][18] +1 similar issue
   [17]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11668/bat-adlp-6/igt@kms_flip@basic-flip-vs-modeset@a-edp1.html
   [18]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/bat-adlp-6/igt@kms_flip@basic-flip-vs-modeset@a-edp1.html

  
  {name}: This element is suppressed. This means it is ignored when computing
          the status of the difference (SUCCESS, WARNING, or FAILURE).

  [fdo#109271]: https://bugs.freedesktop.org/show_bug.cgi?id=109271
  [fdo#109285]: https://bugs.freedesktop.org/show_bug.cgi?id=109285
  [fdo#111827]: https://bugs.freedesktop.org/show_bug.cgi?id=111827
  [i915#1072]: https://gitlab.freedesktop.org/drm/intel/issues/1072
  [i915#1155]: https://gitlab.freedesktop.org/drm/intel/issues/1155
  [i915#1886]: https://gitlab.freedesktop.org/drm/intel/issues/1886
  [i915#2190]: https://gitlab.freedesktop.org/drm/intel/issues/2190
  [i915#2582]: https://gitlab.freedesktop.org/drm/intel/issues/2582
  [i915#3291]: https://gitlab.freedesktop.org/drm/intel/issues/3291
  [i915#3547]: https://gitlab.freedesktop.org/drm/intel/issues/3547
  [i915#3555]: https://gitlab.freedesktop.org/drm/intel/issues/3555
  [i915#3576]: https://gitlab.freedesktop.org/drm/intel/issues/3576
  [i915#3595]: https://gitlab.freedesktop.org/drm/intel/issues/3595
  [i915#3708]: https://gitlab.freedesktop.org/drm/intel/issues/3708
  [i915#4077]: https://gitlab.freedesktop.org/drm/intel/issues/4077
  [i915#4079]: https://gitlab.freedesktop.org/drm/intel/issues/4079
  [i915#4083]: https://gitlab.freedesktop.org/drm/intel/issues/4083
  [i915#4103]: https://gitlab.freedesktop.org/drm/intel/issues/4103
  [i915#4212]: https://gitlab.freedesktop.org/drm/intel/issues/4212
  [i915#4213]: https://gitlab.freedesktop.org/drm/intel/issues/4213
  [i915#4215]: https://gitlab.freedesktop.org/drm/intel/issues/4215
  [i915#4494]: https://gitlab.freedesktop.org/drm/intel/issues/4494
  [i915#4528]: https://gitlab.freedesktop.org/drm/intel/issues/4528
  [i915#4579]: https://gitlab.freedesktop.org/drm/intel/issues/4579
  [i915#4613]: https://gitlab.freedesktop.org/drm/intel/issues/4613
  [i915#4785]: https://gitlab.freedesktop.org/drm/intel/issues/4785
  [i915#4873]: https://gitlab.freedesktop.org/drm/intel/issues/4873
  [i915#4957]: https://gitlab.freedesktop.org/drm/intel/issues/4957
  [i915#5190]: https://gitlab.freedesktop.org/drm/intel/issues/5190
  [i915#5274]: https://gitlab.freedesktop.org/drm/intel/issues/5274
  [i915#533]: https://gitlab.freedesktop.org/drm/intel/issues/533
  [i915#5354]: https://gitlab.freedesktop.org/drm/intel/issues/5354
  [i915#5716]: https://gitlab.freedesktop.org/drm/intel/issues/5716
  [i915#5763]: https://gitlab.freedesktop.org/drm/intel/issues/5763
  [i915#5879]: https://gitlab.freedesktop.org/drm/intel/issues/5879
  [i915#5903]: https://gitlab.freedesktop.org/drm/intel/issues/5903


Build changes
-------------

  * Linux: CI_DRM_11668 -> Patchwork_93447v3

  CI-20190529: 20190529
  CI_DRM_11668: 0aeb4ff42e2e9fd1dee49e6bb79cc81c8eafd3fc @ git://anongit.freedesktop.org/gfx-ci/linux
  IGT_6477: 70cfef35851891aeaa829f5e8dcb7fd43b454bde @ https://gitlab.freedesktop.org/drm/igt-gpu-tools.git
  Patchwork_93447v3: 0aeb4ff42e2e9fd1dee49e6bb79cc81c8eafd3fc @ git://anongit.freedesktop.org/gfx-ci/linux


### Linux commits

45b73cbaf9c4 drm/doc/rfc: VM_BIND uapi definition
8b2e89a9b707 drm/i915: Update i915 uapi documentation
f1f39186a68b drm/doc/rfc: VM_BIND feature design document

== Logs ==

For more details see: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/index.html

^ permalink raw reply	[flat|nested] 121+ messages in thread

* [Intel-gfx] ✗ Fi.CI.IGT: failure for drm/doc/rfc: i915 VM_BIND feature design + uapi (rev3)
  2022-05-17 18:32 ` Niranjana Vishwanathapura
                   ` (6 preceding siblings ...)
  (?)
@ 2022-05-18  2:33 ` Patchwork
  -1 siblings, 0 replies; 121+ messages in thread
From: Patchwork @ 2022-05-18  2:33 UTC (permalink / raw)
  To: Niranjana Vishwanathapura; +Cc: intel-gfx

== Series Details ==

Series: drm/doc/rfc: i915 VM_BIND feature design + uapi (rev3)
URL   : https://patchwork.freedesktop.org/series/93447/
State : failure

== Summary ==

CI Bug Log - changes from CI_DRM_11668_full -> Patchwork_93447v3_full
====================================================

Summary
-------

  **FAILURE**

  Serious unknown changes coming with Patchwork_93447v3_full absolutely need to be
  verified manually.
  
  If you think the reported changes have nothing to do with the changes
  introduced in Patchwork_93447v3_full, please notify your bug team to allow them
  to document this new failure mode, which will reduce false positives in CI.

  

Participating hosts (11 -> 11)
------------------------------

  No changes in participating hosts

Possible new issues
-------------------

  Here are the unknown changes that may have been introduced in Patchwork_93447v3_full:

### IGT changes ###

#### Possible regressions ####

  * igt@kms_dither@fb-8bpc-vs-panel-6bpc@edp-1-pipe-a:
    - shard-iclb:         NOTRUN -> [SKIP][1]
   [1]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-iclb8/igt@kms_dither@fb-8bpc-vs-panel-6bpc@edp-1-pipe-a.html

  
Known issues
------------

  Here are the changes found in Patchwork_93447v3_full that come from known issues:

### CI changes ###

#### Issues hit ####

  * boot:
    - shard-apl:          ([PASS][2], [PASS][3], [PASS][4], [PASS][5], [PASS][6], [PASS][7], [PASS][8], [PASS][9], [PASS][10], [PASS][11], [PASS][12], [PASS][13], [PASS][14], [PASS][15], [PASS][16], [PASS][17], [PASS][18], [PASS][19], [PASS][20], [PASS][21], [PASS][22], [PASS][23], [PASS][24], [PASS][25], [PASS][26]) -> ([PASS][27], [PASS][28], [PASS][29], [PASS][30], [PASS][31], [PASS][32], [PASS][33], [PASS][34], [PASS][35], [PASS][36], [PASS][37], [PASS][38], [PASS][39], [PASS][40], [PASS][41], [PASS][42], [PASS][43], [PASS][44], [PASS][45], [FAIL][46], [PASS][47], [PASS][48], [PASS][49], [PASS][50], [PASS][51]) ([i915#4386])
   [2]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11668/shard-apl8/boot.html
   [3]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11668/shard-apl8/boot.html
   [4]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11668/shard-apl8/boot.html
   [5]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11668/shard-apl7/boot.html
   [6]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11668/shard-apl7/boot.html
   [7]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11668/shard-apl7/boot.html
   [8]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11668/shard-apl7/boot.html
   [9]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11668/shard-apl6/boot.html
   [10]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11668/shard-apl6/boot.html
   [11]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11668/shard-apl6/boot.html
   [12]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11668/shard-apl6/boot.html
   [13]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11668/shard-apl4/boot.html
   [14]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11668/shard-apl4/boot.html
   [15]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11668/shard-apl4/boot.html
   [16]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11668/shard-apl4/boot.html
   [17]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11668/shard-apl3/boot.html
   [18]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11668/shard-apl3/boot.html
   [19]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11668/shard-apl3/boot.html
   [20]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11668/shard-apl2/boot.html
   [21]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11668/shard-apl2/boot.html
   [22]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11668/shard-apl2/boot.html
   [23]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11668/shard-apl1/boot.html
   [24]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11668/shard-apl1/boot.html
   [25]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11668/shard-apl1/boot.html
   [26]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11668/shard-apl1/boot.html
   [27]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-apl8/boot.html
   [28]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-apl8/boot.html
   [29]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-apl8/boot.html
   [30]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-apl7/boot.html
   [31]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-apl7/boot.html
   [32]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-apl7/boot.html
   [33]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-apl6/boot.html
   [34]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-apl6/boot.html
   [35]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-apl6/boot.html
   [36]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-apl4/boot.html
   [37]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-apl4/boot.html
   [38]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-apl4/boot.html
   [39]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-apl4/boot.html
   [40]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-apl3/boot.html
   [41]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-apl3/boot.html
   [42]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-apl3/boot.html
   [43]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-apl3/boot.html
   [44]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-apl2/boot.html
   [45]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-apl2/boot.html
   [46]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-apl2/boot.html
   [47]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-apl2/boot.html
   [48]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-apl2/boot.html
   [49]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-apl1/boot.html
   [50]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-apl1/boot.html
   [51]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-apl1/boot.html

  

### IGT changes ###

#### Issues hit ####

  * igt@gem_ccs@ctrl-surf-copy:
    - shard-iclb:         NOTRUN -> [SKIP][52] ([i915#5327])
   [52]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-iclb8/igt@gem_ccs@ctrl-surf-copy.html

  * igt@gem_ccs@suspend-resume:
    - shard-kbl:          NOTRUN -> [SKIP][53] ([fdo#109271]) +27 similar issues
   [53]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-kbl6/igt@gem_ccs@suspend-resume.html

  * igt@gem_eio@unwedge-stress:
    - shard-snb:          NOTRUN -> [FAIL][54] ([i915#3354])
   [54]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-snb7/igt@gem_eio@unwedge-stress.html

  * igt@gem_exec_fair@basic-none-solo@rcs0:
    - shard-apl:          [PASS][55] -> [FAIL][56] ([i915#2842])
   [55]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11668/shard-apl8/igt@gem_exec_fair@basic-none-solo@rcs0.html
   [56]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-apl8/igt@gem_exec_fair@basic-none-solo@rcs0.html
    - shard-iclb:         NOTRUN -> [FAIL][57] ([i915#2842])
   [57]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-iclb7/igt@gem_exec_fair@basic-none-solo@rcs0.html

  * igt@gem_exec_fair@basic-none@rcs0:
    - shard-kbl:          [PASS][58] -> [FAIL][59] ([i915#2842]) +5 similar issues
   [58]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11668/shard-kbl4/igt@gem_exec_fair@basic-none@rcs0.html
   [59]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-kbl1/igt@gem_exec_fair@basic-none@rcs0.html

  * igt@gem_exec_fair@basic-pace@bcs0:
    - shard-iclb:         [PASS][60] -> [FAIL][61] ([i915#2842])
   [60]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11668/shard-iclb2/igt@gem_exec_fair@basic-pace@bcs0.html
   [61]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-iclb4/igt@gem_exec_fair@basic-pace@bcs0.html

  * igt@gem_exec_fair@basic-throttle@rcs0:
    - shard-iclb:         [PASS][62] -> [FAIL][63] ([i915#2849])
   [62]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11668/shard-iclb8/igt@gem_exec_fair@basic-throttle@rcs0.html
   [63]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-iclb5/igt@gem_exec_fair@basic-throttle@rcs0.html

  * igt@gem_exec_flush@basic-uc-prw-default:
    - shard-snb:          [PASS][64] -> [SKIP][65] ([fdo#109271]) +2 similar issues
   [64]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11668/shard-snb2/igt@gem_exec_flush@basic-uc-prw-default.html
   [65]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-snb6/igt@gem_exec_flush@basic-uc-prw-default.html

  * igt@gem_lmem_swapping@parallel-multi:
    - shard-apl:          NOTRUN -> [SKIP][66] ([fdo#109271] / [i915#4613])
   [66]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-apl2/igt@gem_lmem_swapping@parallel-multi.html

  * igt@gem_lmem_swapping@verify-random:
    - shard-skl:          NOTRUN -> [SKIP][67] ([fdo#109271] / [i915#4613]) +1 similar issue
   [67]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-skl1/igt@gem_lmem_swapping@verify-random.html

  * igt@gem_mmap_gtt@fault-concurrent-y:
    - shard-snb:          [PASS][68] -> [INCOMPLETE][69] ([i915#5161])
   [68]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11668/shard-snb4/igt@gem_mmap_gtt@fault-concurrent-y.html
   [69]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-snb7/igt@gem_mmap_gtt@fault-concurrent-y.html

  * igt@gem_pxp@protected-raw-src-copy-not-readible:
    - shard-iclb:         NOTRUN -> [SKIP][70] ([i915#4270])
   [70]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-iclb7/igt@gem_pxp@protected-raw-src-copy-not-readible.html

  * igt@gem_render_copy@y-tiled-mc-ccs-to-vebox-y-tiled:
    - shard-iclb:         NOTRUN -> [SKIP][71] ([i915#768])
   [71]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-iclb8/igt@gem_render_copy@y-tiled-mc-ccs-to-vebox-y-tiled.html

  * igt@gem_userptr_blits@dmabuf-sync:
    - shard-iclb:         NOTRUN -> [SKIP][72] ([i915#3323])
   [72]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-iclb8/igt@gem_userptr_blits@dmabuf-sync.html

  * igt@gem_userptr_blits@dmabuf-unsync:
    - shard-iclb:         NOTRUN -> [SKIP][73] ([i915#3297])
   [73]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-iclb8/igt@gem_userptr_blits@dmabuf-unsync.html

  * igt@gem_userptr_blits@vma-merge:
    - shard-skl:          NOTRUN -> [FAIL][74] ([i915#3318])
   [74]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-skl1/igt@gem_userptr_blits@vma-merge.html

  * igt@gen3_mixed_blits:
    - shard-iclb:         NOTRUN -> [SKIP][75] ([fdo#109289]) +1 similar issue
   [75]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-iclb8/igt@gen3_mixed_blits.html

  * igt@gen9_exec_parse@bb-oversize:
    - shard-tglb:         NOTRUN -> [SKIP][76] ([i915#2527] / [i915#2856])
   [76]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-tglb2/igt@gen9_exec_parse@bb-oversize.html

  * igt@gen9_exec_parse@bb-start-out:
    - shard-iclb:         NOTRUN -> [SKIP][77] ([i915#2856])
   [77]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-iclb8/igt@gen9_exec_parse@bb-start-out.html

  * igt@i915_pm_dc@dc6-psr:
    - shard-skl:          NOTRUN -> [FAIL][78] ([i915#454])
   [78]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-skl10/igt@i915_pm_dc@dc6-psr.html

  * igt@i915_pm_rc6_residency@rc6-fence:
    - shard-iclb:         NOTRUN -> [WARN][79] ([i915#2684])
   [79]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-iclb8/igt@i915_pm_rc6_residency@rc6-fence.html

  * igt@i915_pm_rpm@modeset-non-lpsp-stress:
    - shard-tglb:         NOTRUN -> [SKIP][80] ([fdo#111644] / [i915#1397] / [i915#2411])
   [80]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-tglb2/igt@i915_pm_rpm@modeset-non-lpsp-stress.html

  * igt@kms_atomic_transition@plane-all-modeset-transition-fencing:
    - shard-iclb:         NOTRUN -> [SKIP][81] ([i915#1769])
   [81]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-iclb7/igt@kms_atomic_transition@plane-all-modeset-transition-fencing.html

  * igt@kms_big_fb@4-tiled-16bpp-rotate-0:
    - shard-tglb:         NOTRUN -> [SKIP][82] ([i915#5286])
   [82]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-tglb2/igt@kms_big_fb@4-tiled-16bpp-rotate-0.html

  * igt@kms_big_fb@4-tiled-max-hw-stride-64bpp-rotate-0-hflip:
    - shard-iclb:         NOTRUN -> [SKIP][83] ([i915#5286]) +1 similar issue
   [83]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-iclb7/igt@kms_big_fb@4-tiled-max-hw-stride-64bpp-rotate-0-hflip.html

  * igt@kms_big_fb@linear-32bpp-rotate-90:
    - shard-iclb:         NOTRUN -> [SKIP][84] ([fdo#110725] / [fdo#111614]) +1 similar issue
   [84]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-iclb8/igt@kms_big_fb@linear-32bpp-rotate-90.html

  * igt@kms_big_fb@y-tiled-64bpp-rotate-270:
    - shard-tglb:         NOTRUN -> [SKIP][85] ([fdo#111614])
   [85]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-tglb2/igt@kms_big_fb@y-tiled-64bpp-rotate-270.html

  * igt@kms_big_fb@yf-tiled-16bpp-rotate-270:
    - shard-tglb:         NOTRUN -> [SKIP][86] ([fdo#111615])
   [86]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-tglb2/igt@kms_big_fb@yf-tiled-16bpp-rotate-270.html

  * igt@kms_ccs@pipe-a-ccs-on-another-bo-y_tiled_gen12_mc_ccs:
    - shard-tglb:         NOTRUN -> [SKIP][87] ([i915#3689] / [i915#3886])
   [87]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-tglb2/igt@kms_ccs@pipe-a-ccs-on-another-bo-y_tiled_gen12_mc_ccs.html

  * igt@kms_ccs@pipe-a-crc-primary-basic-y_tiled_gen12_rc_ccs_cc:
    - shard-apl:          NOTRUN -> [SKIP][88] ([fdo#109271] / [i915#3886]) +1 similar issue
   [88]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-apl3/igt@kms_ccs@pipe-a-crc-primary-basic-y_tiled_gen12_rc_ccs_cc.html

  * igt@kms_ccs@pipe-b-bad-pixel-format-y_tiled_gen12_rc_ccs_cc:
    - shard-kbl:          NOTRUN -> [SKIP][89] ([fdo#109271] / [i915#3886])
   [89]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-kbl6/igt@kms_ccs@pipe-b-bad-pixel-format-y_tiled_gen12_rc_ccs_cc.html

  * igt@kms_ccs@pipe-b-ccs-on-another-bo-yf_tiled_ccs:
    - shard-tglb:         NOTRUN -> [SKIP][90] ([fdo#111615] / [i915#3689])
   [90]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-tglb2/igt@kms_ccs@pipe-b-ccs-on-another-bo-yf_tiled_ccs.html

  * igt@kms_ccs@pipe-b-crc-primary-rotation-180-y_tiled_gen12_rc_ccs_cc:
    - shard-skl:          NOTRUN -> [SKIP][91] ([fdo#109271] / [i915#3886]) +5 similar issues
   [91]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-skl3/igt@kms_ccs@pipe-b-crc-primary-rotation-180-y_tiled_gen12_rc_ccs_cc.html

  * igt@kms_ccs@pipe-b-random-ccs-data-y_tiled_gen12_rc_ccs_cc:
    - shard-iclb:         NOTRUN -> [SKIP][92] ([fdo#109278] / [i915#3886]) +4 similar issues
   [92]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-iclb8/igt@kms_ccs@pipe-b-random-ccs-data-y_tiled_gen12_rc_ccs_cc.html

  * igt@kms_cdclk@mode-transition:
    - shard-apl:          NOTRUN -> [SKIP][93] ([fdo#109271]) +60 similar issues
   [93]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-apl7/igt@kms_cdclk@mode-transition.html

  * igt@kms_chamelium@hdmi-edid-read:
    - shard-tglb:         NOTRUN -> [SKIP][94] ([fdo#109284] / [fdo#111827]) +2 similar issues
   [94]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-tglb2/igt@kms_chamelium@hdmi-edid-read.html

  * igt@kms_chamelium@hdmi-hpd-storm-disable:
    - shard-skl:          NOTRUN -> [SKIP][95] ([fdo#109271] / [fdo#111827]) +11 similar issues
   [95]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-skl1/igt@kms_chamelium@hdmi-hpd-storm-disable.html

  * igt@kms_chamelium@vga-hpd:
    - shard-apl:          NOTRUN -> [SKIP][96] ([fdo#109271] / [fdo#111827]) +5 similar issues
   [96]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-apl2/igt@kms_chamelium@vga-hpd.html

  * igt@kms_color_chamelium@pipe-b-ctm-max:
    - shard-kbl:          NOTRUN -> [SKIP][97] ([fdo#109271] / [fdo#111827]) +1 similar issue
   [97]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-kbl6/igt@kms_color_chamelium@pipe-b-ctm-max.html

  * igt@kms_color_chamelium@pipe-c-ctm-0-5:
    - shard-iclb:         NOTRUN -> [SKIP][98] ([fdo#109284] / [fdo#111827]) +2 similar issues
   [98]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-iclb8/igt@kms_color_chamelium@pipe-c-ctm-0-5.html

  * igt@kms_color_chamelium@pipe-c-ctm-max:
    - shard-snb:          NOTRUN -> [SKIP][99] ([fdo#109271] / [fdo#111827])
   [99]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-snb7/igt@kms_color_chamelium@pipe-c-ctm-max.html

  * igt@kms_color_chamelium@pipe-d-degamma:
    - shard-iclb:         NOTRUN -> [SKIP][100] ([fdo#109278] / [fdo#109284] / [fdo#111827])
   [100]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-iclb7/igt@kms_color_chamelium@pipe-d-degamma.html

  * igt@kms_content_protection@legacy:
    - shard-apl:          NOTRUN -> [TIMEOUT][101] ([i915#1319])
   [101]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-apl3/igt@kms_content_protection@legacy.html

  * igt@kms_content_protection@uevent:
    - shard-kbl:          NOTRUN -> [FAIL][102] ([i915#2105])
   [102]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-kbl6/igt@kms_content_protection@uevent.html

  * igt@kms_cursor_crc@pipe-b-cursor-32x10-rapid-movement:
    - shard-iclb:         NOTRUN -> [SKIP][103] ([fdo#109278]) +14 similar issues
   [103]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-iclb7/igt@kms_cursor_crc@pipe-b-cursor-32x10-rapid-movement.html

  * igt@kms_cursor_crc@pipe-c-cursor-512x170-sliding:
    - shard-tglb:         NOTRUN -> [SKIP][104] ([fdo#109279] / [i915#3359] / [i915#5691])
   [104]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-tglb2/igt@kms_cursor_crc@pipe-c-cursor-512x170-sliding.html

  * igt@kms_cursor_crc@pipe-c-cursor-max-size-onscreen:
    - shard-tglb:         NOTRUN -> [SKIP][105] ([i915#3359]) +1 similar issue
   [105]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-tglb2/igt@kms_cursor_crc@pipe-c-cursor-max-size-onscreen.html

  * igt@kms_cursor_legacy@cursora-vs-flipb-toggle:
    - shard-iclb:         NOTRUN -> [SKIP][106] ([fdo#109274] / [fdo#109278])
   [106]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-iclb8/igt@kms_cursor_legacy@cursora-vs-flipb-toggle.html

  * igt@kms_cursor_legacy@flip-vs-cursor-atomic-transitions-varying-size:
    - shard-glk:          [PASS][107] -> [FAIL][108] ([i915#2346] / [i915#533])
   [107]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11668/shard-glk7/igt@kms_cursor_legacy@flip-vs-cursor-atomic-transitions-varying-size.html
   [108]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-glk7/igt@kms_cursor_legacy@flip-vs-cursor-atomic-transitions-varying-size.html

  * igt@kms_cursor_legacy@pipe-d-torture-move:
    - shard-skl:          NOTRUN -> [SKIP][109] ([fdo#109271]) +117 similar issues
   [109]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-skl1/igt@kms_cursor_legacy@pipe-d-torture-move.html

  * igt@kms_cursor_legacy@short-busy-flip-before-cursor-toggle:
    - shard-snb:          NOTRUN -> [SKIP][110] ([fdo#109271]) +58 similar issues
   [110]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-snb7/igt@kms_cursor_legacy@short-busy-flip-before-cursor-toggle.html

  * igt@kms_draw_crc@draw-method-xrgb2101010-blt-4tiled:
    - shard-iclb:         NOTRUN -> [SKIP][111] ([i915#5287])
   [111]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-iclb8/igt@kms_draw_crc@draw-method-xrgb2101010-blt-4tiled.html

  * igt@kms_flip@2x-absolute-wf_vblank:
    - shard-iclb:         NOTRUN -> [SKIP][112] ([fdo#109274])
   [112]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-iclb7/igt@kms_flip@2x-absolute-wf_vblank.html

  * igt@kms_flip@2x-flip-vs-fences:
    - shard-tglb:         NOTRUN -> [SKIP][113] ([fdo#109274] / [fdo#111825]) +2 similar issues
   [113]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-tglb2/igt@kms_flip@2x-flip-vs-fences.html

  * igt@kms_flip@flip-vs-expired-vblank@b-edp1:
    - shard-skl:          [PASS][114] -> [FAIL][115] ([i915#79])
   [114]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11668/shard-skl1/igt@kms_flip@flip-vs-expired-vblank@b-edp1.html
   [115]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-skl10/igt@kms_flip@flip-vs-expired-vblank@b-edp1.html

  * igt@kms_flip@flip-vs-suspend@a-dp1:
    - shard-kbl:          [PASS][116] -> [INCOMPLETE][117] ([i915#3614])
   [116]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11668/shard-kbl7/igt@kms_flip@flip-vs-suspend@a-dp1.html
   [117]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-kbl4/igt@kms_flip@flip-vs-suspend@a-dp1.html

  * igt@kms_flip_scaled_crc@flip-32bpp-ytile-to-32bpp-ytileccs-downscaling:
    - shard-tglb:         NOTRUN -> [SKIP][118] ([i915#2587])
   [118]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-tglb2/igt@kms_flip_scaled_crc@flip-32bpp-ytile-to-32bpp-ytileccs-downscaling.html
    - shard-iclb:         [PASS][119] -> [SKIP][120] ([i915#3701]) +1 similar issue
   [119]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11668/shard-iclb4/igt@kms_flip_scaled_crc@flip-32bpp-ytile-to-32bpp-ytileccs-downscaling.html
   [120]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-iclb2/igt@kms_flip_scaled_crc@flip-32bpp-ytile-to-32bpp-ytileccs-downscaling.html

  * igt@kms_frontbuffer_tracking@fbc-2p-primscrn-pri-indfb-draw-render:
    - shard-iclb:         NOTRUN -> [SKIP][121] ([fdo#109280]) +13 similar issues
   [121]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-iclb7/igt@kms_frontbuffer_tracking@fbc-2p-primscrn-pri-indfb-draw-render.html

  * igt@kms_frontbuffer_tracking@fbc-2p-primscrn-pri-shrfb-draw-render:
    - shard-tglb:         NOTRUN -> [SKIP][122] ([fdo#109280] / [fdo#111825]) +4 similar issues
   [122]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-tglb2/igt@kms_frontbuffer_tracking@fbc-2p-primscrn-pri-shrfb-draw-render.html

  * igt@kms_plane@plane-panning-bottom-right-suspend@pipe-b-planes:
    - shard-apl:          [PASS][123] -> [DMESG-WARN][124] ([i915#180]) +2 similar issues
   [123]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11668/shard-apl8/igt@kms_plane@plane-panning-bottom-right-suspend@pipe-b-planes.html
   [124]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-apl7/igt@kms_plane@plane-panning-bottom-right-suspend@pipe-b-planes.html

  * igt@kms_plane_alpha_blend@pipe-a-constant-alpha-min:
    - shard-skl:          NOTRUN -> [FAIL][125] ([fdo#108145] / [i915#265]) +1 similar issue
   [125]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-skl3/igt@kms_plane_alpha_blend@pipe-a-constant-alpha-min.html

  * igt@kms_plane_alpha_blend@pipe-c-coverage-7efc:
    - shard-skl:          [PASS][126] -> [FAIL][127] ([fdo#108145] / [i915#265]) +2 similar issues
   [126]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11668/shard-skl6/igt@kms_plane_alpha_blend@pipe-c-coverage-7efc.html
   [127]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-skl7/igt@kms_plane_alpha_blend@pipe-c-coverage-7efc.html

  * igt@kms_plane_lowres@pipe-c-tiling-none:
    - shard-tglb:         NOTRUN -> [SKIP][128] ([i915#3536])
   [128]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-tglb2/igt@kms_plane_lowres@pipe-c-tiling-none.html

  * igt@kms_plane_scaling@downscale-with-rotation-factor-0-25@pipe-c-edp-1-downscale-with-rotation:
    - shard-tglb:         NOTRUN -> [SKIP][129] ([i915#5176]) +3 similar issues
   [129]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-tglb2/igt@kms_plane_scaling@downscale-with-rotation-factor-0-25@pipe-c-edp-1-downscale-with-rotation.html

  * igt@kms_plane_scaling@planes-downscale-factor-0-5@pipe-a-edp-1-planes-downscale:
    - shard-iclb:         [PASS][130] -> [SKIP][131] ([i915#5235]) +2 similar issues
   [130]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11668/shard-iclb7/igt@kms_plane_scaling@planes-downscale-factor-0-5@pipe-a-edp-1-planes-downscale.html
   [131]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-iclb2/igt@kms_plane_scaling@planes-downscale-factor-0-5@pipe-a-edp-1-planes-downscale.html

  * igt@kms_psr2_su@page_flip-nv12:
    - shard-iclb:         NOTRUN -> [SKIP][132] ([fdo#109642] / [fdo#111068] / [i915#658])
   [132]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-iclb7/igt@kms_psr2_su@page_flip-nv12.html

  * igt@kms_psr2_su@page_flip-p010:
    - shard-skl:          NOTRUN -> [SKIP][133] ([fdo#109271] / [i915#658])
   [133]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-skl1/igt@kms_psr2_su@page_flip-p010.html

  * igt@kms_psr@psr2_cursor_render:
    - shard-iclb:         [PASS][134] -> [SKIP][135] ([fdo#109441])
   [134]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11668/shard-iclb2/igt@kms_psr@psr2_cursor_render.html
   [135]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-iclb4/igt@kms_psr@psr2_cursor_render.html

  * igt@kms_psr@psr2_sprite_mmap_gtt:
    - shard-tglb:         NOTRUN -> [FAIL][136] ([i915#132] / [i915#3467])
   [136]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-tglb2/igt@kms_psr@psr2_sprite_mmap_gtt.html

  * igt@kms_writeback@writeback-check-output:
    - shard-skl:          NOTRUN -> [SKIP][137] ([fdo#109271] / [i915#2437])
   [137]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-skl1/igt@kms_writeback@writeback-check-output.html

  * igt@kms_writeback@writeback-pixel-formats:
    - shard-iclb:         NOTRUN -> [SKIP][138] ([i915#2437]) +1 similar issue
   [138]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-iclb7/igt@kms_writeback@writeback-pixel-formats.html

  * igt@nouveau_crc@pipe-b-source-outp-inactive:
    - shard-iclb:         NOTRUN -> [SKIP][139] ([i915#2530])
   [139]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-iclb8/igt@nouveau_crc@pipe-b-source-outp-inactive.html

  * igt@perf@unprivileged-single-ctx-counters:
    - shard-tglb:         NOTRUN -> [SKIP][140] ([fdo#109289])
   [140]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-tglb2/igt@perf@unprivileged-single-ctx-counters.html

  * igt@prime_nv_api@i915_self_import_to_different_fd:
    - shard-tglb:         NOTRUN -> [SKIP][141] ([fdo#109291])
   [141]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-tglb2/igt@prime_nv_api@i915_self_import_to_different_fd.html

  * igt@prime_nv_api@nv_i915_reimport_twice_check_flink_name:
    - shard-iclb:         NOTRUN -> [SKIP][142] ([fdo#109291])
   [142]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-iclb7/igt@prime_nv_api@nv_i915_reimport_twice_check_flink_name.html

  * igt@prime_vgem@basic-userptr:
    - shard-iclb:         NOTRUN -> [SKIP][143] ([i915#3301])
   [143]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-iclb7/igt@prime_vgem@basic-userptr.html

  * igt@syncobj_timeline@invalid-transfer-non-existent-point:
    - shard-apl:          NOTRUN -> [DMESG-WARN][144] ([i915#5098])
   [144]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-apl3/igt@syncobj_timeline@invalid-transfer-non-existent-point.html
    - shard-skl:          NOTRUN -> [DMESG-WARN][145] ([i915#5098])
   [145]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-skl10/igt@syncobj_timeline@invalid-transfer-non-existent-point.html

  * igt@sysfs_clients@fair-1:
    - shard-iclb:         NOTRUN -> [SKIP][146] ([i915#2994])
   [146]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-iclb8/igt@sysfs_clients@fair-1.html

  * igt@sysfs_clients@sema-25:
    - shard-skl:          NOTRUN -> [SKIP][147] ([fdo#109271] / [i915#2994]) +1 similar issue
   [147]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-skl10/igt@sysfs_clients@sema-25.html
    - shard-apl:          NOTRUN -> [SKIP][148] ([fdo#109271] / [i915#2994])
   [148]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-apl3/igt@sysfs_clients@sema-25.html

  * igt@sysfs_clients@split-25:
    - shard-kbl:          NOTRUN -> [SKIP][149] ([fdo#109271] / [i915#2994])
   [149]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-kbl6/igt@sysfs_clients@split-25.html

  * igt@sysfs_heartbeat_interval@mixed@bcs0:
    - shard-skl:          [PASS][150] -> [WARN][151] ([i915#4055])
   [150]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11668/shard-skl8/igt@sysfs_heartbeat_interval@mixed@bcs0.html
   [151]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-skl1/igt@sysfs_heartbeat_interval@mixed@bcs0.html

  * igt@sysfs_heartbeat_interval@mixed@vcs0:
    - shard-skl:          [PASS][152] -> [FAIL][153] ([i915#1731])
   [152]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11668/shard-skl8/igt@sysfs_heartbeat_interval@mixed@vcs0.html
   [153]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-skl1/igt@sysfs_heartbeat_interval@mixed@vcs0.html

  
#### Possible fixes ####

  * igt@gem_ctx_isolation@preservation-s3@vecs0:
    - shard-skl:          [INCOMPLETE][154] ([i915#4939]) -> [PASS][155]
   [154]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11668/shard-skl4/igt@gem_ctx_isolation@preservation-s3@vecs0.html
   [155]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-skl1/igt@gem_ctx_isolation@preservation-s3@vecs0.html

  * igt@gem_eio@in-flight-contexts-1us:
    - {shard-tglu}:       [TIMEOUT][156] ([i915#3063]) -> [PASS][157]
   [156]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11668/shard-tglu-3/igt@gem_eio@in-flight-contexts-1us.html
   [157]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-tglu-4/igt@gem_eio@in-flight-contexts-1us.html

  * igt@gem_eio@unwedge-stress:
    - shard-iclb:         [TIMEOUT][158] ([i915#3070]) -> [PASS][159]
   [158]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11668/shard-iclb6/igt@gem_eio@unwedge-stress.html
   [159]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-iclb2/igt@gem_eio@unwedge-stress.html

  * igt@gem_exec_fair@basic-pace-solo@rcs0:
    - shard-tglb:         [FAIL][160] ([i915#2842]) -> [PASS][161]
   [160]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11668/shard-tglb5/igt@gem_exec_fair@basic-pace-solo@rcs0.html
   [161]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-tglb6/igt@gem_exec_fair@basic-pace-solo@rcs0.html

  * igt@gem_exec_fair@basic-pace@vecs0:
    - shard-kbl:          [FAIL][162] ([i915#2842]) -> [PASS][163] +1 similar issue
   [162]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11668/shard-kbl6/igt@gem_exec_fair@basic-pace@vecs0.html
   [163]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-kbl3/igt@gem_exec_fair@basic-pace@vecs0.html

  * igt@gem_exec_flush@basic-batch-kernel-default-uc:
    - shard-snb:          [SKIP][164] ([fdo#109271]) -> [PASS][165]
   [164]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11668/shard-snb6/igt@gem_exec_flush@basic-batch-kernel-default-uc.html
   [165]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-snb4/igt@gem_exec_flush@basic-batch-kernel-default-uc.html

  * igt@gen9_exec_parse@allowed-single:
    - shard-kbl:          [DMESG-WARN][166] ([i915#5566] / [i915#716]) -> [PASS][167]
   [166]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11668/shard-kbl4/igt@gen9_exec_parse@allowed-single.html
   [167]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-kbl6/igt@gen9_exec_parse@allowed-single.html

  * igt@i915_pm_rpm@cursor:
    - shard-iclb:         [SKIP][168] -> [PASS][169]
   [168]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11668/shard-iclb6/igt@i915_pm_rpm@cursor.html
   [169]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-iclb2/igt@i915_pm_rpm@cursor.html

  * igt@i915_selftest@live@gt_lrc:
    - shard-iclb:         [DMESG-WARN][170] ([i915#2867]) -> [PASS][171] +7 similar issues
   [170]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11668/shard-iclb6/igt@i915_selftest@live@gt_lrc.html
   [171]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-iclb2/igt@i915_selftest@live@gt_lrc.html

  * igt@i915_selftest@live@hangcheck:
    - shard-snb:          [INCOMPLETE][172] ([i915#3921]) -> [PASS][173]
   [172]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11668/shard-snb6/igt@i915_selftest@live@hangcheck.html
   [173]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-snb7/igt@i915_selftest@live@hangcheck.html
    - shard-tglb:         [DMESG-WARN][174] ([i915#5591]) -> [PASS][175]
   [174]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11668/shard-tglb8/igt@i915_selftest@live@hangcheck.html
   [175]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-tglb6/igt@i915_selftest@live@hangcheck.html

  * igt@i915_selftest@perf@engine_cs:
    - shard-tglb:         [DMESG-WARN][176] ([i915#2867]) -> [PASS][177] +2 similar issues
   [176]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11668/shard-tglb2/igt@i915_selftest@perf@engine_cs.html
   [177]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-tglb8/igt@i915_selftest@perf@engine_cs.html

  * igt@kms_cursor_legacy@cursor-vs-flip-toggle:
    - shard-iclb:         [FAIL][178] ([i915#5072]) -> [PASS][179]
   [178]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11668/shard-iclb7/igt@kms_cursor_legacy@cursor-vs-flip-toggle.html
   [179]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-iclb4/igt@kms_cursor_legacy@cursor-vs-flip-toggle.html

  * igt@kms_cursor_legacy@flip-vs-cursor-atomic-transitions:
    - shard-glk:          [FAIL][180] ([i915#2346]) -> [PASS][181]
   [180]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11668/shard-glk2/igt@kms_cursor_legacy@flip-vs-cursor-atomic-transitions.html
   [181]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-glk2/igt@kms_cursor_legacy@flip-vs-cursor-atomic-transitions.html

  * igt@kms_fbcon_fbt@fbc-suspend:
    - shard-apl:          [FAIL][182] ([i915#4767]) -> [PASS][183]
   [182]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11668/shard-apl1/igt@kms_fbcon_fbt@fbc-suspend.html
   [183]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-apl6/igt@kms_fbcon_fbt@fbc-suspend.html

  * igt@kms_flip@flip-vs-expired-vblank-interruptible@b-edp1:
    - shard-skl:          [FAIL][184] ([i915#79]) -> [PASS][185]
   [184]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11668/shard-skl8/igt@kms_flip@flip-vs-expired-vblank-interruptible@b-edp1.html
   [185]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-skl6/igt@kms_flip@flip-vs-expired-vblank-interruptible@b-edp1.html

  * igt@kms_flip@flip-vs-suspend@a-dp1:
    - shard-apl:          [DMESG-WARN][186] ([i915#180]) -> [PASS][187] +2 similar issues
   [186]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11668/shard-apl1/igt@kms_flip@flip-vs-suspend@a-dp1.html
   [187]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-apl7/igt@kms_flip@flip-vs-suspend@a-dp1.html

  * igt@kms_flip@flip-vs-suspend@a-edp1:
    - shard-skl:          [INCOMPLETE][188] ([i915#4839]) -> [PASS][189]
   [188]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11668/shard-skl9/igt@kms_flip@flip-vs-suspend@a-edp1.html
   [189]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-skl3/igt@kms_flip@flip-vs-suspend@a-edp1.html

  * igt@kms_flip_scaled_crc@flip-64bpp-ytile-to-16bpp-ytile-downscaling:
    - shard-iclb:         [SKIP][190] ([i915#3701]) -> [PASS][191]
   [190]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11668/shard-iclb2/igt@kms_flip_scaled_crc@flip-64bpp-ytile-to-16bpp-ytile-downscaling.html
   [191]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-iclb8/igt@kms_flip_scaled_crc@flip-64bpp-ytile-to-16bpp-ytile-downscaling.html

  * igt@kms_plane_alpha_blend@pipe-b-constant-alpha-min:
    - shard-skl:          [FAIL][192] ([fdo#108145] / [i915#265]) -> [PASS][193]
   [192]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11668/shard-skl8/igt@kms_plane_alpha_blend@pipe-b-constant-alpha-min.html
   [193]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-skl8/igt@kms_plane_alpha_blend@pipe-b-constant-alpha-min.html

  * igt@kms_psr2_su@page_flip-xrgb8888:
    - shard-iclb:         [SKIP][194] ([fdo#109642] / [fdo#111068] / [i915#658]) -> [PASS][195]
   [194]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11668/shard-iclb4/igt@kms_psr2_su@page_flip-xrgb8888.html
   [195]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-iclb2/igt@kms_psr2_su@page_flip-xrgb8888.html

  * igt@kms_psr@psr2_no_drrs:
    - shard-iclb:         [SKIP][196] ([fdo#109441]) -> [PASS][197] +2 similar issues
   [196]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11668/shard-iclb6/igt@kms_psr@psr2_no_drrs.html
   [197]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-iclb2/igt@kms_psr@psr2_no_drrs.html

  * igt@kms_vblank@pipe-a-ts-continuation-modeset-rpm:
    - shard-iclb:         [SKIP][198] ([fdo#109278]) -> [PASS][199]
   [198]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11668/shard-iclb6/igt@kms_vblank@pipe-a-ts-continuation-modeset-rpm.html
   [199]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-iclb2/igt@kms_vblank@pipe-a-ts-continuation-modeset-rpm.html

  * igt@perf@stress-open-close:
    - shard-apl:          [DMESG-FAIL][200] -> [PASS][201]
   [200]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11668/shard-apl1/igt@perf@stress-open-close.html
   [201]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-apl3/igt@perf@stress-open-close.html

  * igt@prime_self_import@reimport-vs-gem_close-race:
    - shard-tglb:         [FAIL][202] -> [PASS][203]
   [202]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11668/shard-tglb3/igt@prime_self_import@reimport-vs-gem_close-race.html
   [203]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-tglb2/igt@prime_self_import@reimport-vs-gem_close-race.html

  
#### Warnings ####

  * igt@gem_eio@unwedge-stress:
    - shard-tglb:         [FAIL][204] ([i915#5784]) -> [TIMEOUT][205] ([i915#3063])
   [204]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11668/shard-tglb8/igt@gem_eio@unwedge-stress.html
   [205]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-tglb6/igt@gem_eio@unwedge-stress.html

  * igt@gem_exec_balancer@parallel:
    - shard-iclb:         [SKIP][206] ([i915#4525]) -> [DMESG-WARN][207] ([i915#5614])
   [206]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11668/shard-iclb5/igt@gem_exec_balancer@parallel.html
   [207]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-iclb1/igt@gem_exec_balancer@parallel.html

  * igt@gem_exec_balancer@parallel-out-fence:
    - shard-iclb:         [DMESG-WARN][208] ([i915#5614]) -> [SKIP][209] ([i915#4525]) +1 similar issue
   [208]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11668/shard-iclb4/igt@gem_exec_balancer@parallel-out-fence.html
   [209]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-iclb7/igt@gem_exec_balancer@parallel-out-fence.html

  * igt@gem_exec_fair@basic-none-rrul@rcs0:
    - shard-iclb:         [FAIL][210] ([i915#2852]) -> [FAIL][211] ([i915#2842])
   [210]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11668/shard-iclb7/igt@gem_exec_fair@basic-none-rrul@rcs0.html
   [211]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-iclb4/igt@gem_exec_fair@basic-none-rrul@rcs0.html

  * igt@i915_pm_dc@dc3co-vpb-simulation:
    - shard-iclb:         [SKIP][212] ([i915#588]) -> [SKIP][213] ([i915#658])
   [212]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11668/shard-iclb2/igt@i915_pm_dc@dc3co-vpb-simulation.html
   [213]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-iclb4/igt@i915_pm_dc@dc3co-vpb-simulation.html

  * igt@i915_pm_rpm@modeset-non-lpsp:
    - shard-iclb:         [SKIP][214] -> [SKIP][215] ([fdo#110892])
   [214]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11668/shard-iclb6/igt@i915_pm_rpm@modeset-non-lpsp.html
   [215]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-iclb2/igt@i915_pm_rpm@modeset-non-lpsp.html

  * igt@kms_ccs@pipe-d-ccs-on-another-bo-y_tiled_gen12_rc_ccs_cc:
    - shard-skl:          [SKIP][216] ([fdo#109271] / [i915#1888]) -> [SKIP][217] ([fdo#109271])
   [216]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11668/shard-skl8/igt@kms_ccs@pipe-d-ccs-on-another-bo-y_tiled_gen12_rc_ccs_cc.html
   [217]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-skl6/igt@kms_ccs@pipe-d-ccs-on-another-bo-y_tiled_gen12_rc_ccs_cc.html

  * igt@kms_chamelium@dp-audio:
    - shard-skl:          [SKIP][218] ([fdo#109271] / [fdo#111827] / [i915#1888]) -> [SKIP][219] ([fdo#109271] / [fdo#111827])
   [218]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11668/shard-skl1/igt@kms_chamelium@dp-audio.html
   [219]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-skl4/igt@kms_chamelium@dp-audio.html

  * igt@kms_cursor_legacy@cursora-vs-flipb-varying-size:
    - shard-skl:          [SKIP][220] ([fdo#109271]) -> [SKIP][221] ([fdo#109271] / [i915#1888])
   [220]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11668/shard-skl4/igt@kms_cursor_legacy@cursora-vs-flipb-varying-size.html
   [221]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-skl10/igt@kms_cursor_legacy@cursora-vs-flipb-varying-size.html

  * igt@kms_psr2_sf@overlay-plane-update-continuous-sf:
    - shard-iclb:         [SKIP][222] ([i915#2920]) -> [SKIP][223] ([fdo#111068] / [i915#658])
   [222]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11668/shard-iclb2/igt@kms_psr2_sf@overlay-plane-update-continuous-sf.html
   [223]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-iclb8/igt@kms_psr2_sf@overlay-plane-update-continuous-sf.html

  * igt@kms_psr2_sf@primary-plane-update-sf-dmg-area:
    - shard-iclb:         [SKIP][224] ([fdo#111068] / [i915#658]) -> [SKIP][225] ([i915#2920])
   [224]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11668/shard-iclb6/igt@kms_psr2_sf@primary-plane-update-sf-dmg-area.html
   [225]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-iclb2/igt@kms_psr2_sf@primary-plane-update-sf-dmg-area.html

  * igt@runner@aborted:
    - shard-kbl:          ([FAIL][226], [FAIL][227], [FAIL][228], [FAIL][229], [FAIL][230], [FAIL][231], [FAIL][232], [FAIL][233], [FAIL][234], [FAIL][235]) ([fdo#109271] / [i915#3002] / [i915#4312] / [i915#5257] / [i915#716]) -> ([FAIL][236], [FAIL][237], [FAIL][238], [FAIL][239], [FAIL][240], [FAIL][241], [FAIL][242], [FAIL][243], [FAIL][244]) ([i915#3002] / [i915#4312] / [i915#5257])
   [226]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11668/shard-kbl6/igt@runner@aborted.html
   [227]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11668/shard-kbl6/igt@runner@aborted.html
   [228]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11668/shard-kbl4/igt@runner@aborted.html
   [229]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11668/shard-kbl4/igt@runner@aborted.html
   [230]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11668/shard-kbl3/igt@runner@aborted.html
   [231]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11668/shard-kbl3/igt@runner@aborted.html
   [232]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11668/shard-kbl3/igt@runner@aborted.html
   [233]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11668/shard-kbl1/igt@runner@aborted.html
   [234]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11668/shard-kbl1/igt@runner@aborted.html
   [235]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11668/shard-kbl3/igt@runner@aborted.html
   [236]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-kbl1/igt@runner@aborted.html
   [237]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-kbl3/igt@runner@aborted.html
   [238]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-kbl3/igt@runner@aborted.html
   [239]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-kbl1/igt@runner@aborted.html
   [240]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-kbl1/igt@runner@aborted.html
   [241]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-kbl6/igt@runner@aborted.html
   [242]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-kbl6/igt@runner@aborted.html
   [243]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-kbl4/igt@runner@aborted.html
   [244]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-kbl6/igt@runner@aborted.html
    - shard-apl:          ([FAIL][245], [FAIL][246], [FAIL][247], [FAIL][248], [FAIL][249], [FAIL][250]) ([i915#180] / [i915#3002] / [i915#4312] / [i915#5257]) -> ([FAIL][251], [FAIL][252], [FAIL][253], [FAIL][254], [FAIL][255], [FAIL][256], [FAIL][257]) ([fdo#109271] / [i915#180] / [i915#3002] / [i915#4312] / [i915#5257])
   [245]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11668/shard-apl1/igt@runner@aborted.html
   [246]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11668/shard-apl7/igt@runner@aborted.html
   [247]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11668/shard-apl4/igt@runner@aborted.html
   [248]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11668/shard-apl7/igt@runner@aborted.html
   [249]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11668/shard-apl3/igt@runner@aborted.html
   [250]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11668/shard-apl1/igt@runner@aborted.html
   [251]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-apl3/igt@runner@aborted.html
   [252]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-apl1/igt@runner@aborted.html
   [253]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-apl3/igt@runner@aborted.html
   [254]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-apl4/igt@runner@aborted.html
   [255]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-apl8/igt@runner@aborted.html
   [256]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-apl4/igt@runner@aborted.html
   [257]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-apl7/igt@runner@aborted.html
    - shard-skl:          ([FAIL][258], [FAIL][259], [FAIL][260]) ([i915#3002] / [i915#4312] / [i915#5257]) -> ([FAIL][261], [FAIL][262], [FAIL][263], [FAIL][264], [FAIL][265]) ([i915#2029] / [i915#3002] / [i915#4312] / [i915#5257])
   [258]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11668/shard-skl1/igt@runner@aborted.html
   [259]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11668/shard-skl10/igt@runner@aborted.html
   [260]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11668/shard-skl7/igt@runner@aborted.html
   [261]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-skl10/igt@runner@aborted.html
   [262]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-skl3/igt@runner@aborted.html
   [263]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-skl4/igt@runner@aborted.html
   [264]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-skl10/igt@runner@aborted.html
   [265]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-skl6/igt@runner@aborted.html

  
  {name}: This element is suppressed. This means it is ignored when computing
          the status of the difference (SUCCESS, WARNING, or FAILURE).

  [fdo#108145]: https://bugs.freedesktop.org/show_bug.cgi?id=108145
  [fdo#109271]: https://bugs.freedesktop.org/show_bug.cgi?id=109271
  [fdo#109274]: https://bugs.freedesktop.org/show_bug.cgi?id=109274
  [fdo#109278]: https://bugs.freedesktop.org/show_bug.cgi?id=109278
  [fdo#109279]: https://bugs.freedesktop.org/show_bug.cgi?id=109279
  [fdo#109280]: https://bugs.freedesktop.org/show_bug.cgi?id=109280
  [fdo#109284]: https://bugs.freedesktop.org/show_bug.cgi?id=109284
  [fdo#109289]: https://bugs.freedesktop.org/show_bug.cgi?id=109289
  [fdo#109291]: https://bugs.freedesktop.org/show_bug.cgi?id=109291
  [fdo#109441]: https://bugs.freedesktop.org/show_bug.cgi?id=109441
  [fdo#109642]: https://bugs.freedesktop.org/show_bug.cgi?id=109642
  [fdo#110725]: https://bugs.freedesktop.org/show_bug.cgi?id=110725
  [fdo#110892]: https://bugs.freedesktop.org/show_bug.cgi?id=110892
  [fdo#111068]: https://bugs.freedesktop.org/show_bug.cgi?id=111068
  [fdo#111614]: https://bugs.freedesktop.org/show_bug.cgi?id=111614
  [fdo#111615]: https://bugs.freedesktop.org/show_bug.cgi?id=111615
  [fdo#111644]: https://bugs.freedesktop.org/show_bug.cgi?id=111644
  [fdo#111825]: https://bugs.freedesktop.org/show_bug.cgi?id=111825
  [fdo#111827]: https://bugs.freedesktop.org/show_bug.cgi?id=111827
  [i915#1319]: https://gitlab.freedesktop.org/drm/intel/issues/1319
  [i915#132]: https://gitlab.freedesktop.org/drm/intel/issues/132
  [i915#1397]: https://gitlab.freedesktop.org/drm/intel/issues/1397
  [i915#1731]: https://gitlab.freedesktop.org/drm/intel/issues/1731
  [i915#1769]: https://gitlab.freedesktop.org/drm/intel/issues/1769
  [i915#180]: https://gitlab.freedesktop.org/drm/intel/issues/180
  [i915#1888]: https://gitlab.freedesktop.org/drm/intel/issues/1888
  [i915#2029]: https://gitlab.freedesktop.org/drm/intel/issues/2029
  [i915#2105]: https://gitlab.freedesktop.org/drm/intel/issues/2105
  [i915#2346]: https://gitlab.freedesktop.org/drm/intel/issues/2346
  [i915#2411]: https://gitlab.freedesktop.org/drm/intel/issues/2411
  [i915#2437]: https://gitlab.freedesktop.org/drm/intel/issues/2437
  [i915#2527]: https://gitlab.freedesktop.org/drm/intel/issues/2527
  [i915#2530]: https://gitlab.freedesktop.org/drm/intel/issues/2530
  [i915#2587]: https://gitlab.freedesktop.org/drm/intel/issues/2587
  [i915#265]: https://gitlab.freedesktop.org/drm/intel/issues/265
  [i915#2681]: https://gitlab.freedesktop.org/drm/intel/issues/2681
  [i915#2684]: https://gitlab.freedesktop.org/drm/intel/issues/2684
  [i915#2842]: https://gitlab.freedesktop.org/drm/intel/issues/2842
  [i915#2849]: https://gitlab.freedesktop.org/drm/intel/issues/2849
  [i915#2852]: https://gitlab.freedesktop.org/drm/intel/issues/2852
  [i915#2856]: https://gitlab.freedesktop.org/drm/intel/issues/2856
  [i915#2867]: https://gitlab.freedesktop.org/drm/intel/issues/2867
  [i915#2920]: https://gitlab.freedesktop.org/drm/intel/issues/2920
  [i915#2994]: https://gitlab.freedesktop.org/drm/intel/issues/2994
  [i915#3002]: https://gitlab.freedesktop.org/drm/intel/issues/3002
  [i915#3063]: https://gitlab.freedesktop.org/drm/intel/issues/3063
  [i915#3070]: https://gitlab.freedesktop.org/drm/intel/issues/3070
  [i915#3297]: https://gitlab.freedesktop.org/drm/intel/issues/3297
  [i915#3301]: https://gitlab.freedesktop.org/drm/intel/issues/3301
  [i915#3318]: https://gitlab.freedesktop.org/drm/intel/issues/3318
  [i915#3323]: https://gitlab.freedesktop.org/drm/intel/issues/3323
  [i915#3354]: https://gitlab.freedesktop.org/drm/intel/issues/3354
  [i915#3359]: https://gitlab.freedesktop.org/drm/intel/issues/3359
  [i915#3467]: https://gitlab.freedesktop.org/drm/intel/issues/3467
  [i915#3536]: https://gitlab.freedesktop.org/drm/intel/issues/3536
  [i915#3591]: https://gitlab.freedesktop.org/drm/intel/issues/3591
  [i915#3614]: https://gitlab.freedesktop.org/drm/intel/issues/3614
  [i915#3689]: https://gitlab.freedesktop.org/drm/intel/issues/3689
  [i915#3701]: https://gitlab.freedesktop.org/drm/intel/issues/3701
  [i915#3886]: https://gitlab.freedesktop.org/drm/intel/issues/3886
  [i915#3921]: https://gitlab.freedesktop.org/drm/intel/issues/3921
  [i915#4055]: https://gitlab.freedesktop.org/drm/intel/issues/4055
  [i915#4270]: https://gitlab.freedesktop.org/drm/intel/issues/4270
  [i915#4281]: https://gitlab.freedesktop.org/drm/intel/issues/4281
  [i915#4312]: https://gitlab.freedesktop.org/drm/intel/issues/4312
  [i915#4386]: https://gitlab.freedesktop.org/drm/intel/issues/4386
  [i915#4525]: https://gitlab.freedesktop.org/drm/intel/issues/4525
  [i915#454]: https://gitlab.freedesktop.org/drm/intel/issues/454
  [i915#4613]: https://gitlab.freedesktop.org/drm/intel/issues/4613
  [i915#4767]: https://gitlab.freedesktop.org/drm/intel/issues/4767
  [i915#4839]: https://gitlab.freedesktop.org/drm/intel/issues/4839
  [i915#4939]: https://gitlab.freedesktop.org/drm/intel/issues/4939
  [i915#5072]: https://gitlab.freedesktop.org/drm/intel/issues/5072
  [i915#5098]: https://gitlab.freedesktop.org/drm/intel/issues/5098
  [i915#5161]: https://gitlab.freedesktop.org/drm/intel/issues/5161
  [i915#5176]: https://gitlab.freedesktop.org/drm/intel/issues/5176
  [i915#5235]: https://gitlab.freedesktop.org/drm/intel/issues/5235
  [i915#5257]: https://gitlab.freedesktop.org/drm/intel/issues/5257
  [i915#5286]: https://gitlab.freedesktop.org/drm/intel/issues/5286
  [i915#5287]: https://gitlab.freedesktop.org/drm/intel/issues/5287
  [i915#5327]: https://gitlab.freedesktop.org/drm/intel/issues/5327
  [i915#533]: https://gitlab.freedesktop.org/drm/intel/issues/533
  [i915#5566]: https://gitlab.freedesktop.org/drm/intel/issues/5566
  [i915#5591]: https://gitlab.freedesktop.org/drm/intel/issues/5591
  [i915#5614]: https://gitlab.freedesktop.org/drm/intel/issues/5614
  [i915#5691]: https://gitlab.freedesktop.org/drm/intel/issues/5691
  [i915#5784]: https://gitlab.freedesktop.org/drm/intel/issues/5784
  [i915#588]: https://gitlab.freedesktop.org/drm/intel/issues/588
  [i915#658]: https://gitlab.freedesktop.org/drm/intel/issues/658
  [i915#716]: https://gitlab.freedesktop.org/drm/intel/issues/716
  [i915#768]: https://gitlab.freedesktop.org/drm/intel/issues/768
  [i915#79]: https://gitlab.freedesktop.org/drm/intel/issues/79


Build changes
-------------

  * Linux: CI_DRM_11668 -> Patchwork_93447v3

  CI-20190529: 20190529
  CI_DRM_11668: 0aeb4ff42e2e9fd1dee49e6bb79cc81c8eafd3fc @ git://anongit.freedesktop.org/gfx-ci/linux
  IGT_6477: 70cfef35851891aeaa829f5e8dcb7fd43b454bde @ https://gitlab.freedesktop.org/drm/igt-gpu-tools.git
  Patchwork_93447v3: 0aeb4ff42e2e9fd1dee49e6bb79cc81c8eafd3fc @ git://anongit.freedesktop.org/gfx-ci/linux
  piglit_4509: fdc5a4ca11124ab8413c7988896eec4c97336694 @ git://anongit.freedesktop.org/piglit

== Logs ==

For more details see: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/index.html

[-- Attachment #2: Type: text/html, Size: 66688 bytes --]

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [RFC v3 1/3] drm/doc/rfc: VM_BIND feature design document
  2022-05-17 18:32   ` Niranjana Vishwanathapura
@ 2022-05-19 22:52     ` Zanoni, Paulo R
  -1 siblings, 0 replies; 121+ messages in thread
From: Zanoni, Paulo R @ 2022-05-19 22:52 UTC (permalink / raw)
  To: dri-devel, Vetter, Daniel, Vishwanathapura, Niranjana, intel-gfx
  Cc: Brost, Matthew, Hellstrom, Thomas, Wilson, Chris P, jason,
	christian.koenig

On Tue, 2022-05-17 at 11:32 -0700, Niranjana Vishwanathapura wrote:
> VM_BIND design document with description of intended use cases.
> 
> v2: Add more documentation and format as per review comments
>     from Daniel.
> 
> Signed-off-by: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>
> ---
> 
> diff --git a/Documentation/gpu/rfc/i915_vm_bind.rst b/Documentation/gpu/rfc/i915_vm_bind.rst
> new file mode 100644
> index 000000000000..f1be560d313c
> --- /dev/null
> +++ b/Documentation/gpu/rfc/i915_vm_bind.rst
> @@ -0,0 +1,304 @@
> +==========================================
> +I915 VM_BIND feature design and use cases
> +==========================================
> +
> +VM_BIND feature
> +================
> +DRM_I915_GEM_VM_BIND/UNBIND ioctls allow the UMD to bind/unbind GEM buffer
> +objects (BOs), or sections of a BO, at specified GPU virtual addresses on a
> +specified address space (VM). These mappings (also referred to as persistent
> +mappings) will be persistent across multiple GPU submissions (execbuff calls)
> +issued by the UMD, without the user having to provide a list of all required
> +mappings during each submission (as required by the older execbuff mode).
> +
> +VM_BIND/UNBIND ioctls will support 'in' and 'out' fences to allow userspace
> +to specify how the binding/unbinding should sync with other operations
> +like the GPU job submission. These fences will be timeline 'drm_syncobj's
> +for non-Compute contexts (See struct drm_i915_vm_bind_ext_timeline_fences).
> +For Compute contexts, they will be user/memory fences (See struct
> +drm_i915_vm_bind_ext_user_fence).
> +
> +The VM_BIND feature is advertised to the user via I915_PARAM_HAS_VM_BIND.
> +The user has to opt in to the VM_BIND mode of binding for an address space
> +(VM) at VM creation time via the I915_VM_CREATE_FLAGS_USE_VM_BIND extension.
> +
> +VM_BIND/UNBIND ioctl will immediately start binding/unbinding the mapping in an
> +async worker. The binding and unbinding will work like a special GPU engine.
> +The binding and unbinding operations are serialized and will wait on specified
> +input fences before the operation and will signal the output fences upon the
> +completion of the operation. Due to serialization, completion of an operation
> +will also indicate that all previous operations are also complete.
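
To make the above flow concrete, here is a rough userspace sketch.
DRM_IOCTL_I915_GEM_VM_CREATE and struct drm_i915_gem_vm_control are existing
uapi; the bind ioctl, struct and flag names are taken from the RFC header in
patch 3/3, so treat the exact layout as illustrative rather than final:

    /* Assumes <drm/i915_drm.h> plus the RFC's i915_vm_bind.h definitions. */
    struct drm_i915_gem_vm_control vm_ctl = {
            .flags = I915_VM_CREATE_FLAGS_USE_VM_BIND,  /* opt in to VM_BIND */
    };
    ioctl(fd, DRM_IOCTL_I915_GEM_VM_CREATE, &vm_ctl);

    struct drm_i915_gem_vm_bind bind = {
            .vm_id  = vm_ctl.vm_id,
            .handle = bo_handle,     /* GEM BO to map */
            .start  = gpu_va,        /* user-managed GPU virtual address */
            .offset = 0,             /* nonzero for partial binds */
            .length = bo_size,
            /* .extensions would chain the in/out fences (timeline syncobj
             * or user fence); the ioctl queues the bind to the async
             * worker and returns. */
    };
    ioctl(fd, DRM_IOCTL_I915_GEM_VM_BIND, &bind);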
> +
> +VM_BIND features include:
> +
> +* Multiple Virtual Address (VA) mappings can map to the same physical pages
> +  of an object (aliasing).
> +* VA mapping can map to a partial section of the BO (partial binding).
> +* Support capture of persistent mappings in the dump upon GPU error.
> +* TLB is flushed upon unbind completion. Batching of TLB flushes in some
> +  use cases will be helpful.
> +* Asynchronous vm_bind and vm_unbind support with 'in' and 'out' fences.
> +* Support for userptr gem objects (no special uapi is required for this).
> +
> +Execbuff ioctl in VM_BIND mode
> +-------------------------------
> +The execbuff ioctl handling in VM_BIND mode differs significantly from the
> +older method. A VM in VM_BIND mode will not support older execbuff mode of
> +binding. In VM_BIND mode, the execbuff ioctl will not accept any execlist;
> +hence, there is no support for implicit sync. It is expected that the below
> +work will be able
> +to support requirements of object dependency setting in all use cases:
> +
> +"dma-buf: Add an API for exporting sync files"
> +(https://lwn.net/Articles/859290/)

I would really like to have more details here. The link provided points
to new ioctls and we're not very familiar with those yet, so I think
you should really clarify the interaction between the new additions
here. Having some sample code would be really nice too.

For Mesa at least (and I believe for the other drivers too) we always
have a few exported buffers in every execbuf call, and we rely on the
implicit synchronization provided by execbuf to make sure everything
works. The execbuf ioctl also has some code to flush caches during
implicit synchronization AFAIR, so I would guess we rely on it too and
whatever else the Kernel does. Is that covered by the new ioctls?

In addition, as far as I remember, one of the big improvements of
vm_bind was that it would help reduce ioctl latency and cpu overhead.
But if making execbuf faster comes at the cost of requiring additional
ioctl calls for implicit synchronization, which is required on every
execbuf call, then I wonder if we'll even get any faster at all.
Comparing old execbuf vs plain new execbuf without the new required
ioctls won't make sense.

But maybe I'm wrong and we won't need to call these new ioctls around
every single execbuf ioctl we submit? Again, more clarification and
some code examples here would be really nice. This is a big change to
an important part of the API, so we should clarify the new expected usage.
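
For reference, the series linked above proposes roughly the following usage
(struct and ioctl names from that proposal; they may still change before
landing):

    /* Export the dma-buf's current implicit fences as a sync_file, e.g. to
     * wait for a compositor's reads before reusing the buffer: */
    struct dma_buf_export_sync_file exp = { .flags = DMA_BUF_SYNC_READ };
    ioctl(dmabuf_fd, DMA_BUF_IOCTL_EXPORT_SYNC_FILE, &exp);
    /* exp.fd can be waited on or passed to execbuf as an in-fence. */

    /* Import a sync_file into the dma-buf so implicit-sync consumers wait
     * for our rendering: */
    struct dma_buf_import_sync_file imp = {
            .flags = DMA_BUF_SYNC_WRITE,
            .fd    = render_done_fd,  /* e.g. an execbuf out-fence */
    };
    ioctl(dmabuf_fd, DMA_BUF_IOCTL_IMPORT_SYNC_FILE, &imp);

If that reading is right, it is one export/import pair per shared BO around
each submission, which is exactly the per-execbuf overhead questioned above.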

> +
> +This also means we need an execbuff extension to pass in the batch
> +buffer addresses (See struct drm_i915_gem_execbuffer_ext_batch_addresses).
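
For illustration, a VM_BIND mode submission might then look like this. The
extension struct is named in the RFC but its fields below are my guess;
I915_EXEC_USE_EXTENSIONS and the cliprects_ptr chaining are existing uapi:

    struct drm_i915_gem_execbuffer_ext_batch_addresses batch_ext = {
            .base.name = DRM_I915_GEM_EXECBUFFER_EXT_BATCH_ADDRESSES,
            /* ...array of batch GPU VAs, one per parallel engine... */
    };
    struct drm_i915_gem_execbuffer2 execbuf = {
            .buffers_ptr   = 0,  /* no execlist in VM_BIND mode */
            .buffer_count  = 0,
            .cliprects_ptr = (__u64)(uintptr_t)&batch_ext,
            .flags         = I915_EXEC_USE_EXTENSIONS,
            .rsvd1         = ctx_id,
    };
    ioctl(fd, DRM_IOCTL_I915_GEM_EXECBUFFER2, &execbuf);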
> +
> +If execlist support in the execbuff ioctl is deemed necessary for
> +implicit sync in certain use cases, then support can be added later.

IMHO we really need to sort this and check all the assumptions before
we commit to any interface. Again, implicit synchronization is
something we rely on during *every* execbuf ioctl for most workloads.


> +In VM_BIND mode, VA allocation is completely managed by the user instead of
> +the i915 driver. Hence, VA assignment and eviction are not applicable in
> +VM_BIND mode. Also, for determining object activeness, VM_BIND mode will not
> +be using the i915_vma active reference tracking. It will instead use the
> +dma-resv object for that (See `VM_BIND dma_resv usage`_).
> +
> +So, a lot of existing code in the execbuff path, like relocations, VA
> +eviction, the vma lookup table, implicit sync, vma active reference tracking
> +etc., is not applicable in VM_BIND mode. Hence, the execbuff path needs to be
> +cleaned up by clearly separating out the functionalities where the VM_BIND
> +mode differs from the older method, and they should be moved to separate files.

I seem to recall some conversations where we were told a bunch of
ioctls would stop working or make no sense to call when using vm_bind.
Can we please get a complete list of those? Bonus points if the Kernel
starts telling us we just called something that makes no sense.

> +
> +VM_PRIVATE objects
> +-------------------
> +By default, BOs can be mapped on multiple VMs and can also be dma-buf
> +exported. Hence these BOs are referred to as Shared BOs.
> +During each execbuff submission, the request fence must be added to the
> +dma-resv fence list of all shared BOs mapped on the VM.
> +
> +The VM_BIND feature introduces an optimization where the user can create a BO
> +which is private to a specified VM via the I915_GEM_CREATE_EXT_VM_PRIVATE flag
> +during BO creation. Unlike Shared BOs, these VM private BOs can only be mapped
> +on the VM they are private to and can't be dma-buf exported.
> +All private BOs of a VM share the dma-resv object. Hence during each execbuff
> +submission, they need only one dma-resv fence list updated. Thus, the fast
> +path (where required mappings are already bound) submission latency is O(1)
> +w.r.t the number of VM private BOs.
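
A sketch of the private-BO path as described (DRM_IOCTL_I915_GEM_CREATE_EXT
and the extension chaining are existing uapi; the VM-private extension is
named in the RFC, but its layout below is a guess):

    struct drm_i915_gem_create_ext_vm_private priv = {
            .base.name = I915_GEM_CREATE_EXT_VM_PRIVATE,
            .vm_id     = vm_id,  /* the only VM this BO may be bound in */
    };
    struct drm_i915_gem_create_ext create = {
            .size       = bo_size,
            .extensions = (__u64)(uintptr_t)&priv,
    };
    ioctl(fd, DRM_IOCTL_I915_GEM_CREATE_EXT, &create);
    /* create.handle is bindable only in vm_id, cannot be dma-buf exported,
     * and shares the VM's common dma-resv for O(1) execbuf fence updates. */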

I know we already discussed this, but just to document it publicly: the
ideal case for user space would be that every BO is created as private
but then we'd have an ioctl to convert it to non-private (without the
need to have a non-private->private interface).

An explanation on why we can't have an ioctl to mark as exported a
buffer that was previously vm_private would be really appreciated.

Thanks,
Paulo


> +
> +VM_BIND locking hierarchy
> +-------------------------
> +The locking design here supports the older (execlist based) execbuff mode, the
> +newer VM_BIND mode, the VM_BIND mode with GPU page faults and possible future
> +system allocator support (See `Shared Virtual Memory (SVM) support`_).
> +The older execbuff mode and the newer VM_BIND mode without page faults manage
> +residency of backing storage using dma_fence. The VM_BIND mode with page faults
> +and the system allocator support do not use any dma_fence at all.
> +
> +VM_BIND locking order is as below.
> +
> +1) Lock-A: A vm_bind mutex will protect vm_bind lists. This lock is taken in
> +   vm_bind/vm_unbind ioctl calls, in the execbuff path and while releasing the
> +   mapping.
> +
> +   In future, when GPU page faults are supported, we can potentially use a
> +   rwsem instead, so that multiple page fault handlers can take the read side
> +   lock to lookup the mapping and hence can run in parallel.
> +   The older execbuff mode of binding does not need this lock.
> +
> +2) Lock-B: The object's dma-resv lock will protect i915_vma state and needs to
> +   be held while binding/unbinding a vma in the async worker and while updating
> +   dma-resv fence list of an object. Note that private BOs of a VM will all
> +   share a dma-resv object.
> +
> +   The future system allocator support will use the HMM prescribed locking
> +   instead.
> +
> +3) Lock-C: Spinlock/s to protect some of the VM's lists like the list of
> +   invalidated vmas (due to eviction and userptr invalidation) etc.
> +
> +When GPU page faults are supported, the execbuff path does not take any of
> +these locks. There we will simply smash the new batch buffer address into the
> +ring and then tell the scheduler to run it. The lock taking only happens from the page
> +fault handler, where we take lock-A in read mode, whichever lock-B we need to
> +find the backing storage (dma_resv lock for gem objects, and hmm/core mm for
> +system allocator) and some additional locks (lock-D) for taking care of page
> +table races. Page fault mode should not need to ever manipulate the vm lists,
> +so won't ever need lock-C.
> +
> +VM_BIND LRU handling
> +---------------------
> +We need to ensure VM_BIND mapped objects are properly LRU tagged to avoid
> +performance degradation. We will also need support for bulk LRU movement of
> +VM_BIND objects to avoid additional latencies in execbuff path.
> +
> +The page table pages are similar to VM_BIND mapped objects (See
> +`Evictable page table allocations`_); they are maintained per VM and need to
> +be pinned in memory when the VM is made active (i.e., upon an execbuff call
> +with that VM). So, bulk LRU movement of page table pages is also needed.
> +
> +The i915 shrinker LRU has stopped being an LRU. So, it should also be moved
> +over to the ttm LRU in some fashion to make sure we once again have a reasonable
> +and consistent memory aging and reclaim architecture.
> +
> +VM_BIND dma_resv usage
> +-----------------------
> +Fences need to be added to all VM_BIND mapped objects. During each execbuff
> +submission, they are added with DMA_RESV_USAGE_BOOKKEEP usage to prevent
> +over sync (See enum dma_resv_usage). One can override it with either
> +DMA_RESV_USAGE_READ or DMA_RESV_USAGE_WRITE usage during object dependency
> +setting (either through explicit or implicit mechanism).
> +
> +When vm_bind is called for a non-private object while the VM is already
> +active, the fences need to be copied from the VM's shared dma-resv object
> +(common to all private objects of the VM) to this non-private object.
> +If this results in performance degradation, then some optimization will
> +be needed here. This is not a problem for the VM's private objects, as they
> +use the shared dma-resv object, which is always updated on each execbuff
> +submission.
> +
> +Also, in VM_BIND mode, use dma-resv apis for determining object activeness
> +(See dma_resv_test_signaled() and dma_resv_wait_timeout()) and do not use the
> +older i915_vma active reference tracking which is deprecated. This should be
> +easier to get working with the current TTM backend. We can remove the
> +i915_vma active reference tracking fully while supporting TTM backend for igfx.
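
Kernel side, my reading is that this boils down to the following pattern
(current dma-resv api signatures, shown purely for illustration):

    /* Per submission, with the object's (or the VM's shared) dma-resv
     * locked and fence slots reserved: BOOKKEEP usage avoids over-sync
     * by default; explicit or implicit sync can upgrade it to READ/WRITE. */
    dma_resv_add_fence(obj->base.resv, &rq->fence, DMA_RESV_USAGE_BOOKKEEP);

    /* Object activeness without i915_vma active reference tracking: */
    if (dma_resv_test_signaled(obj->base.resv, DMA_RESV_USAGE_BOOKKEEP))
            ; /* all fences signaled, the object is idle */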
> +
> +Evictable page table allocations
> +---------------------------------
> +Make pagetable allocations evictable and manage them similar to VM_BIND
> +mapped objects. Page table pages are similar to persistent mappings of a
> +VM (the differences here are that the page table pages will not have an
> +i915_vma structure, and after swapping pages back in, the parent page link
> +needs to be updated).
> +
> +Mesa use case
> +--------------
> +VM_BIND can potentially reduce the CPU overhead in Mesa (both Vulkan and Iris),
> +hence improving performance of CPU-bound applications. It also allows us to
> +implement Vulkan's Sparse Resources. With increasing GPU hardware performance,
> +reducing CPU overhead becomes more impactful.
> +
> +
> +VM_BIND Compute support
> +========================
> +
> +User/Memory Fence
> +------------------
> +The idea is to take a user specified virtual address and install an interrupt
> +handler to wake up the current task when the memory location passes the user
> +supplied filter. A user/memory fence is an <address, value> pair. To signal
> +the user fence, the specified value is written at the specified virtual
> +address, waking up the waiting process. The user can wait on a user fence
> +with the gem_wait_user_fence ioctl.
> +
> +It also allows the user to emit their own MI_FLUSH/PIPE_CONTROL notify
> +interrupt within their batches after updating the value to have sub-batch
> +precision on the wakeup. Each batch can signal a user fence to indicate
> +the completion of next level batch. The completion of very first level batch
> +needs to be signaled by the command streamer. The user must provide the
> +user/memory fence for this via the DRM_I915_GEM_EXECBUFFER_EXT_USER_FENCE
> +extension of execbuff ioctl, so that KMD can setup the command streamer to
> +signal it.
> +
> +User/Memory fence can also be supplied to the kernel driver to signal/wake up
> +the user process after completion of an asynchronous operation.
> +
> +When the VM_BIND ioctl is provided with a user/memory fence via the
> +I915_VM_BIND_EXT_USER_FENCE extension, it will be signaled upon the completion
> +of the binding of that mapping. All async binds/unbinds are serialized, hence
> +signaling of the user/memory fence also indicates the completion of all
> +previous binds/unbinds.
> +
> +This feature will be derived from the below original work:
> +https://patchwork.freedesktop.org/patch/349417/
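
In other words, something like the following (struct and flag names follow
the RFC header where they exist there; fence_addr, BIND_DONE and the exact
field layout are placeholders):

    /* Attach a user fence to a vm_bind: on bind completion the kernel
     * writes 'val' to 'addr' and wakes any waiter. */
    struct drm_i915_vm_bind_ext_user_fence ufence = {
            .base.name = I915_VM_BIND_EXT_USER_FENCE,
            .addr      = fence_addr,  /* qword-aligned user address */
            .val       = BIND_DONE,
    };
    /* ...chained into drm_i915_gem_vm_bind.extensions... */

    /* Block until *fence_addr == BIND_DONE: */
    struct drm_i915_gem_wait_user_fence wait = {
            .addr    = fence_addr,
            .op      = I915_UFENCE_WAIT_EQ,
            .value   = BIND_DONE,
            .timeout = timeout_ns,
    };
    ioctl(fd, DRM_IOCTL_I915_GEM_WAIT_USER_FENCE, &wait);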
> +
> +Long running Compute contexts
> +------------------------------
> +Usage of dma-fence expects that it completes in a reasonable amount of time.
> +Compute, on the other hand, can be long running. Hence it is appropriate for
> +compute to use user/memory fences, and dma-fence usage will be limited to
> +in-kernel consumption only. This requires an execbuff uapi extension to pass
> +in a user fence (See struct drm_i915_vm_bind_ext_user_fence). Compute must opt
> +in to this mechanism with the I915_CONTEXT_CREATE_FLAGS_LONG_RUNNING flag during
> +context creation. The dma-fence based user interfaces like gem_wait ioctl and
> +execbuff out fence are not allowed on long running contexts. Implicit sync is
> +not valid as well and is anyway not supported in VM_BIND mode.
> +
> +Where GPU page faults are not available, the kernel driver, upon buffer
> +invalidation, will initiate a suspend (preemption) of the long running context
> +with a dma-fence attached to it. Upon completion of that suspend fence, it will
> +finish the invalidation, revalidate the BO and then resume the compute context. This is
> +done by having a per-context preempt fence (also called suspend fence) proxying
> +as i915_request fence. This suspend fence is enabled when someone tries to wait
> +on it, which then triggers the context preemption.
> +
> +As this support for context suspension using a preempt fence, and the resume
> +work for the compute mode contexts, can get tricky to get right, it is better
> +to add this support in the drm scheduler so that multiple drivers can make use
> +of it.
> +That means, it will have a dependency on i915 drm scheduler conversion with GuC
> +scheduler backend. This should be fine, as the plan is to support compute mode
> +contexts only with GuC scheduler backend (at least initially). This is much
> +easier to support with VM_BIND mode compared to the current heavier execbuff
> +path resource attachment.
> +
> +Low Latency Submission
> +-----------------------
> +Allows compute UMD to directly submit GPU jobs instead of through execbuff
> +ioctl. This is made possible by VM_BIND not being synchronized against
> +execbuff. VM_BIND allows bind/unbind of mappings required for the directly
> +submitted jobs.
> +
> +Other VM_BIND use cases
> +========================
> +
> +Debugger
> +---------
> +With the debug event interface, a user space process (the debugger) is able
> +to keep track of and act upon resources created by another process (the one
> +being debugged) and attached to the GPU via the vm_bind interface.
> +
> +GPU page faults
> +----------------
> +GPU page faults, when supported (in the future), will only be available in
> +VM_BIND mode. While both the older execbuff mode and the newer VM_BIND mode of
> +binding will require using dma-fence to ensure residency, the GPU page faults
> +mode when supported, will not use any dma-fence as residency is purely managed
> +by installing and removing/invalidating page table entries.
> +
> +Page level hints settings
> +--------------------------
> +VM_BIND allows any hints setting per mapping instead of per BO.
> +Possible hints include read-only mapping, placement and atomicity.
> +Sub-BO level placement hint will be even more relevant with
> +upcoming GPU on-demand page fault support.
> +
> +Page level Cache/CLOS settings
> +-------------------------------
> +VM_BIND allows cache/CLOS settings per mapping instead of per BO.
> +
> +Shared Virtual Memory (SVM) support
> +------------------------------------
> +VM_BIND interface can be used to map system memory directly (without gem BO
> +abstraction) using the HMM interface. SVM is only supported with GPU page
> +faults enabled.
> +
> +
> +Broader i915 cleanups
> +=====================
> +Supporting this whole new vm_bind mode of binding, which comes with its own
> +use cases and locking requirements, requires proper integration with the
> +existing i915 driver. This calls for some broader i915 driver
> +cleanups/simplifications for maintainability of the driver going forward.
> +Here are a few things that have been identified and are being looked into.
> +
> +- Remove the vma lookup cache (eb->gem_context->handles_vma). The VM_BIND
> +  feature does not use it, and the complexity it brings in is probably more
> +  than the performance advantage we get in the legacy execbuff case.
> +- Remove vma->open_count counting.
> +- Remove i915_vma active reference tracking. The VM_BIND feature will not be
> +  using it. Instead, use the underlying BO's dma-resv fence list to determine
> +  whether an i915_vma is active or not.
> +
> +
> +VM_BIND UAPI
> +=============
> +
> +.. kernel-doc:: Documentation/gpu/rfc/i915_vm_bind.h
> diff --git a/Documentation/gpu/rfc/index.rst b/Documentation/gpu/rfc/index.rst
> index 91e93a705230..7d10c36b268d 100644
> --- a/Documentation/gpu/rfc/index.rst
> +++ b/Documentation/gpu/rfc/index.rst
> @@ -23,3 +23,7 @@ host such documentation:
>  .. toctree::
>  
>      i915_scheduler.rst
> +
> +.. toctree::
> +
> +    i915_vm_bind.rst


^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [Intel-gfx] [RFC v3 1/3] drm/doc/rfc: VM_BIND feature design document
@ 2022-05-19 22:52     ` Zanoni, Paulo R
  0 siblings, 0 replies; 121+ messages in thread
From: Zanoni, Paulo R @ 2022-05-19 22:52 UTC (permalink / raw)
  To: dri-devel, Vetter, Daniel, Vishwanathapura, Niranjana, intel-gfx
  Cc: Hellstrom, Thomas, Wilson, Chris P, christian.koenig

On Tue, 2022-05-17 at 11:32 -0700, Niranjana Vishwanathapura wrote:
> VM_BIND design document with description of intended use cases.
> 
> v2: Add more documentation and format as per review comments
>     from Daniel.
> 
> Signed-off-by: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>
> ---
> 
> diff --git a/Documentation/gpu/rfc/i915_vm_bind.rst b/Documentation/gpu/rfc/i915_vm_bind.rst
> new file mode 100644
> index 000000000000..f1be560d313c
> --- /dev/null
> +++ b/Documentation/gpu/rfc/i915_vm_bind.rst
> @@ -0,0 +1,304 @@
> +==========================================
> +I915 VM_BIND feature design and use cases
> +==========================================
> +
> +VM_BIND feature
> +================
> +DRM_I915_GEM_VM_BIND/UNBIND ioctls allow UMDs to bind/unbind GEM buffer
> +objects (BOs) or sections of a BO at specified GPU virtual addresses on a
> +specified address space (VM). These mappings (also referred to as persistent
> +mappings) will be persistent across multiple GPU submissions (execbuff calls)
> +issued by the UMD, without the user having to provide a list of all required
> +mappings during each submission (as required by the older execbuff mode).
> +
> +VM_BIND/UNBIND ioctls will support 'in' and 'out' fences to allow userspace
> +to specify how the binding/unbinding should sync with other operations
> +like the GPU job submission. These fences will be timeline 'drm_syncobj's
> +for non-Compute contexts (See struct drm_i915_vm_bind_ext_timeline_fences).
> +For Compute contexts, they will be user/memory fences (See struct
> +drm_i915_vm_bind_ext_user_fence).
> +
> +VM_BIND feature is advertised to user via I915_PARAM_HAS_VM_BIND.
> +User has to opt-in for VM_BIND mode of binding for an address space (VM)
> +during VM creation time via I915_VM_CREATE_FLAGS_USE_VM_BIND extension.
> +
> +VM_BIND/UNBIND ioctl will immediately start binding/unbinding the mapping in an
> +async worker. The binding and unbinding will work like a special GPU engine.
> +The binding and unbinding operations are serialized and will wait on specified
> +input fences before the operation and will signal the output fences upon the
> +completion of the operation. Due to serialization, completion of an operation
> +will also indicate that all previous operations are also complete.
> +
> +VM_BIND features include:
> +
> +* Multiple Virtual Address (VA) mappings can map to the same physical pages
> +  of an object (aliasing).
> +* VA mapping can map to a partial section of the BO (partial binding).
> +* Support capture of persistent mappings in the dump upon GPU error.
> +* TLB is flushed upon unbind completion. Batching of TLB flushes in some
> +  use cases will be helpful.
> +* Asynchronous vm_bind and vm_unbind support with 'in' and 'out' fences.
> +* Support for userptr gem objects (no special uapi is required for this).
> +
> +Execbuff ioctl in VM_BIND mode
> +-------------------------------
> +The execbuff ioctl handling in VM_BIND mode differs significantly from the
> +older method. A VM in VM_BIND mode will not support older execbuff mode of
> +binding. In VM_BIND mode, execbuff ioctl will not accept any execlist. Hence,
> +no support for implicit sync. It is expected that the below work will be able
> +to support requirements of object dependency setting in all use cases:
> +
> +"dma-buf: Add an API for exporting sync files"
> +(https://lwn.net/Articles/859290/)

I would really like to have more details here. The link provided points
to new ioctls and we're not very familiar with those yet, so I think
you should really clarify the interaction between the new additions
here. Having some sample code would be really nice too.

For Mesa at least (and I believe for the other drivers too) we always
have a few exported buffers in every execbuf call, and we rely on the
implicit synchronization provided by execbuf to make sure everything
works. The execbuf ioctl also has some code to flush caches during
implicit synchronization AFAIR, so I would guess we rely on it too and
whatever else the Kernel does. Is that covered by the new ioctls?

In addition, as far as I remember, one of the big improvements of
vm_bind was that it would help reduce ioctl latency and cpu overhead.
But if making execbuf faster comes at the cost of requiring additional
ioctl calls for implicit synchronization, which is required on every
execbuf call, then I wonder if we'll even get any faster at all.
Comparing old execbuf vs plain new execbuf without the new required
ioctls won't make sense.

But maybe I'm wrong and we won't need to call these new ioctls around
every single execbuf ioctl we submit? Again, more clarification and
some code examples here would be really nice. This is a big change on
an important part of the API, we should clarify the new expected usage.
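
For instance, is the expected flow something like the following sketch?
(hypothetical, assuming the export/import sync file ioctls from the linked
series; error handling omitted)

  /* Before submit: turn the BO's implicit fences into an explicit
   * in-fence for the execbuf. */
  struct dma_buf_export_sync_file export = {
          .flags = DMA_BUF_SYNC_WRITE,    /* wait for pending writers */
          .fd = -1,
  };
  ioctl(dmabuf_fd, DMA_BUF_IOCTL_EXPORT_SYNC_FILE, &export);
  /* ... pass export.fd to execbuf via I915_EXEC_FENCE_IN ... */

  /* After submit: make the batch's out-fence visible to other
   * implicit-sync users of the BO (e.g. the compositor). */
  struct dma_buf_import_sync_file import = {
          .flags = DMA_BUF_SYNC_WRITE,
          .fd = out_fence_fd,
  };
  ioctl(dmabuf_fd, DMA_BUF_IOCTL_IMPORT_SYNC_FILE, &import);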

> +
> +This also means, we need an execbuff extension to pass in the batch
> +buffer addresses (See struct drm_i915_gem_execbuffer_ext_batch_addresses).
> +
> +If at all execlist support in execbuff ioctl is deemed necessary for
> +implicit sync in certain use cases, then support can be added later.

IMHO we really need to sort this and check all the assumptions before
we commit to any interface. Again, implicit synchronization is
something we rely on during *every* execbuf ioctl for most workloads.


> +In VM_BIND mode, VA allocation is completely managed by the user instead of
> +the i915 driver. Hence, VA assignment and eviction are not applicable in
> +VM_BIND mode. Also, for determining object activeness, VM_BIND mode will not
> +be using the i915_vma active reference tracking. It will instead use dma-resv
> +object for that (See `VM_BIND dma_resv usage`_).
> +
> +So, a lot of existing code in the execbuff path like relocations, VA evictions,
> +vma lookup table, implicit sync, vma active reference tracking etc., are not
> +applicable in VM_BIND mode. Hence, the execbuff path needs to be cleaned up
> +by clearly separating out the functionalities where the VM_BIND mode differs
> +from the older method, and they should be moved to separate files.

I seem to recall some conversations where we were told a bunch of
ioctls would stop working or make no sense to call when using vm_bind.
Can we please get a complete list of those? Bonus points if the Kernel
starts telling us we just called something that makes no sense.

> +
> +VM_PRIVATE objects
> +-------------------
> +By default, BOs can be mapped on multiple VMs and can also be dma-buf
> +exported. Hence these BOs are referred to as Shared BOs.
> +During each execbuff submission, the request fence must be added to the
> +dma-resv fence list of all shared BOs mapped on the VM.
> +
> +VM_BIND feature introduces an optimization where the user can create a BO which
> +is private to a specified VM via I915_GEM_CREATE_EXT_VM_PRIVATE flag during
> +BO creation. Unlike Shared BOs, these VM private BOs can only be mapped on
> +the VM they are private to and can't be dma-buf exported.
> +All private BOs of a VM share the dma-resv object. Hence during each execbuff
> +submission, they need only one dma-resv fence list updated. Thus, the fast
> +path (where required mappings are already bound) submission latency is O(1)
> +w.r.t the number of VM private BOs.

I know we already discussed this, but just to document it publicly: the
ideal case for user space would be that every BO is created as private
but then we'd have an ioctl to convert it to non-private (without the
need to have a non-private->private interface).

An explanation on why we can't have an ioctl to mark as exported a
buffer that was previously vm_private would be really appreciated.

Thanks,
Paulo


> +
> +VM_BIND locking hierarchy
> +-------------------------
> +The locking design here supports the older (execlist based) execbuff mode, the
> +newer VM_BIND mode, the VM_BIND mode with GPU page faults and possible future
> +system allocator support (See `Shared Virtual Memory (SVM) support`_).
> +The older execbuff mode and the newer VM_BIND mode without page faults manage
> +residency of backing storage using dma_fence. The VM_BIND mode with page faults
> +and the system allocator support do not use any dma_fence at all.
> +
> +VM_BIND locking order is as below.
> +
> +1) Lock-A: A vm_bind mutex will protect vm_bind lists. This lock is taken in
> +   vm_bind/vm_unbind ioctl calls, in the execbuff path and while releasing the
> +   mapping.
> +
> +   In future, when GPU page faults are supported, we can potentially use a
> +   rwsem instead, so that multiple page fault handlers can take the read side
> +   lock to lookup the mapping and hence can run in parallel.
> +   The older execbuff mode of binding does not need this lock.
> +
> +2) Lock-B: The object's dma-resv lock will protect i915_vma state and needs to
> +   be held while binding/unbinding a vma in the async worker and while updating
> +   dma-resv fence list of an object. Note that private BOs of a VM will all
> +   share a dma-resv object.
> +
> +   The future system allocator support will use the HMM prescribed locking
> +   instead.
> +
> +3) Lock-C: Spinlock/s to protect some of the VM's lists like the list of
> +   invalidated vmas (due to eviction and userptr invalidation) etc.
> +
> +When GPU page faults are supported, the execbuff path does not take any of
> +these locks. There we will simply smash the new batch buffer address into
> +the ring and then tell the scheduler to run that. The lock taking only
> +happens from the page
> +fault handler, where we take lock-A in read mode, whichever lock-B we need to
> +find the backing storage (dma_resv lock for gem objects, and hmm/core mm for
> +system allocator) and some additional locks (lock-D) for taking care of page
> +table races. Page fault mode should not need to ever manipulate the vm lists,
> +so won't ever need lock-C.
> +
> +VM_BIND LRU handling
> +---------------------
> +We need to ensure VM_BIND mapped objects are properly LRU tagged to avoid
> +performance degradation. We will also need support for bulk LRU movement of
> +VM_BIND objects to avoid additional latencies in execbuff path.
> +
> +The page table pages are similar to VM_BIND mapped objects (See
> +`Evictable page table allocations`_) and are maintained per VM and need to
> +be pinned in memory when VM is made active (ie., upon an execbuff call with
> +that VM). So, bulk LRU movement of page table pages is also needed.
> +
> +The i915 shrinker LRU has stopped being an LRU. So, it should also be moved
> +over to the ttm LRU in some fashion to make sure we once again have a reasonable
> +and consistent memory aging and reclaim architecture.
> +
> +VM_BIND dma_resv usage
> +-----------------------
> +Fences needs to be added to all VM_BIND mapped objects. During each execbuff
> +submission, they are added with DMA_RESV_USAGE_BOOKKEEP usage to prevent
> +over sync (See enum dma_resv_usage). One can override it with either
> +DMA_RESV_USAGE_READ or DMA_RESV_USAGE_WRITE usage during object dependency
> +setting (either through explicit or implicit mechanism).
> +
> +When vm_bind is called for a non-private object while the VM is already
> +active, the fences need to be copied from VM's shared dma-resv object
> +(common to all private objects of the VM) to this non-private object.
> +If this results in performance degradation, then some optimization will
> +be needed here. This is not a problem for VM's private objects as they use
> +shared dma-resv object which is always updated on each execbuff submission.
> +
> +Also, in VM_BIND mode, use dma-resv apis for determining object activeness
> +(See dma_resv_test_signaled() and dma_resv_wait_timeout()) and do not use the
> +older i915_vma active reference tracking which is deprecated. This should be
> +easier to get working with the current TTM backend. We can remove the
> +i915_vma active reference tracking fully while supporting TTM backend for igfx.
> +
> +Evictable page table allocations
> +---------------------------------
> +Make pagetable allocations evictable and manage them similar to VM_BIND
> +mapped objects. Page table pages are similar to persistent mappings of a
> +VM (difference here are that the page table pages will not have an i915_vma
> +structure and after swapping pages back in, parent page link needs to be
> +updated).
> +
> +Mesa use case
> +--------------
> +VM_BIND can potentially reduce the CPU overhead in Mesa (both Vulkan and Iris),
> +hence improving performance of CPU-bound applications. It also allows us to
> +implement Vulkan's Sparse Resources. With increasing GPU hardware performance,
> +reducing CPU overhead becomes more impactful.
> +
> +
> +VM_BIND Compute support
> +========================
> +
> +User/Memory Fence
> +------------------
> +The idea is to take a user specified virtual address and install an interrupt
> +handler to wake up the current task when the memory location passes the user
> +supplied filter. User/Memory fence is a <address, value> pair. To signal the
> +user fence, the specified value will be written at the specified virtual
> +address and the waiting process will be woken up. A user can wait on a user
> +fence with the
> +gem_wait_user_fence ioctl.
> +
> +It also allows the user to emit their own MI_FLUSH/PIPE_CONTROL notify
> +interrupt within their batches after updating the value to have sub-batch
> +precision on the wakeup. Each batch can signal a user fence to indicate
> +the completion of next level batch. The completion of very first level batch
> +needs to be signaled by the command streamer. The user must provide the
> +user/memory fence for this via the DRM_I915_GEM_EXECBUFFER_EXT_USER_FENCE
> +extension of execbuff ioctl, so that KMD can setup the command streamer to
> +signal it.
> +
> +User/Memory fence can also be supplied to the kernel driver to signal/wake up
> +the user process after completion of an asynchronous operation.
> +
> +When the VM_BIND ioctl is provided with a user/memory fence via the
> +I915_VM_BIND_EXT_USER_FENCE extension, it will be signaled upon the completion
> +of binding of that mapping. All async binds/unbinds are serialized, hence
> +signaling of the user/memory fence also indicates the completion of all
> +previous binds/unbinds.
> +
> +This feature will be derived from the below original work:
> +https://patchwork.freedesktop.org/patch/349417/
> +
> +Long running Compute contexts
> +------------------------------
> +Usage of dma-fences expects that they complete in a reasonable amount of time.
> +Compute on the other hand can be long running. Hence it is appropriate for
> +compute to use user/memory fences, with dma-fence usage limited to in-kernel
> +consumption only. This requires an execbuff uapi extension to pass in user
> +fences (See struct drm_i915_vm_bind_ext_user_fence). Compute must opt in to
> +this mechanism with the I915_CONTEXT_CREATE_FLAGS_LONG_RUNNING flag during
> +context creation. The dma-fence based user interfaces like the gem_wait ioctl
> +and execbuff out fences are not allowed on long running contexts. Implicit
> +sync is not valid either, and is anyway not supported in VM_BIND mode.
> +
> +Where GPU page faults are not available, the kernel driver will, upon buffer
> +invalidation, initiate a suspend (preemption) of the long running context with
> +a dma-fence attached to it. Upon completion of that suspend fence, it will
> +finish the invalidation, revalidate the BO and then resume the compute
> +context. This is done by having a per-context preempt fence (also called
> +suspend fence) proxying as the i915_request fence. This suspend fence is
> +enabled when someone tries to wait on it, which then triggers the context
> +preemption.
> +
> +As this support for context suspension using a preempt fence, plus the resume
> +work for compute mode contexts, can be tricky to get right, it is better to
> +add this support to the drm scheduler so that multiple drivers can make use
> +of it. That means it will have a dependency on the i915 drm scheduler
> +conversion with the GuC scheduler backend. This should be fine, as the plan
> +is to support compute mode contexts only with the GuC scheduler backend (at
> +least initially). This is much easier to support with VM_BIND mode compared
> +to the current heavier execbuff path resource attachment.
> +
> +Low Latency Submission
> +-----------------------
> +Allows compute UMDs to submit GPU jobs directly instead of going through the
> +execbuff ioctl. This is made possible by VM_BIND not being synchronized
> +against execbuff. VM_BIND allows bind/unbind of the mappings required for
> +the directly submitted jobs.
> +
> +Other VM_BIND use cases
> +========================
> +
> +Debugger
> +---------
> +With the debug event interface, a user space process (the debugger) is able
> +to keep track of and act upon resources created by another process (the
> +debuggee) and attached to the GPU via the vm_bind interface.
> +
> +GPU page faults
> +----------------
> +GPU page faults, when supported (in the future), will only be available in
> +VM_BIND mode. While both the older execbuff mode and the newer VM_BIND mode
> +of binding require using dma-fences to ensure residency, the GPU page fault
> +mode, when supported, will not use any dma-fence, as residency is purely
> +managed by installing and removing/invalidating page table entries.
> +
> +Page level hints settings
> +--------------------------
> +VM_BIND allows hints to be set per mapping instead of per BO.
> +Possible hints include read-only mapping, placement and atomicity.
> +Sub-BO level placement hints will be even more relevant with
> +upcoming GPU on-demand page fault support.
> +
> +Page level Cache/CLOS settings
> +-------------------------------
> +VM_BIND allows cache/CLOS settings per mapping instead of per BO.
> +
> +Shared Virtual Memory (SVM) support
> +------------------------------------
> +VM_BIND interface can be used to map system memory directly (without gem BO
> +abstraction) using the HMM interface. SVM is only supported with GPU page
> +faults enabled.
> +
> +
> +Broader i915 cleanups
> +=====================
> +Supporting this whole new vm_bind mode of binding, which comes with its own
> +use cases and locking requirements, requires proper integration with the
> +existing i915 driver. This calls for some broader i915 driver
> +cleanups/simplifications to keep the driver maintainable going forward.
> +Here are a few things that have been identified and are being looked into.
> +
> +- Remove the vma lookup cache (eb->gem_context->handles_vma). The VM_BIND
> +  feature does not use it, and the complexity it brings in is probably more
> +  than the performance advantage we get in the legacy execbuff case.
> +- Remove vma->open_count counting.
> +- Remove i915_vma active reference tracking. The VM_BIND feature will not be
> +  using it. Instead, use the underlying BO's dma-resv fence list to determine
> +  whether an i915_vma is active or not.
> +
> +
> +VM_BIND UAPI
> +=============
> +
> +.. kernel-doc:: Documentation/gpu/rfc/i915_vm_bind.h
> diff --git a/Documentation/gpu/rfc/index.rst b/Documentation/gpu/rfc/index.rst
> index 91e93a705230..7d10c36b268d 100644
> --- a/Documentation/gpu/rfc/index.rst
> +++ b/Documentation/gpu/rfc/index.rst
> @@ -23,3 +23,7 @@ host such documentation:
>  .. toctree::
>  
>      i915_scheduler.rst
> +
> +.. toctree::
> +
> +    i915_vm_bind.rst


^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [Intel-gfx] [RFC v3 3/3] drm/doc/rfc: VM_BIND uapi definition
  2022-05-17 18:32   ` Niranjana Vishwanathapura
  (?)
@ 2022-05-19 23:07   ` Zanoni, Paulo R
  2022-05-23 19:19     ` Niranjana Vishwanathapura
  -1 siblings, 1 reply; 121+ messages in thread
From: Zanoni, Paulo R @ 2022-05-19 23:07 UTC (permalink / raw)
  To: dri-devel, Vetter, Daniel, Vishwanathapura, Niranjana, intel-gfx
  Cc: Hellstrom, Thomas, christian.koenig, Wilson, Chris P

On Tue, 2022-05-17 at 11:32 -0700, Niranjana Vishwanathapura wrote:
> VM_BIND and related uapi definitions
> 
> v2: Ensure proper kernel-doc formatting with cross references.
>     Also add new uapi and documentation as per review comments
>     from Daniel.
> 
> Signed-off-by: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>
> ---
>  Documentation/gpu/rfc/i915_vm_bind.h | 399 +++++++++++++++++++++++++++
>  1 file changed, 399 insertions(+)
>  create mode 100644 Documentation/gpu/rfc/i915_vm_bind.h
> 
> diff --git a/Documentation/gpu/rfc/i915_vm_bind.h b/Documentation/gpu/rfc/i915_vm_bind.h
> new file mode 100644
> index 000000000000..589c0a009107
> --- /dev/null
> +++ b/Documentation/gpu/rfc/i915_vm_bind.h
> @@ -0,0 +1,399 @@
> +/* SPDX-License-Identifier: MIT */
> +/*
> + * Copyright © 2022 Intel Corporation
> + */
> +
> +/**
> + * DOC: I915_PARAM_HAS_VM_BIND
> + *
> + * VM_BIND feature availability.
> + * See typedef drm_i915_getparam_t param.
> + */
> +#define I915_PARAM_HAS_VM_BIND		57
> +
> +/**
> + * DOC: I915_VM_CREATE_FLAGS_USE_VM_BIND
> + *
> + * Flag to opt-in for VM_BIND mode of binding during VM creation.
> + * See struct drm_i915_gem_vm_control flags.
> + *
> + * A VM in VM_BIND mode will not support the older execbuff mode of binding.
> + * In VM_BIND mode, execbuff ioctl will not accept any execlist (ie., the
> + * &drm_i915_gem_execbuffer2.buffer_count must be 0).
> + * Also, &drm_i915_gem_execbuffer2.batch_start_offset and
> + * &drm_i915_gem_execbuffer2.batch_len must be 0.
> + * DRM_I915_GEM_EXECBUFFER_EXT_BATCH_ADDRESSES extension must be provided
> + * to pass in the batch buffer addresses.
> + *
> + * Additionally, I915_EXEC_NO_RELOC, I915_EXEC_HANDLE_LUT and
> + * I915_EXEC_BATCH_FIRST of &drm_i915_gem_execbuffer2.flags must be 0
> + * (not used) in VM_BIND mode. I915_EXEC_USE_EXTENSIONS flag must always be
> + * set (See struct drm_i915_gem_execbuffer_ext_batch_addresses).
> + * The buffers_ptr, buffer_count, batch_start_offset and batch_len fields
> + * of struct drm_i915_gem_execbuffer2 are also not used and must be 0.
> + */

From that description, it seems we have:

struct drm_i915_gem_execbuffer2 {
	__u64 buffers_ptr;		-> must be 0 (new)
	__u32 buffer_count;		-> must be 0 (new)
	__u32 batch_start_offset;	-> must be 0 (new)
	__u32 batch_len;		-> must be 0 (new)
	__u32 DR1;			-> must be 0 (old)
	__u32 DR4;			-> must be 0 (old)
	__u32 num_cliprects; (fences)	-> must be 0 since using extensions
	__u64 cliprects_ptr; (fences, extensions) -> contains an actual pointer!
	__u64 flags;			-> some flags must be 0 (new)
	__u64 rsvd1; (context info)	-> repurposed field (old)
	__u64 rsvd2;			-> unused
};

Based on that, why can't we just get drm_i915_gem_execbuffer3 instead
of adding even more complexity to an already abused interface? While
the Vulkan-like extension thing is really nice, I don't think what
we're doing here is extending the ioctl usage, we're completely
changing how the base struct should be interpreted based on how the VM
was created (which is an entirely different ioctl).

From Rusty Russel's API Design grading, drm_i915_gem_execbuffer2 is
already at -6 without these changes. I think after vm_bind we'll need
to create a -11 entry just to deal with this ioctl.
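
To make that concrete, a purely hypothetical sketch of what a leaner
drm_i915_gem_execbuffer3 could look like (nothing like this exists in the
series; it only illustrates the suggestion):

  struct drm_i915_gem_execbuffer3 {
          __u32 ctx_id;           /* previously smuggled through rsvd1 */
          __u32 engine_idx;
          __u64 batch_address;    /* single batch VA, no execlist */
          __u64 flags;
          __u64 extensions;       /* fences etc. via an extension chain */
  };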


+#define I915_VM_CREATE_FLAGS_USE_VM_BIND	(1 << 0)
+
+/**
+ * DOC: I915_CONTEXT_CREATE_FLAGS_LONG_RUNNING
+ *
+ * Flag to declare context as long running.
+ * See struct drm_i915_gem_context_create_ext flags.
+ *
+ * Usage of dma-fences expects that they complete in a reasonable amount of
+ * time. Compute on the other hand can be long running. Hence it is not
+ * appropriate for compute contexts to export request completion dma-fences
+ * to the user. The dma-fence usage will be limited to in-kernel consumption
+ * only. Compute contexts need to use user/memory fences.
+ *
+ * So, long running contexts do not support output fences. Hence,
+ * I915_EXEC_FENCE_OUT (See &drm_i915_gem_execbuffer2.flags) and
+ * I915_EXEC_FENCE_SIGNAL (See &drm_i915_gem_exec_fence.flags) are expected
+ * not to be used.
+ *
+ * DRM_I915_GEM_WAIT ioctl call is also not supported for objects mapped
+ * to long running contexts.
+ */
+#define I915_CONTEXT_CREATE_FLAGS_LONG_RUNNING   (1u << 2)
+
+/* VM_BIND related ioctls */
+#define DRM_I915_GEM_VM_BIND		0x3d
+#define DRM_I915_GEM_VM_UNBIND		0x3e
+#define DRM_I915_GEM_WAIT_USER_FENCE	0x3f
+
+#define DRM_IOCTL_I915_GEM_VM_BIND		DRM_IOWR(DRM_COMMAND_BASE + DRM_I915_GEM_VM_BIND, struct drm_i915_gem_vm_bind)
+#define DRM_IOCTL_I915_GEM_VM_UNBIND		DRM_IOWR(DRM_COMMAND_BASE + DRM_I915_GEM_VM_UNBIND, struct drm_i915_gem_vm_unbind)
+#define DRM_IOCTL_I915_GEM_WAIT_USER_FENCE	DRM_IOWR(DRM_COMMAND_BASE + DRM_I915_GEM_WAIT_USER_FENCE, struct drm_i915_gem_wait_user_fence)
+
+/**
+ * struct drm_i915_gem_vm_bind - VA to object mapping to bind.
+ *
+ * This structure is passed to VM_BIND ioctl and specifies the mapping of GPU
+ * virtual address (VA) range to the section of an object that should be bound
+ * in the device page table of the specified address space (VM).
+ * The VA range specified must be unique (ie., not currently bound) and can
+ * be mapped to whole object or a section of the object (partial binding).
+ * Multiple VA mappings can be created to the same section of the object
+ * (aliasing).
+ */
+struct drm_i915_gem_vm_bind {
+	/** @vm_id: VM (address space) id to bind */
+	__u32 vm_id;
+
+	/** @handle: Object handle */
+	__u32 handle;
+
+	/** @start: Virtual Address start to bind */
+	__u64 start;
+
+	/** @offset: Offset in object to bind */
+	__u64 offset;
+
+	/** @length: Length of mapping to bind */
+	__u64 length;
+
+	/**
+	 * @flags: Supported flags are,
+	 *
+	 * I915_GEM_VM_BIND_READONLY:
+	 * Mapping is read-only.
+	 *
+	 * I915_GEM_VM_BIND_CAPTURE:
+	 * Capture this mapping in the dump upon GPU error.
+	 */
+	__u64 flags;
+#define I915_GEM_VM_BIND_READONLY    (1 << 0)
+#define I915_GEM_VM_BIND_CAPTURE     (1 << 1)
+
+	/** @extensions: 0-terminated chain of extensions for this mapping. */
+	__u64 extensions;
+};
+
+/**
+ * struct drm_i915_gem_vm_unbind - VA to object mapping to unbind.
+ *
+ * This structure is passed to VM_UNBIND ioctl and specifies the GPU virtual
+ * address (VA) range that should be unbound from the device page table of the
+ * specified address space (VM). The specified VA range must match one of the
+ * mappings created with the VM_BIND ioctl. TLB is flushed upon unbind
+ * completion.
+ */
+struct drm_i915_gem_vm_unbind {
+	/** @vm_id: VM (address space) id to unbind */
+	__u32 vm_id;
+
+	/** @rsvd: Reserved for future use; must be zero. */
+	__u32 rsvd;
+
+	/** @start: Virtual Address start to unbind */
+	__u64 start;
+
+	/** @length: Length of mapping to unbind */
+	__u64 length;
+
+	/** @flags: reserved for future usage, currently MBZ */
+	__u64 flags;
+
+	/** @extensions: 0-terminated chain of extensions for this mapping. */
+	__u64 extensions;
+};
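
A minimal usage sketch for these two ioctls (illustrative only; vm_id,
handle and bo_size are assumed to come from earlier VM_CREATE/GEM_CREATE
calls, and error handling is omitted):

  struct drm_i915_gem_vm_bind bind = {
          .vm_id = vm_id,
          .handle = handle,
          .start = 0x100000,      /* VA chosen by the UMD */
          .offset = 0,
          .length = bo_size,      /* whole-BO binding */
          .flags = I915_GEM_VM_BIND_CAPTURE,
  };
  ioctl(fd, DRM_IOCTL_I915_GEM_VM_BIND, &bind);

  /* ... submissions using the mapping ... */

  struct drm_i915_gem_vm_unbind unbind = {
          .vm_id = vm_id,
          .start = 0x100000,      /* must match the mapping */
          .length = bo_size,
  };
  ioctl(fd, DRM_IOCTL_I915_GEM_VM_UNBIND, &unbind);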
+
+/**
+ * struct drm_i915_vm_bind_fence - An input or output fence for the vm_bind
+ * or the vm_unbind work.
+ *
+ * The vm_bind or vm_unbind async worker will wait for the input fence to signal
+ * before starting the binding or unbinding.
+ *
+ * The vm_bind or vm_unbind async worker will signal the returned output fence
+ * after the completion of binding or unbinding.
+ */
+struct drm_i915_vm_bind_fence {
+	/** @handle: User's handle for a drm_syncobj to wait on or signal. */
+	__u32 handle;
+
+	/**
+	 * @flags: Supported flags are,
+	 *
+	 * I915_VM_BIND_FENCE_WAIT:
+	 * Wait for the input fence before binding/unbinding
+	 *
+	 * I915_VM_BIND_FENCE_SIGNAL:
+	 * Return bind/unbind completion fence as output
+	 */
+	__u32 flags;
+#define I915_VM_BIND_FENCE_WAIT            (1<<0)
+#define I915_VM_BIND_FENCE_SIGNAL          (1<<1)
+#define __I915_VM_BIND_FENCE_UNKNOWN_FLAGS (-(I915_VM_BIND_FENCE_SIGNAL << 1))
+};
+
+/**
+ * struct drm_i915_vm_bind_ext_timeline_fences - Timeline fences for vm_bind
+ * and vm_unbind.
+ *
+ * This structure describes an array of timeline drm_syncobj and associated
+ * points for timeline variants of drm_syncobj. These timeline 'drm_syncobj's
+ * can be input or output fences (See struct drm_i915_vm_bind_fence).
+ */
+struct drm_i915_vm_bind_ext_timeline_fences {
+#define I915_VM_BIND_EXT_TIMELINE_FENCES	0
+	/** @base: Extension link. See struct i915_user_extension. */
+	struct i915_user_extension base;
+
+	/**
+	 * @fence_count: Number of elements in the @handles_ptr & @value_ptr
+	 * arrays.
+	 */
+	__u64 fence_count;
+
+	/**
+	 * @handles_ptr: Pointer to an array of struct drm_i915_vm_bind_fence
+	 * of length @fence_count.
+	 */
+	__u64 handles_ptr;
+
+	/**
+	 * @values_ptr: Pointer to an array of u64 values of length
+	 * @fence_count.
+	 * Values must be 0 for a binary drm_syncobj. A value of 0 for a
+	 * timeline drm_syncobj is invalid as it turns a drm_syncobj into a
+	 * binary one.
+	 */
+	__u64 values_ptr;
+};
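
Chaining this extension into a vm_bind call might look like the following
(hypothetical usage; the syncobj handles and points are assumed to exist,
and 'bind' is a struct drm_i915_gem_vm_bind as in the earlier sketch):

  struct drm_i915_vm_bind_fence fences[2] = {
          { .handle = in_syncobj,  .flags = I915_VM_BIND_FENCE_WAIT },
          { .handle = out_syncobj, .flags = I915_VM_BIND_FENCE_SIGNAL },
  };
  __u64 points[2] = { in_point, out_point }; /* 0 for binary syncobjs */

  struct drm_i915_vm_bind_ext_timeline_fences ext = {
          .base.name = I915_VM_BIND_EXT_TIMELINE_FENCES,
          .fence_count = 2,
          .handles_ptr = (__u64)(uintptr_t)fences,
          .values_ptr = (__u64)(uintptr_t)points,
  };

  bind.extensions = (__u64)(uintptr_t)&ext; /* then VM_BIND as usual */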
+
+/**
+ * struct drm_i915_vm_bind_user_fence - An input or output user fence for the
+ * vm_bind or the vm_unbind work.
+ *
+ * The vm_bind or vm_unbind async worker will wait for the input fence (value at
+ * @addr to become equal to @val) before starting the binding or unbinding.
+ *
+ * The vm_bind or vm_unbind async worker will signal the output fence after
+ * the completion of binding or unbinding by writing @val to the memory
+ * location at @addr.
+ */
+struct drm_i915_vm_bind_user_fence {
+	/** @addr: User/Memory fence qword aligned process virtual address */
+	__u64 addr;
+
+	/** @val: User/Memory fence value to be written after bind completion */
+	__u64 val;
+
+	/**
+	 * @flags: Supported flags are,
+	 *
+	 * I915_VM_BIND_USER_FENCE_WAIT:
+	 * Wait for the input fence before binding/unbinding
+	 *
+	 * I915_VM_BIND_USER_FENCE_SIGNAL:
+	 * Return bind/unbind completion fence as output
+	 */
+	__u32 flags;
+#define I915_VM_BIND_USER_FENCE_WAIT            (1<<0)
+#define I915_VM_BIND_USER_FENCE_SIGNAL          (1<<1)
+#define __I915_VM_BIND_USER_FENCE_UNKNOWN_FLAGS \
+	(-(I915_VM_BIND_USER_FENCE_SIGNAL << 1))
+};
+
+/**
+ * struct drm_i915_vm_bind_ext_user_fence - User/memory fences for vm_bind
+ * and vm_unbind.
+ *
+ * These user fences can be input or output fences
+ * (See struct drm_i915_vm_bind_user_fence).
+ */
+struct drm_i915_vm_bind_ext_user_fence {
+#define I915_VM_BIND_EXT_USER_FENCES	1
+	/** @base: Extension link. See struct i915_user_extension. */
+	struct i915_user_extension base;
+
+	/** @fence_count: Number of elements in the @user_fence_ptr array. */
+	__u64 fence_count;
+
+	/**
+	 * @user_fence_ptr: Pointer to an array of
+	 * struct drm_i915_vm_bind_user_fence of length @fence_count.
+	 */
+	__u64 user_fence_ptr;
+};
+
+/**
+ * struct drm_i915_gem_execbuffer_ext_batch_addresses - Array of batch buffer
+ * gpu virtual addresses.
+ *
+ * In the execbuff ioctl (See struct drm_i915_gem_execbuffer2), this extension
+ * must always be appended in the VM_BIND mode and it will be an error to
+ * append this extension in older non-VM_BIND mode.
+ */
+struct drm_i915_gem_execbuffer_ext_batch_addresses {
+#define DRM_I915_GEM_EXECBUFFER_EXT_BATCH_ADDRESSES	1
+	/** @base: Extension link. See struct i915_user_extension. */
+	struct i915_user_extension base;
+
+	/** @count: Number of addresses in the addr array. */
+	__u32 count;
+
+	/** @addr: An array of batch gpu virtual addresses. */
+	__u64 addr[0];
+};
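
A sketch of a VM_BIND mode submission using this extension (illustrative
only; ctx_id and batch_va are assumed, and the extension is passed through
the existing I915_EXEC_USE_EXTENSIONS mechanism):

  struct {
          struct drm_i915_gem_execbuffer_ext_batch_addresses ext;
          __u64 addr[1];
  } batch = {
          .ext.base.name = DRM_I915_GEM_EXECBUFFER_EXT_BATCH_ADDRESSES,
          .ext.count = 1,
          .addr = { batch_va },
  };

  struct drm_i915_gem_execbuffer2 execbuf = {
          .buffer_count = 0,      /* no execlist in VM_BIND mode */
          .cliprects_ptr = (__u64)(uintptr_t)&batch,
          .flags = I915_EXEC_USE_EXTENSIONS,
          .rsvd1 = ctx_id,
  };
  ioctl(fd, DRM_IOCTL_I915_GEM_EXECBUFFER2, &execbuf);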
+
+/**
+ * struct drm_i915_gem_execbuffer_ext_user_fence - First level batch completion
+ * signaling extension.
+ *
+ * This extension allows user to attach a user fence (@addr, @value pair) to an
+ * execbuf to be signaled by the command streamer after the completion of first
+ * level batch, by writing the @value at specified @addr and triggering an
+ * interrupt.
+ * The user can either poll for this user fence to signal, or wait on it
+ * with the i915_gem_wait_user_fence ioctl.
+ * This is very useful for long running contexts, where waiting on dma-fences
+ * by the user (like the i915_gem_wait ioctl) is not supported.
+ */
+struct drm_i915_gem_execbuffer_ext_user_fence {
+#define DRM_I915_GEM_EXECBUFFER_EXT_USER_FENCE		2
+	/** @base: Extension link. See struct i915_user_extension. */
+	struct i915_user_extension base;
+
+	/**
+	 * @addr: User/Memory fence qword aligned GPU virtual address.
+	 *
+	 * Address has to be a valid GPU virtual address at the time of
+	 * first level batch completion.
+	 */
+	__u64 addr;
+
+	/**
+	 * @value: User/Memory fence Value to be written to above address
+	 * after first level batch completes.
+	 */
+	__u64 value;
+
+	/** @rsvd: Reserved for future extensions, MBZ */
+	__u64 rsvd;
+};
+
+/**
+ * struct drm_i915_gem_create_ext_vm_private - Extension to make the object
+ * private to the specified VM.
+ *
+ * See struct drm_i915_gem_create_ext.
+ */
+struct drm_i915_gem_create_ext_vm_private {
+#define I915_GEM_CREATE_EXT_VM_PRIVATE		2
+	/** @base: Extension link. See struct i915_user_extension. */
+	struct i915_user_extension base;
+
+	/** @vm_id: Id of the VM to which the object is private */
+	__u32 vm_id;
+};
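
Creating a VM-private BO would then presumably look like this (a sketch;
vm_id is assumed to come from a VM created with the
I915_VM_CREATE_FLAGS_USE_VM_BIND flag):

  struct drm_i915_gem_create_ext_vm_private priv = {
          .base.name = I915_GEM_CREATE_EXT_VM_PRIVATE,
          .vm_id = vm_id,
  };
  struct drm_i915_gem_create_ext create = {
          .size = 4096,
          .extensions = (__u64)(uintptr_t)&priv,
  };
  ioctl(fd, DRM_IOCTL_I915_GEM_CREATE_EXT, &create);
  /* create.handle can now only be bound on vm_id and never exported */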
+
+/**
+ * struct drm_i915_gem_wait_user_fence - Wait on user/memory fence.
+ *
+ * User/Memory fence can be woken up either by:
+ *
+ * 1. GPU context indicated by @ctx_id, or,
+ * 2. Kernel driver async worker upon I915_UFENCE_WAIT_SOFT.
+ *    @ctx_id is ignored when this flag is set.
+ *
+ * Wakeup condition is,
+ * ``((*addr & mask) op (value & mask))``
+ *
+ * See :ref:`Documentation/driver-api/dma-buf.rst <indefinite_dma_fences>`
+ */
+struct drm_i915_gem_wait_user_fence {
+	/** @extensions: Zero-terminated chain of extensions. */
+	__u64 extensions;
+
+	/** @addr: User/Memory fence address */
+	__u64 addr;
+
+	/** @ctx_id: Id of the Context which will signal the fence. */
+	__u32 ctx_id;
+
+	/** @op: Wakeup condition operator */
+	__u16 op;
+#define I915_UFENCE_WAIT_EQ      0
+#define I915_UFENCE_WAIT_NEQ     1
+#define I915_UFENCE_WAIT_GT      2
+#define I915_UFENCE_WAIT_GTE     3
+#define I915_UFENCE_WAIT_LT      4
+#define I915_UFENCE_WAIT_LTE     5
+#define I915_UFENCE_WAIT_BEFORE  6
+#define I915_UFENCE_WAIT_AFTER   7
+
+	/**
+	 * @flags: Supported flags are,
+	 *
+	 * I915_UFENCE_WAIT_SOFT:
+	 *
+	 * To be woken up by i915 driver async worker (not by GPU).
+	 *
+	 * I915_UFENCE_WAIT_ABSTIME:
+	 *
+	 * Wait timeout specified as absolute time.
+	 */
+	__u16 flags;
+#define I915_UFENCE_WAIT_SOFT    0x1
+#define I915_UFENCE_WAIT_ABSTIME 0x2
+
+	/** @value: Wakeup value */
+	__u64 value;
+
+	/** @mask: Wakeup mask */
+	__u64 mask;
+#define I915_UFENCE_WAIT_U8     0xffu
+#define I915_UFENCE_WAIT_U16    0xffffu
+#define I915_UFENCE_WAIT_U32    0xfffffffful
+#define I915_UFENCE_WAIT_U64    0xffffffffffffffffull
+
+	/**
+	 * @timeout: Wait timeout in nanoseconds.
+	 *
+	 * If I915_UFENCE_WAIT_ABSTIME flag is set, then the timeout is the
+	 * absolute time in nsec.
+	 */
+	__s64 timeout;
+};
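
An illustrative wait (all values are examples only): block up to 1 ms for
the 64-bit value at fence_va to become equal to 'target', as signaled by
context ctx_id:

  struct drm_i915_gem_wait_user_fence wait = {
          .addr = fence_va,
          .ctx_id = ctx_id,
          .op = I915_UFENCE_WAIT_EQ,
          .value = target,
          .mask = I915_UFENCE_WAIT_U64,
          .timeout = 1000000,     /* ns; relative, ABSTIME not set */
  };
  ioctl(fd, DRM_IOCTL_I915_GEM_WAIT_USER_FENCE, &wait);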


^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [RFC v3 1/3] drm/doc/rfc: VM_BIND feature design document
  2022-05-19 22:52     ` [Intel-gfx] " Zanoni, Paulo R
@ 2022-05-23 19:05       ` Niranjana Vishwanathapura
  -1 siblings, 0 replies; 121+ messages in thread
From: Niranjana Vishwanathapura @ 2022-05-23 19:05 UTC (permalink / raw)
  To: Zanoni, Paulo R
  Cc: Brost, Matthew, intel-gfx, dri-devel, Hellstrom, Thomas, Wilson,
	Chris P, jason, Vetter, Daniel, christian.koenig

On Thu, May 19, 2022 at 03:52:01PM -0700, Zanoni, Paulo R wrote:
>On Tue, 2022-05-17 at 11:32 -0700, Niranjana Vishwanathapura wrote:
>> VM_BIND design document with description of intended use cases.
>>
>> v2: Add more documentation and format as per review comments
>>     from Daniel.
>>
>> Signed-off-by: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>
>> ---
>>
>> diff --git a/Documentation/gpu/rfc/i915_vm_bind.rst b/Documentation/gpu/rfc/i915_vm_bind.rst
>> new file mode 100644
>> index 000000000000..f1be560d313c
>> --- /dev/null
>> +++ b/Documentation/gpu/rfc/i915_vm_bind.rst
>> @@ -0,0 +1,304 @@
>> +==========================================
>> +I915 VM_BIND feature design and use cases
>> +==========================================
>> +
>> +VM_BIND feature
>> +================
>> +DRM_I915_GEM_VM_BIND/UNBIND ioctls allow UMDs to bind/unbind GEM buffer
>> +objects (BOs) or sections of a BO at specified GPU virtual addresses on a
>> +specified address space (VM). These mappings (also referred to as persistent
>> +mappings) will be persistent across multiple GPU submissions (execbuff calls)
>> +issued by the UMD, without the user having to provide a list of all required
>> +mappings during each submission (as required by the older execbuff mode).
>> +
>> +VM_BIND/UNBIND ioctls will support 'in' and 'out' fences to allow userspace
>> +to specify how the binding/unbinding should sync with other operations
>> +like the GPU job submission. These fences will be timeline 'drm_syncobj's
>> +for non-Compute contexts (See struct drm_i915_vm_bind_ext_timeline_fences).
>> +For Compute contexts, they will be user/memory fences (See struct
>> +drm_i915_vm_bind_ext_user_fence).
>> +
>> +VM_BIND feature is advertised to user via I915_PARAM_HAS_VM_BIND.
>> +User has to opt-in for VM_BIND mode of binding for an address space (VM)
>> +during VM creation time via I915_VM_CREATE_FLAGS_USE_VM_BIND extension.
>> +
>> +VM_BIND/UNBIND ioctl will immediately start binding/unbinding the mapping in an
>> +async worker. The binding and unbinding will work like a special GPU engine.
>> +The binding and unbinding operations are serialized and will wait on specified
>> +input fences before the operation and will signal the output fences upon the
>> +completion of the operation. Due to serialization, completion of an operation
>> +will also indicate that all previous operations are also complete.
>> +
>> +VM_BIND features include:
>> +
>> +* Multiple Virtual Address (VA) mappings can map to the same physical pages
>> +  of an object (aliasing).
>> +* VA mapping can map to a partial section of the BO (partial binding).
>> +* Support capture of persistent mappings in the dump upon GPU error.
>> +* TLB is flushed upon unbind completion. Batching of TLB flushes in some
>> +  use cases will be helpful.
>> +* Asynchronous vm_bind and vm_unbind support with 'in' and 'out' fences.
>> +* Support for userptr gem objects (no special uapi is required for this).
>> +
>> +Execbuff ioctl in VM_BIND mode
>> +-------------------------------
>> +The execbuff ioctl handling in VM_BIND mode differs significantly from the
>> +older method. A VM in VM_BIND mode will not support older execbuff mode of
>> +binding. In VM_BIND mode, execbuff ioctl will not accept any execlist. Hence,
>> +no support for implicit sync. It is expected that the below work will be able
>> +to support requirements of object dependency setting in all use cases:
>> +
>> +"dma-buf: Add an API for exporting sync files"
>> +(https://lwn.net/Articles/859290/)
>
>I would really like to have more details here. The link provided points
>to new ioctls and we're not very familiar with those yet, so I think
>you should really clarify the interaction between the new additions
>here. Having some sample code would be really nice too.
>
>For Mesa at least (and I believe for the other drivers too) we always
>have a few exported buffers in every execbuf call, and we rely on the
>implicit synchronization provided by execbuf to make sure everything
>works. The execbuf ioctl also has some code to flush caches during
>implicit synchronization AFAIR, so I would guess we rely on it too and
>whatever else the Kernel does. Is that covered by the new ioctls?
>
>In addition, as far as I remember, one of the big improvements of
>vm_bind was that it would help reduce ioctl latency and cpu overhead.
>But if making execbuf faster comes at the cost of requiring additional
>ioctls calls for implicit synchronization, which is required on ever
>execbuf call, then I wonder if we'll even get any faster at all.
>Comparing old execbuf vs plain new execbuf without the new required
>ioctls won't make sense.
>
>But maybe I'm wrong and we won't need to call these new ioctls around
>every single execbuf ioctl we submit? Again, more clarification and
>some code examples here would be really nice. This is a big change on
>an important part of the API, we should clarify the new expected usage.
>

Thanks Paulo for the comments.

In VM_BIND mode, the only reason we would need execlist support in the
execbuff path is for implicit synchronization. And AFAIK, this work
from Jason is expected to replace implicit synchronization with new ioctls.
Hence, VM_BIND mode will not be needing execlist support at all.

Based on comments from Daniel and my offline sync with Jason, this
new mechanism from Jason is expected to work for vl. For gl, there is a
question of whether it will be performant or not. But it is worth trying
that first. If it is not performant for gl, only then can we consider
adding implicit sync support back for VM_BIND mode.

Daniel, Jason, Ken, any thoughts you can add here?

>> +
>> +This also means, we need an execbuff extension to pass in the batch
>> +buffer addresses (See struct drm_i915_gem_execbuffer_ext_batch_addresses).
>> +
>> +If at all execlist support in execbuff ioctl is deemed necessary for
>> +implicit sync in certain use cases, then support can be added later.
>
>IMHO we really need to sort this and check all the assumptions before
>we commit to any interface. Again, implicit synchronization is
>something we rely on during *every* execbuf ioctl for most workloads.
>

Daniel's earlier feedback was that it is worth Mesa trying this new
mechanism for gl and seeing if it works. We want to avoid adding
execlist support for implicit sync in vm_bind mode from the beginning
if it is going to be deemed not necessary.

>
>> +In VM_BIND mode, VA allocation is completely managed by the user instead of
>> +the i915 driver. Hence, VA assignment and eviction are not applicable in
>> +VM_BIND mode. Also, for determining object activeness, VM_BIND mode will not
>> +be using the i915_vma active reference tracking. It will instead use dma-resv
>> +object for that (See `VM_BIND dma_resv usage`_).
>> +
>> +So, a lot of existing code in the execbuff path like relocations, VA evictions,
>> +vma lookup table, implicit sync, vma active reference tracking etc., are not
>> +applicable in VM_BIND mode. Hence, the execbuff path needs to be cleaned up
>> +by clearly separating out the functionalities where the VM_BIND mode differs
>> +from the older method, and they should be moved to separate files.
>
>I seem to recall some conversations where we were told a bunch of
>ioctls would stop working or make no sense to call when using vm_bind.
>Can we please get a complete list of those? Bonus points if the Kernel
>starts telling us we just called something that makes no sense.
>

Which ioctls are you talking about here?
We do not support GEM_WAIT ioctls, but that is only for compute mode (which is
already documented in this patch).

>> +
>> +VM_PRIVATE objects
>> +-------------------
>> +By default, BOs can be mapped on multiple VMs and can also be dma-buf
>> +exported. Hence these BOs are referred to as Shared BOs.
>> +During each execbuff submission, the request fence must be added to the
>> +dma-resv fence list of all shared BOs mapped on the VM.
>> +
>> +VM_BIND feature introduces an optimization where the user can create a BO which
>> +is private to a specified VM via I915_GEM_CREATE_EXT_VM_PRIVATE flag during
>> +BO creation. Unlike Shared BOs, these VM private BOs can only be mapped on
>> +the VM they are private to and can't be dma-buf exported.
>> +All private BOs of a VM share the dma-resv object. Hence during each execbuff
>> +submission, they need only one dma-resv fence list updated. Thus, the fast
>> +path (where required mappings are already bound) submission latency is O(1)
>> +w.r.t the number of VM private BOs.
>
>I know we already discussed this, but just to document it publicly: the
>ideal case for user space would be that every BO is created as private
>but then we'd have an ioctl to convert it to non-private (without the
>need to have a non-private->private interface).
>
>An explanation on why we can't have an ioctl to mark as exported a
>buffer that was previously vm_private would be really appreciated.
>

Ok, I can add some notes on that.
The reason is that this requires changing the dma-resv object of the
gem object, and hence the object locking also. This will add complications
as we have to sync with any pending operations. It might be easier for
UMDs to do it themselves by copying the object contents to a new object.

Niranjana

>Thanks,
>Paulo
>
>
>> +
>> +VM_BIND locking hierarchy
>> +-------------------------
>> +The locking design here supports the older (execlist based) execbuff mode, the
>> +newer VM_BIND mode, the VM_BIND mode with GPU page faults and possible future
>> +system allocator support (See `Shared Virtual Memory (SVM) support`_).
>> +The older execbuff mode and the newer VM_BIND mode without page faults manage
>> +residency of backing storage using dma_fence. The VM_BIND mode with page faults
>> +and the system allocator support do not use any dma_fence at all.
>> +
>> +VM_BIND locking order is as below.
>> +
>> +1) Lock-A: A vm_bind mutex will protect vm_bind lists. This lock is taken in
>> +   vm_bind/vm_unbind ioctl calls, in the execbuff path and while releasing the
>> +   mapping.
>> +
>> +   In future, when GPU page faults are supported, we can potentially use a
>> +   rwsem instead, so that multiple page fault handlers can take the read side
>> +   lock to lookup the mapping and hence can run in parallel.
>> +   The older execbuff mode of binding does not need this lock.
>> +
>> +2) Lock-B: The object's dma-resv lock will protect i915_vma state and needs to
>> +   be held while binding/unbinding a vma in the async worker and while updating
>> +   dma-resv fence list of an object. Note that private BOs of a VM will all
>> +   share a dma-resv object.
>> +
>> +   The future system allocator support will use the HMM prescribed locking
>> +   instead.
>> +
>> +3) Lock-C: Spinlock/s to protect some of the VM's lists like the list of
>> +   invalidated vmas (due to eviction and userptr invalidation) etc.
>> +
>> +When GPU page faults are supported, the execbuff path does not take any of
>> +these locks. There we will simply smash the new batch buffer address into
>> +the ring and then tell the scheduler to run that. The lock taking only
>> +happens from the page
>> +fault handler, where we take lock-A in read mode, whichever lock-B we need to
>> +find the backing storage (dma_resv lock for gem objects, and hmm/core mm for
>> +system allocator) and some additional locks (lock-D) for taking care of page
>> +table races. Page fault mode should not need to ever manipulate the vm lists,
>> +so won't ever need lock-C.
>> +
>> +VM_BIND LRU handling
>> +---------------------
>> +We need to ensure VM_BIND mapped objects are properly LRU tagged to avoid
>> +performance degradation. We will also need support for bulk LRU movement of
>> +VM_BIND objects to avoid additional latencies in execbuff path.
>> +
>> +The page table pages are similar to VM_BIND mapped objects (See
>> +`Evictable page table allocations`_) and are maintained per VM and need to
>> +be pinned in memory when VM is made active (ie., upon an execbuff call with
>> +that VM). So, bulk LRU movement of page table pages is also needed.
>> +
>> +The i915 shrinker LRU has stopped being an LRU. So, it should also be moved
>> +over to the ttm LRU in some fashion to make sure we once again have a reasonable
>> +and consistent memory aging and reclaim architecture.
>> +
>> +VM_BIND dma_resv usage
>> +-----------------------
>> +Fences needs to be added to all VM_BIND mapped objects. During each execbuff
>> +submission, they are added with DMA_RESV_USAGE_BOOKKEEP usage to prevent
>> +over sync (See enum dma_resv_usage). One can override it with either
>> +DMA_RESV_USAGE_READ or DMA_RESV_USAGE_WRITE usage during object dependency
>> +setting (either through explicit or implicit mechanism).
>> +
>> +When vm_bind is called for a non-private object while the VM is already
>> +active, the fences need to be copied from VM's shared dma-resv object
>> +(common to all private objects of the VM) to this non-private object.
>> +If this results in performance degradation, then some optimization will
>> +be needed here. This is not a problem for VM's private objects as they use
>> +shared dma-resv object which is always updated on each execbuff submission.
>> +
>> +Also, in VM_BIND mode, use dma-resv apis for determining object activeness
>> +(See dma_resv_test_signaled() and dma_resv_wait_timeout()) and do not use the
>> +older i915_vma active reference tracking which is deprecated. This should be
>> +easier to get working with the current TTM backend. We can remove the
>> +i915_vma active reference tracking fully while supporting TTM backend for igfx.
>> +
>> +Evictable page table allocations
>> +---------------------------------
>> +Make pagetable allocations evictable and manage them similar to VM_BIND
>> +mapped objects. Page table pages are similar to persistent mappings of a
>> +VM (difference here are that the page table pages will not have an i915_vma
>> +structure and after swapping pages back in, parent page link needs to be
>> +updated).
>> +
>> +Mesa use case
>> +--------------
>> +VM_BIND can potentially reduce the CPU overhead in Mesa (both Vulkan and Iris),
>> +hence improving performance of CPU-bound applications. It also allows us to
>> +implement Vulkan's Sparse Resources. With increasing GPU hardware performance,
>> +reducing CPU overhead becomes more impactful.
>> +
>> +
>> +VM_BIND Compute support
>> +========================
>> +
>> +User/Memory Fence
>> +------------------
>> +The idea is to take a user specified virtual address and install an interrupt
>> +handler to wake up the current task when the memory location passes the user
>> +supplied filter. User/Memory fence is a <address, value> pair. To signal the
>> +user fence, specified value will be written at the specified virtual address
>> +and wakeup the waiting process. User can wait on a user fence with the
>> +gem_wait_user_fence ioctl.
>> +
>> +It also allows the user to emit their own MI_FLUSH/PIPE_CONTROL notify
>> +interrupt within their batches after updating the value to have sub-batch
>> +precision on the wakeup. Each batch can signal a user fence to indicate
>> +the completion of next level batch. The completion of very first level batch
>> +needs to be signaled by the command streamer. The user must provide the
>> +user/memory fence for this via the DRM_I915_GEM_EXECBUFFER_EXT_USER_FENCE
>> +extension of execbuff ioctl, so that KMD can setup the command streamer to
>> +signal it.
>> +
>> +User/Memory fence can also be supplied to the kernel driver to signal/wake up
>> +the user process after completion of an asynchronous operation.
>> +
>> +When the VM_BIND ioctl is provided with a user/memory fence via the
>> +I915_VM_BIND_EXT_USER_FENCE extension, it will be signaled upon the completion
>> +of binding of that mapping. All async binds/unbinds are serialized, hence
>> +signaling of the user/memory fence also indicates the completion of all
>> +previous binds/unbinds.
>> +
>> +This feature will be derived from the below original work:
>> +https://patchwork.freedesktop.org/patch/349417/
>> +
>> +Long running Compute contexts
>> +------------------------------
>> +Usage of dma-fences expects that they complete in a reasonable amount of time.
>> +Compute on the other hand can be long running. Hence it is appropriate for
>> +compute to use user/memory fences, with dma-fence usage limited to in-kernel
>> +consumption only. This requires an execbuff uapi extension to pass in user
>> +fences (See struct drm_i915_vm_bind_ext_user_fence). Compute must opt in to
>> +this mechanism with the I915_CONTEXT_CREATE_FLAGS_LONG_RUNNING flag during
>> +context creation. The dma-fence based user interfaces like the gem_wait ioctl
>> +and execbuff out fences are not allowed on long running contexts. Implicit
>> +sync is not valid either, and is anyway not supported in VM_BIND mode.
>> +
>> +Where GPU page faults are not available, the kernel driver will, upon buffer
>> +invalidation, initiate a suspend (preemption) of the long running context with
>> +a dma-fence attached to it. Upon completion of that suspend fence, it will
>> +finish the invalidation, revalidate the BO and then resume the compute
>> +context. This is done by having a per-context preempt fence (also called
>> +suspend fence) proxying as the i915_request fence. This suspend fence is
>> +enabled when someone tries to wait on it, which then triggers the context
>> +preemption.
>> +
>> +As this support for context suspension using a preempt fence, plus the resume
>> +work for compute mode contexts, can be tricky to get right, it is better to
>> +add this support to the drm scheduler so that multiple drivers can make use
>> +of it. That means it will have a dependency on the i915 drm scheduler
>> +conversion with the GuC scheduler backend. This should be fine, as the plan
>> +is to support compute mode contexts only with the GuC scheduler backend (at
>> +least initially). This is much easier to support with VM_BIND mode compared
>> +to the current heavier execbuff path resource attachment.
>> +
>> +Low Latency Submission
>> +-----------------------
>> +Allows compute UMDs to submit GPU jobs directly instead of going through the
>> +execbuff ioctl. This is made possible by VM_BIND not being synchronized
>> +against execbuff. VM_BIND allows bind/unbind of the mappings required for
>> +the directly submitted jobs.
>> +
>> +Other VM_BIND use cases
>> +========================
>> +
>> +Debugger
>> +---------
>> +With the debug event interface, a user space process (the debugger) is able
>> +to keep track of and act upon resources created by another process (the
>> +debuggee) and attached to the GPU via the vm_bind interface.
>> +
>> +GPU page faults
>> +----------------
>> +GPU page faults, when supported (in the future), will only be available in
>> +VM_BIND mode. While both the older execbuff mode and the newer VM_BIND mode
>> +of binding require using dma-fences to ensure residency, the GPU page fault
>> +mode, when supported, will not use any dma-fence, as residency is purely
>> +managed by installing and removing/invalidating page table entries.
>> +
>> +Page level hints settings
>> +--------------------------
>> +VM_BIND allows hints to be set per mapping instead of per BO.
>> +Possible hints include read-only mapping, placement and atomicity.
>> +Sub-BO level placement hints will be even more relevant with
>> +upcoming GPU on-demand page fault support.
>> +
>> +Page level Cache/CLOS settings
>> +-------------------------------
>> +VM_BIND allows cache/CLOS settings per mapping instead of per BO.
>> +
>> +Shared Virtual Memory (SVM) support
>> +------------------------------------
>> +VM_BIND interface can be used to map system memory directly (without gem BO
>> +abstraction) using the HMM interface. SVM is only supported with GPU page
>> +faults enabled.
>> +
>> +
>> +Broader i915 cleanups
>> +=====================
>> +Supporting this whole new vm_bind mode of binding, which comes with its own
>> +use cases and locking requirements, requires proper integration with the
>> +existing i915 driver. This calls for some broader i915 driver
>> +cleanups/simplifications to keep the driver maintainable going forward.
>> +Here are a few things that have been identified and are being looked into.
>> +
>> +- Remove vma lookup cache (eb->gem_context->handles_vma). VM_BIND feature
>> +  do not use it and complexity it brings in is probably more than the
>> +  performance advantage we get in legacy execbuff case.
>> +- Remove vma->open_count counting
>> +- Remove i915_vma active reference tracking. VM_BIND feature will not be using
>> +  it. Instead use underlying BO's dma-resv fence list to determine if a i915_vma
>> +  is active or not.
>> +
>> +
>> +VM_BIND UAPI
>> +=============
>> +
>> +.. kernel-doc:: Documentation/gpu/rfc/i915_vm_bind.h
>> diff --git a/Documentation/gpu/rfc/index.rst b/Documentation/gpu/rfc/index.rst
>> index 91e93a705230..7d10c36b268d 100644
>> --- a/Documentation/gpu/rfc/index.rst
>> +++ b/Documentation/gpu/rfc/index.rst
>> @@ -23,3 +23,7 @@ host such documentation:
>>  .. toctree::
>>
>>      i915_scheduler.rst
>> +
>> +.. toctree::
>> +
>> +    i915_vm_bind.rst
>

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [Intel-gfx] [RFC v3 1/3] drm/doc/rfc: VM_BIND feature design document
@ 2022-05-23 19:05       ` Niranjana Vishwanathapura
  0 siblings, 0 replies; 121+ messages in thread
From: Niranjana Vishwanathapura @ 2022-05-23 19:05 UTC (permalink / raw)
  To: Zanoni, Paulo R
  Cc: intel-gfx, dri-devel, Hellstrom, Thomas, Wilson, Chris P, Vetter,
	Daniel, christian.koenig

On Thu, May 19, 2022 at 03:52:01PM -0700, Zanoni, Paulo R wrote:
>On Tue, 2022-05-17 at 11:32 -0700, Niranjana Vishwanathapura wrote:
>> VM_BIND design document with description of intended use cases.
>>
>> v2: Add more documentation and format as per review comments
>>     from Daniel.
>>
>> Signed-off-by: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>
>> ---
>>
>> diff --git a/Documentation/gpu/rfc/i915_vm_bind.rst b/Documentation/gpu/rfc/i915_vm_bind.rst
>> new file mode 100644
>> index 000000000000..f1be560d313c
>> --- /dev/null
>> +++ b/Documentation/gpu/rfc/i915_vm_bind.rst
>> @@ -0,0 +1,304 @@
>> +==========================================
>> +I915 VM_BIND feature design and use cases
>> +==========================================
>> +
>> +VM_BIND feature
>> +================
>> +DRM_I915_GEM_VM_BIND/UNBIND ioctls allow UMDs to bind/unbind GEM buffer
>> +objects (BOs) or sections of BOs at specified GPU virtual addresses on a
>> +specified address space (VM). These mappings (also referred to as persistent
>> +mappings) will be persistent across multiple GPU submissions (execbuff calls)
>> +issued by the UMD, without the user having to provide a list of all required
>> +mappings during each submission (as required by the older execbuff mode).
>> +
>> +VM_BIND/UNBIND ioctls will support 'in' and 'out' fences to allow userspace
>> +to specify how the binding/unbinding should sync with other operations
>> +like the GPU job submission. These fences will be timeline 'drm_syncobj's
>> +for non-Compute contexts (See struct drm_i915_vm_bind_ext_timeline_fences).
>> +For Compute contexts, they will be user/memory fences (See struct
>> +drm_i915_vm_bind_ext_user_fence).
>> +
>> +VM_BIND feature is advertised to user via I915_PARAM_HAS_VM_BIND.
>> +User has to opt-in for VM_BIND mode of binding for an address space (VM)
>> +during VM creation time via I915_VM_CREATE_FLAGS_USE_VM_BIND extension.
>> +
>> +VM_BIND/UNBIND ioctl will immediately start binding/unbinding the mapping in an
>> +async worker. The binding and unbinding will work like a special GPU engine.
>> +The binding and unbinding operations are serialized and will wait on specified
>> +input fences before the operation and will signal the output fences upon the
>> +completion of the operation. Due to serialization, completion of an operation
>> +will also indicate that all previous operations are also complete.
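
For illustration, the expected flow is roughly the below (ioctl and struct
names are the ones proposed in patch 3/3 of this series, so the exact
fields are still RFC material):

    /* 1) Create a VM that opts into VM_BIND mode. */
    struct drm_i915_gem_vm_control vm_create = {
            .flags = I915_VM_CREATE_FLAGS_USE_VM_BIND, /* proposed */
    };
    ioctl(fd, DRM_IOCTL_I915_GEM_VM_CREATE, &vm_create);

    /* 2) Bind (a section of) a BO at a UMD-chosen GPU VA. */
    struct drm_i915_gem_vm_bind bind = {               /* proposed */
            .vm_id  = vm_create.vm_id,
            .handle = bo_handle,
            .start  = gpu_va,  /* VA is managed by the UMD */
            .offset = 0,       /* partial binds: offset into the BO */
            .length = bo_size,
    };
    ioctl(fd, DRM_IOCTL_I915_GEM_VM_BIND, &bind);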
>> +
>> +VM_BIND features include:
>> +
>> +* Multiple Virtual Address (VA) mappings can map to the same physical pages
>> +  of an object (aliasing).
>> +* VA mapping can map to a partial section of the BO (partial binding).
>> +* Support capture of persistent mappings in the dump upon GPU error.
>> +* TLB is flushed upon unbind completion. Batching of TLB flushes in some
>> +  use cases will be helpful.
>> +* Asynchronous vm_bind and vm_unbind support with 'in' and 'out' fences.
>> +* Support for userptr gem objects (no special uapi is required for this).
>> +
>> +Execbuff ioctl in VM_BIND mode
>> +-------------------------------
>> +The execbuff ioctl handling in VM_BIND mode differs significantly from the
>> +older method. A VM in VM_BIND mode will not support older execbuff mode of
>> +binding. In VM_BIND mode, execbuff ioctl will not accept any execlist. Hence,
>> +no support for implicit sync. It is expected that the below work will be able
>> +to support requirements of object dependency setting in all use cases:
>> +
>> +"dma-buf: Add an API for exporting sync files"
>> +(https://lwn.net/Articles/859290/)
>
>I would really like to have more details here. The link provided points
>to new ioctls and we're not very familiar with those yet, so I think
>you should really clarify the interaction between the new additions
>here. Having some sample code would be really nice too.
>
>For Mesa at least (and I believe for the other drivers too) we always
>have a few exported buffers in every execbuf call, and we rely on the
>implicit synchronization provided by execbuf to make sure everything
>works. The execbuf ioctl also has some code to flush caches during
>implicit synchronization AFAIR, so I would guess we rely on it too and
>whatever else the Kernel does. Is that covered by the new ioctls?
>
>In addition, as far as I remember, one of the big improvements of
>vm_bind was that it would help reduce ioctl latency and cpu overhead.
>But if making execbuf faster comes at the cost of requiring additional
>ioctl calls for implicit synchronization, which is required on every
>execbuf call, then I wonder if we'll even get any faster at all.
>Comparing old execbuf vs plain new execbuf without the new required
>ioctls won't make sense.
>
>But maybe I'm wrong and we won't need to call these new ioctls around
>every single execbuf ioctl we submit? Again, more clarification and
>some code examples here would be really nice. This is a big change on
>an important part of the API, we should clarify the new expected usage.
>

Thanks Paulo for the comments.

In VM_BIND mode, the only reason we would need execlist support in the
execbuff path is for implicit synchronization. And AFAIK, this work
from Jason is expected to replace implicit synchronization with new ioctls.
Hence, VM_BIND mode will not be needing execlist support at all.

Based on comments from Daniel and my offline sync with Jason, this
new mechanism from Jason is expected to work for vl. For gl, there is a
question of whether it will be performant or not. But it is worth trying
that first. If it is not performant for gl, only then can we consider
adding implicit sync support back for VM_BIND mode.

Daniel, Jason, Ken, any thoughts you can add here?
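
To make the expected flow a bit more concrete, roughly the below (the
ioctl and struct names are from Jason's proposed series linked above,
and error handling is omitted):

    /* Before submission: pull the dma-buf's implicit fences out and
     * hand them to execbuff as an explicit 'in' fence. */
    struct dma_buf_export_sync_file export = {
            .flags = DMA_BUF_SYNC_WRITE,
    };
    ioctl(dmabuf_fd, DMA_BUF_IOCTL_EXPORT_SYNC_FILE, &export);
    /* export.fd is a sync_file fd, usable via I915_EXEC_FENCE_IN or
     * the timeline fences extension. */

    /* After submission: push the batch's 'out' fence back into the
     * dma-buf so implicit-sync consumers (compositor etc.) wait. */
    struct dma_buf_import_sync_file import = {
            .flags = DMA_BUF_SYNC_WRITE,
            .fd    = out_fence_fd,
    };
    ioctl(dmabuf_fd, DMA_BUF_IOCTL_IMPORT_SYNC_FILE, &import);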

>> +
>> +This also means, we need an execbuff extension to pass in the batch
>> +buffer addresses (See struct drm_i915_gem_execbuffer_ext_batch_addresses).
>> +
>> +If at all execlist support in execbuff ioctl is deemed necessary for
>> +implicit sync in certain use cases, then support can be added later.
>
>IMHO we really need to sort this and check all the assumptions before
>we commit to any interface. Again, implicit synchronization is
>something we rely on during *every* execbuf ioctl for most workloads.
>

Daniel's earlier feedback was that it is worth Mesa trying this new
mechanism for gl and seeing if it works. We want to avoid adding execlist
support for implicit sync in vm_bind mode from the beginning if it is
going to be deemed unnecessary.

>
>> +In VM_BIND mode, VA allocation is completely managed by the user instead of
>> +the i915 driver. Hence, VA assignment and eviction are not applicable in
>> +VM_BIND mode. Also, for determining object activeness, VM_BIND mode will not
>> +be using the i915_vma active reference tracking. It will instead use dma-resv
>> +object for that (See `VM_BIND dma_resv usage`_).
>> +
>> +So, a lot of existing code in the execbuff path like relocations, VA evictions,
>> +vma lookup table, implicit sync, vma active reference tracking etc., is not
>> +applicable in VM_BIND mode. Hence, the execbuff path needs to be cleaned up
>> +by clearly separating out the functionalities where the VM_BIND mode differs
>> +from older method and they should be moved to separate files.
>
>I seem to recall some conversations where we were told a bunch of
>ioctls would stop working or make no sense to call when using vm_bind.
>Can we please get a complete list of those? Bonus points if the Kernel
>starts telling us we just called something that makes no sense.
>

Which ioctls are you talking about here?
We do not support GEM_WAIT ioctls, but that is only for compute mode (which is
already documented in this patch).

>> +
>> +VM_PRIVATE objects
>> +-------------------
>> +By default, BOs can be mapped on multiple VMs and can also be dma-buf
>> +exported. Hence these BOs are referred to as Shared BOs.
>> +During each execbuff submission, the request fence must be added to the
>> +dma-resv fence list of all shared BOs mapped on the VM.
>> +
>> +VM_BIND feature introduces an optimization where user can create BO which
>> +is private to a specified VM via I915_GEM_CREATE_EXT_VM_PRIVATE flag during
>> +BO creation. Unlike Shared BOs, these VM private BOs can only be mapped on
>> +the VM they are private to and can't be dma-buf exported.
>> +All private BOs of a VM share the dma-resv object. Hence during each execbuff
>> +submission, they need only one dma-resv fence list updated. Thus, the fast
>> +path (where required mappings are already bound) submission latency is O(1)
>> +w.r.t the number of VM private BOs.
>
>I know we already discussed this, but just to document it publicly: the
>ideal case for user space would be that every BO is created as private
>but then we'd have an ioctl to convert it to non-private (without the
>need to have a non-private->private interface).
>
>An explanation on why we can't have an ioctl to mark as exported a
>buffer that was previously vm_private would be really appreciated.
>

Ok, I can add some notes on that.
The reason is that this requires changing the dma-resv object of the gem
object, and hence the object locking as well. This will add complications,
as we have to sync with any pending operations. It might be easier for
UMDs to do it themselves by copying the object contents to a new object.
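
For reference, creating a VM private BO would look roughly like the below
(extension shape per patch 3/3 of this series, still subject to change):

    struct drm_i915_gem_create_ext_vm_private vm_priv = { /* proposed */
            .base.name = I915_GEM_CREATE_EXT_VM_PRIVATE,
            .vm_id     = vm_id,
    };
    struct drm_i915_gem_create_ext create = {
            .size       = bo_size,
            .extensions = (__u64)(uintptr_t)&vm_priv,
    };
    ioctl(fd, DRM_IOCTL_I915_GEM_CREATE_EXT, &create);
    /* create.handle can only ever be bound on vm_id and cannot be
     * dma-buf exported. */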

Niranjana

>Thanks,
>Paulo
>
>
>> +
>> +VM_BIND locking hierarchy
>> +-------------------------
>> +The locking design here supports the older (execlist based) execbuff mode, the
>> +newer VM_BIND mode, the VM_BIND mode with GPU page faults and possible future
>> +system allocator support (See `Shared Virtual Memory (SVM) support`_).
>> +The older execbuff mode and the newer VM_BIND mode without page faults manage
>> +residency of backing storage using dma_fence. The VM_BIND mode with page faults
>> +and the system allocator support do not use any dma_fence at all.
>> +
>> +VM_BIND locking order is as below.
>> +
>> +1) Lock-A: A vm_bind mutex will protect vm_bind lists. This lock is taken in
>> +   vm_bind/vm_unbind ioctl calls, in the execbuff path and while releasing the
>> +   mapping.
>> +
>> +   In future, when GPU page faults are supported, we can potentially use a
>> +   rwsem instead, so that multiple page fault handlers can take the read side
>> +   lock to lookup the mapping and hence can run in parallel.
>> +   The older execbuff mode of binding does not need this lock.
>> +
>> +2) Lock-B: The object's dma-resv lock will protect i915_vma state and needs to
>> +   be held while binding/unbinding a vma in the async worker and while updating
>> +   dma-resv fence list of an object. Note that private BOs of a VM will all
>> +   share a dma-resv object.
>> +
>> +   The future system allocator support will use the HMM prescribed locking
>> +   instead.
>> +
>> +3) Lock-C: Spinlock/s to protect some of the VM's lists like the list of
>> +   invalidated vmas (due to eviction and userptr invalidation) etc.
>> +
>> +When GPU page faults are supported, the execbuff path does not take any of these
>> +locks. There we will simply smash the new batch buffer address into the ring and
>> +then tell the scheduler to run that. The lock taking only happens from the page
>> +fault handler, where we take lock-A in read mode, whichever lock-B we need to
>> +find the backing storage (dma_resv lock for gem objects, and hmm/core mm for
>> +system allocator) and some additional locks (lock-D) for taking care of page
>> +table races. Page fault mode should not need to ever manipulate the vm lists,
>> +so won't ever need lock-C.
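
In other words, the nesting in the async bind/unbind worker would look
roughly like the below (lock and field names are illustrative only):

    mutex_lock(&vm->vm_bind_lock);           /* Lock-A */
    dma_resv_lock(obj->base.resv, NULL);     /* Lock-B */
    /* bind/unbind the vma, update the object's dma-resv fence list */
    dma_resv_unlock(obj->base.resv);
    mutex_unlock(&vm->vm_bind_lock);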
>> +
>> +VM_BIND LRU handling
>> +---------------------
>> +We need to ensure VM_BIND mapped objects are properly LRU tagged to avoid
>> +performance degradation. We will also need support for bulk LRU movement of
>> +VM_BIND objects to avoid additional latencies in execbuff path.
>> +
>> +The page table pages are similar to VM_BIND mapped objects (See
>> +`Evictable page table allocations`_) and are maintained per VM and needs to
>> +be pinned in memory when VM is made active (ie., upon an execbuff call with
>> +that VM). So, bulk LRU movement of page table pages is also needed.
>> +
>> +The i915 shrinker LRU has stopped being an LRU. So, it should also be moved
>> +over to the ttm LRU in some fashion to make sure we once again have a reasonable
>> +and consistent memory aging and reclaim architecture.
>> +
>> +VM_BIND dma_resv usage
>> +-----------------------
>> +Fences need to be added to all VM_BIND mapped objects. During each execbuff
>> +submission, they are added with DMA_RESV_USAGE_BOOKKEEP usage to prevent
>> +over sync (See enum dma_resv_usage). One can override it with either
>> +DMA_RESV_USAGE_READ or DMA_RESV_USAGE_WRITE usage during object dependency
>> +setting (either through explicit or implicit mechanism).
>> +
>> +When vm_bind is called for a non-private object while the VM is already
>> +active, the fences need to be copied from VM's shared dma-resv object
>> +(common to all private objects of the VM) to this non-private object.
>> +If this results in performance degradation, then some optimization will
>> +be needed here. This is not a problem for VM's private objects as they use
>> +shared dma-resv object which is always updated on each execbuff submission.
>> +
>> +Also, in VM_BIND mode, use dma-resv apis for determining object activeness
>> +(See dma_resv_test_signaled() and dma_resv_wait_timeout()) and do not use the
>> +older i915_vma active reference tracking which is deprecated. This should be
>> +easier to get it working with the current TTM backend. We can remove the
>> +i915_vma active reference tracking fully while supporting TTM backend for igfx.
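
Illustratively (the dma_resv_usage api itself is upstream; how i915 wires
it up here is the RFC part):

    /* execbuff submission: bookkeeping fence to avoid over sync */
    dma_resv_add_fence(obj->base.resv, &rq->fence,
                       DMA_RESV_USAGE_BOOKKEEP);

    /* activeness check, replacing i915_vma active tracking */
    if (dma_resv_test_signaled(obj->base.resv, DMA_RESV_USAGE_BOOKKEEP))
            /* object is idle */;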
>> +
>> +Evictable page table allocations
>> +---------------------------------
>> +Make pagetable allocations evictable and manage them similar to VM_BIND
>> +mapped objects. Page table pages are similar to persistent mappings of a
>> +VM (the differences being that the page table pages will not have an i915_vma
>> +structure and that, after swapping pages back in, the parent page link needs
>> +to be updated).
>> +
>> +Mesa use case
>> +--------------
>> +VM_BIND can potentially reduce the CPU overhead in Mesa (both Vulkan and Iris),
>> +hence improving performance of CPU-bound applications. It also allows us to
>> +implement Vulkan's Sparse Resources. With increasing GPU hardware performance,
>> +reducing CPU overhead becomes more impactful.
>> +
>> +
>> +VM_BIND Compute support
>> +========================
>> +
>> +User/Memory Fence
>> +------------------
>> +The idea is to take a user specified virtual address and install an interrupt
>> +handler to wake up the current task when the memory location passes the user
>> +supplied filter. A user/memory fence is an <address, value> pair. To signal the
>> +user fence, the specified value will be written at the specified virtual address
>> +and the waiting process woken up. The user can wait on a user fence with the
>> +gem_wait_user_fence ioctl.
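
A sketch of the wait (struct and op names per the RFC uapi header, all
still subject to change):

    struct drm_i915_gem_wait_user_fence wait = {    /* proposed */
            .addr    = fence_addr,            /* the <address, ...  */
            .value   = done_value,            /*  ..., value> pair  */
            .op      = I915_UFENCE_WAIT_EQ,   /* wake when *addr == value */
            .timeout = -1,                    /* no timeout */
    };
    ioctl(fd, DRM_IOCTL_I915_GEM_WAIT_USER_FENCE, &wait);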
>> +
>> +It also allows the user to emit their own MI_FLUSH/PIPE_CONTROL notify
>> +interrupt within their batches after updating the value to have sub-batch
>> +precision on the wakeup. Each batch can signal a user fence to indicate
>> +the completion of the next-level batch. The completion of the very first level
>> +batch needs to be signaled by the command streamer. The user must provide the
>> +user/memory fence for this via the DRM_I915_GEM_EXECBUFFER_EXT_USER_FENCE
>> +extension of execbuff ioctl, so that KMD can setup the command streamer to
>> +signal it.
>> +
>> +User/Memory fence can also be supplied to the kernel driver to signal/wake up
>> +the user process after completion of an asynchronous operation.
>> +
>> +When the VM_BIND ioctl is provided with a user/memory fence via the
>> +I915_VM_BIND_EXT_USER_FENCE extension, it will be signaled upon the completion
>> +of binding of that mapping. All async binds/unbinds are serialized, hence
>> +signaling of the user/memory fence also indicates the completion of all previous
>> +binds/unbinds.
>> +
>> +This feature will be derived from the below original work:
>> +https://patchwork.freedesktop.org/patch/349417/
>> +
>> +Long running Compute contexts
>> +------------------------------
>> +Usage of dma-fence expects that they complete in a reasonable amount of time.
>> +Compute on the other hand can be long running. Hence it is appropriate for
>> +compute to use user/memory fence and dma-fence usage will be limited to
>> +in-kernel consumption only. This requires an execbuff uapi extension to pass
>> +in a user fence (See struct drm_i915_vm_bind_ext_user_fence). Compute must opt-in
>> +for this mechanism with I915_CONTEXT_CREATE_FLAGS_LONG_RUNNING flag during
>> +context creation. The dma-fence based user interfaces like gem_wait ioctl and
>> +execbuff out fence are not allowed on long running contexts. Implicit sync is
>> +not valid either, and is anyway not supported in VM_BIND mode.
>> +
>> +Where GPU page faults are not available, the kernel driver, upon buffer
>> +invalidation, will initiate a suspend (preemption) of the long running context
>> +with a dma-fence attached to it. Upon completion of that suspend fence, it will
>> +finish the invalidation, revalidate the BO and then resume the compute context.
>> +This is done by having a per-context preempt fence (also called suspend fence)
>> +proxying as the i915_request fence. This suspend fence is enabled when someone
>> +tries to wait on it, which then triggers the context preemption.
>> +
>> +As this support for context suspension using a preempt fence and the resume work
>> +for the compute mode contexts can be tricky to get right, it is better to
>> +add this support in the drm scheduler so that multiple drivers can make use of it.
>> +That means, it will have a dependency on i915 drm scheduler conversion with GuC
>> +scheduler backend. This should be fine, as the plan is to support compute mode
>> +contexts only with GuC scheduler backend (at least initially). This is much
>> +easier to support with VM_BIND mode compared to the current heavier execbuff
>> +path resource attachment.
>> +
>> +Low Latency Submission
>> +-----------------------
>> +Allows compute UMD to directly submit GPU jobs instead of through execbuff
>> +ioctl. This is made possible by VM_BIND not being synchronized against
>> +execbuff. VM_BIND allows bind/unbind of mappings required for the directly
>> +submitted jobs.
>> +
>> +Other VM_BIND use cases
>> +========================
>> +
>> +Debugger
>> +---------
>> +With the debug event interface, a user space process (the debugger) is able to
>> +keep track of and act upon resources created by another process (the debuggee)
>> +and attached to the GPU via the vm_bind interface.
>> +
>> +GPU page faults
>> +----------------
>> +GPU page faults, when supported (in future), will only be supported in the
>> +VM_BIND mode. While both the older execbuff mode and the newer VM_BIND mode of
>> +binding will require using dma-fence to ensure residency, the GPU page fault
>> +mode, when supported, will not use any dma-fence as residency is purely managed
>> +by installing and removing/invalidating page table entries.
>> +
>> +Page level hints settings
>> +--------------------------
>> +VM_BIND allows hints to be set per mapping instead of per BO.
>> +Possible hints include read-only mapping, placement and atomicity.
>> +Sub-BO level placement hint will be even more relevant with
>> +upcoming GPU on-demand page fault support.
>> +
>> +Page level Cache/CLOS settings
>> +-------------------------------
>> +VM_BIND allows cache/CLOS settings per mapping instead of per BO.
>> +
>> +Shared Virtual Memory (SVM) support
>> +------------------------------------
>> +VM_BIND interface can be used to map system memory directly (without gem BO
>> +abstraction) using the HMM interface. SVM is only supported with GPU page
>> +faults enabled.
>> +
>> +
>> +Broader i915 cleanups
>> +=====================
>> +Supporting this whole new vm_bind mode of binding, which comes with its own
>> +use cases and locking requirements, requires proper integration with the
>> +existing i915 driver. This calls for some broader i915 driver
>> +cleanups/simplifications for maintainability of the driver going forward.
>> +Here are a few things that have been identified and are being looked into.
>> +
>> +- Remove the vma lookup cache (eb->gem_context->handles_vma). The VM_BIND
>> +  feature does not use it, and the complexity it brings in is probably more
>> +  than the performance advantage we get in the legacy execbuff case.
>> +- Remove vma->open_count counting
>> +- Remove i915_vma active reference tracking. The VM_BIND feature will not be
>> +  using it. Instead, use the underlying BO's dma-resv fence list to determine
>> +  if an i915_vma is active or not.
>> +
>> +
>> +VM_BIND UAPI
>> +=============
>> +
>> +.. kernel-doc:: Documentation/gpu/rfc/i915_vm_bind.h
>> diff --git a/Documentation/gpu/rfc/index.rst b/Documentation/gpu/rfc/index.rst
>> index 91e93a705230..7d10c36b268d 100644
>> --- a/Documentation/gpu/rfc/index.rst
>> +++ b/Documentation/gpu/rfc/index.rst
>> @@ -23,3 +23,7 @@ host such documentation:
>>  .. toctree::
>>
>>      i915_scheduler.rst
>> +
>> +.. toctree::
>> +
>> +    i915_vm_bind.rst
>

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [RFC v3 1/3] drm/doc/rfc: VM_BIND feature design document
  2022-05-23 19:05       ` [Intel-gfx] " Niranjana Vishwanathapura
@ 2022-05-23 19:08         ` Niranjana Vishwanathapura
  -1 siblings, 0 replies; 121+ messages in thread
From: Niranjana Vishwanathapura @ 2022-05-23 19:08 UTC (permalink / raw)
  To: Zanoni, Paulo R
  Cc: Brost, Matthew, Kenneth Graunke, intel-gfx, Wilson, Chris P,
	Hellstrom, Thomas, dri-devel, jason, Vetter, Daniel,
	christian.koenig

On Mon, May 23, 2022 at 12:05:05PM -0700, Niranjana Vishwanathapura wrote:
>On Thu, May 19, 2022 at 03:52:01PM -0700, Zanoni, Paulo R wrote:
>>On Tue, 2022-05-17 at 11:32 -0700, Niranjana Vishwanathapura wrote:
>>>VM_BIND design document with description of intended use cases.
>>>
>>>v2: Add more documentation and format as per review comments
>>>    from Daniel.
>>>
>>>Signed-off-by: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>
>>>---
>>>
>>>diff --git a/Documentation/gpu/rfc/i915_vm_bind.rst b/Documentation/gpu/rfc/i915_vm_bind.rst
>>>new file mode 100644
>>>index 000000000000..f1be560d313c
>>>--- /dev/null
>>>+++ b/Documentation/gpu/rfc/i915_vm_bind.rst
>>>@@ -0,0 +1,304 @@
>>>+==========================================
>>>+I915 VM_BIND feature design and use cases
>>>+==========================================
>>>+
>>>+VM_BIND feature
>>>+================
>>>+DRM_I915_GEM_VM_BIND/UNBIND ioctls allow UMDs to bind/unbind GEM buffer
>>>+objects (BOs) or sections of BOs at specified GPU virtual addresses on a
>>>+specified address space (VM). These mappings (also referred to as persistent
>>>+mappings) will be persistent across multiple GPU submissions (execbuff calls)
>>>+issued by the UMD, without the user having to provide a list of all required
>>>+mappings during each submission (as required by the older execbuff mode).
>>>+
>>>+VM_BIND/UNBIND ioctls will support 'in' and 'out' fences to allow userspace
>>>+to specify how the binding/unbinding should sync with other operations
>>>+like the GPU job submission. These fences will be timeline 'drm_syncobj's
>>>+for non-Compute contexts (See struct drm_i915_vm_bind_ext_timeline_fences).
>>>+For Compute contexts, they will be user/memory fences (See struct
>>>+drm_i915_vm_bind_ext_user_fence).
>>>+
>>>+VM_BIND feature is advertised to user via I915_PARAM_HAS_VM_BIND.
>>>+User has to opt-in for VM_BIND mode of binding for an address space (VM)
>>>+during VM creation time via I915_VM_CREATE_FLAGS_USE_VM_BIND extension.
>>>+
>>>+VM_BIND/UNBIND ioctl will immediately start binding/unbinding the mapping in an
>>>+async worker. The binding and unbinding will work like a special GPU engine.
>>>+The binding and unbinding operations are serialized and will wait on specified
>>>+input fences before the operation and will signal the output fences upon the
>>>+completion of the operation. Due to serialization, completion of an operation
>>>+will also indicate that all previous operations are also complete.
>>>+
>>>+VM_BIND features include:
>>>+
>>>+* Multiple Virtual Address (VA) mappings can map to the same physical pages
>>>+  of an object (aliasing).
>>>+* VA mapping can map to a partial section of the BO (partial binding).
>>>+* Support capture of persistent mappings in the dump upon GPU error.
>>>+* TLB is flushed upon unbind completion. Batching of TLB flushes in some
>>>+  use cases will be helpful.
>>>+* Asynchronous vm_bind and vm_unbind support with 'in' and 'out' fences.
>>>+* Support for userptr gem objects (no special uapi is required for this).
>>>+
>>>+Execbuff ioctl in VM_BIND mode
>>>+-------------------------------
>>>+The execbuff ioctl handling in VM_BIND mode differs significantly from the
>>>+older method. A VM in VM_BIND mode will not support older execbuff mode of
>>>+binding. In VM_BIND mode, execbuff ioctl will not accept any execlist. Hence,
>>>+no support for implicit sync. It is expected that the below work will be able
>>>+to support requirements of object dependency setting in all use cases:
>>>+
>>>+"dma-buf: Add an API for exporting sync files"
>>>+(https://lwn.net/Articles/859290/)
>>
>>I would really like to have more details here. The link provided points
>>to new ioctls and we're not very familiar with those yet, so I think
>>you should really clarify the interaction between the new additions
>>here. Having some sample code would be really nice too.
>>
>>For Mesa at least (and I believe for the other drivers too) we always
>>have a few exported buffers in every execbuf call, and we rely on the
>>implicit synchronization provided by execbuf to make sure everything
>>works. The execbuf ioctl also has some code to flush caches during
>>implicit synchronization AFAIR, so I would guess we rely on it too and
>>whatever else the Kernel does. Is that covered by the new ioctls?
>>
>>In addition, as far as I remember, one of the big improvements of
>>vm_bind was that it would help reduce ioctl latency and cpu overhead.
>>But if making execbuf faster comes at the cost of requiring additional
>>ioctl calls for implicit synchronization, which is required on every
>>execbuf call, then I wonder if we'll even get any faster at all.
>>Comparing old execbuf vs plain new execbuf without the new required
>>ioctls won't make sense.
>>
>>But maybe I'm wrong and we won't need to call these new ioctls around
>>every single execbuf ioctl we submit? Again, more clarification and
>>some code examples here would be really nice. This is a big change on
>>an important part of the API, we should clarify the new expected usage.
>>
>
>Thanks Paulo for the comments.
>
>In VM_BIND mode, the only reason we would need execlist support in the
>execbuff path is for implicit synchronization. And AFAIK, this work
>from Jason is expected to replace implicit synchronization with new ioctls.
>Hence, VM_BIND mode will not be needing execlist support at all.
>
>Based on comments from Daniel and my offline sync with Jason, this
>new mechanism from Jason is expected to work for vl. For gl, there is a
>question of whether it will be performant or not. But it is worth trying
>that first. If it is not performant for gl, only then can we consider
>adding implicit sync support back for VM_BIND mode.
>
>Daniel, Jason, Ken, any thoughts you can add here?

CC'ing Ken.

>
>>>+
>>>+This also means, we need an execbuff extension to pass in the batch
>>>+buffer addresses (See struct drm_i915_gem_execbuffer_ext_batch_addresses).
>>>+
>>>+If at all execlist support in execbuff ioctl is deemed necessary for
>>>+implicit sync in certain use cases, then support can be added later.
>>
>>IMHO we really need to sort this and check all the assumptions before
>>we commit to any interface. Again, implicit synchronization is
>>something we rely on during *every* execbuf ioctl for most workloads.
>>
>
>Daniel's earlier feedback was that it is worth Mesa trying this new
>mechanism for gl and seeing if it works. We want to avoid adding execlist
>support for implicit sync in vm_bind mode from the beginning if it is
>going to be deemed unnecessary.
>
>>
>>>+In VM_BIND mode, VA allocation is completely managed by the user instead of
>>>+the i915 driver. Hence, VA assignment and eviction are not applicable in
>>>+VM_BIND mode. Also, for determining object activeness, VM_BIND mode will not
>>>+be using the i915_vma active reference tracking. It will instead use dma-resv
>>>+object for that (See `VM_BIND dma_resv usage`_).
>>>+
>>>+So, a lot of existing code in the execbuff path like relocations, VA evictions,
>>>+vma lookup table, implicit sync, vma active reference tracking etc., is not
>>>+applicable in VM_BIND mode. Hence, the execbuff path needs to be cleaned up
>>>+by clearly separating out the functionalities where the VM_BIND mode differs
>>>+from older method and they should be moved to separate files.
>>
>>I seem to recall some conversations where we were told a bunch of
>>ioctls would stop working or make no sense to call when using vm_bind.
>>Can we please get a complete list of those? Bonus points if the Kernel
>>starts telling us we just called something that makes no sense.
>>
>
>Which ioctls are you talking about here?
>We do not support GEM_WAIT ioctls, but that is only for compute mode (which is
>already documented in this patch).
>
>>>+
>>>+VM_PRIVATE objects
>>>+-------------------
>>>+By default, BOs can be mapped on multiple VMs and can also be dma-buf
>>>+exported. Hence these BOs are referred to as Shared BOs.
>>>+During each execbuff submission, the request fence must be added to the
>>>+dma-resv fence list of all shared BOs mapped on the VM.
>>>+
>>>+VM_BIND feature introduces an optimization where user can create BO which
>>>+is private to a specified VM via I915_GEM_CREATE_EXT_VM_PRIVATE flag during
>>>+BO creation. Unlike Shared BOs, these VM private BOs can only be mapped on
>>>+the VM they are private to and can't be dma-buf exported.
>>>+All private BOs of a VM share the dma-resv object. Hence during each execbuff
>>>+submission, they need only one dma-resv fence list updated. Thus, the fast
>>>+path (where required mappings are already bound) submission latency is O(1)
>>>+w.r.t the number of VM private BOs.
>>
>>I know we already discussed this, but just to document it publicly: the
>>ideal case for user space would be that every BO is created as private
>>but then we'd have an ioctl to convert it to non-private (without the
>>need to have a non-private->private interface).
>>
>>An explanation on why we can't have an ioctl to mark as exported a
>>buffer that was previously vm_private would be really appreciated.
>>
>
>Ok, I can add some notes on that.
>The reason is that this requires changing the dma-resv object of the gem
>object, and hence the object locking as well. This will add complications,
>as we have to sync with any pending operations. It might be easier for
>UMDs to do it themselves by copying the object contents to a new object.
>
>Niranjana
>
>>Thanks,
>>Paulo
>>
>>
>>>+
>>>+VM_BIND locking hierarchy
>>>+-------------------------
>>>+The locking design here supports the older (execlist based) execbuff mode, the
>>>+newer VM_BIND mode, the VM_BIND mode with GPU page faults and possible future
>>>+system allocator support (See `Shared Virtual Memory (SVM) support`_).
>>>+The older execbuff mode and the newer VM_BIND mode without page faults manage
>>>+residency of backing storage using dma_fence. The VM_BIND mode with page faults
>>>+and the system allocator support do not use any dma_fence at all.
>>>+
>>>+VM_BIND locking order is as below.
>>>+
>>>+1) Lock-A: A vm_bind mutex will protect vm_bind lists. This lock is taken in
>>>+   vm_bind/vm_unbind ioctl calls, in the execbuff path and while releasing the
>>>+   mapping.
>>>+
>>>+   In future, when GPU page faults are supported, we can potentially use a
>>>+   rwsem instead, so that multiple page fault handlers can take the read side
>>>+   lock to lookup the mapping and hence can run in parallel.
>>>+   The older execbuff mode of binding does not need this lock.
>>>+
>>>+2) Lock-B: The object's dma-resv lock will protect i915_vma state and needs to
>>>+   be held while binding/unbinding a vma in the async worker and while updating
>>>+   dma-resv fence list of an object. Note that private BOs of a VM will all
>>>+   share a dma-resv object.
>>>+
>>>+   The future system allocator support will use the HMM prescribed locking
>>>+   instead.
>>>+
>>>+3) Lock-C: Spinlock/s to protect some of the VM's lists like the list of
>>>+   invalidated vmas (due to eviction and userptr invalidation) etc.
>>>+
>>>+When GPU page faults are supported, the execbuff path does not take any of these
>>>+locks. There we will simply smash the new batch buffer address into the ring and
>>>+then tell the scheduler to run that. The lock taking only happens from the page
>>>+fault handler, where we take lock-A in read mode, whichever lock-B we need to
>>>+find the backing storage (dma_resv lock for gem objects, and hmm/core mm for
>>>+system allocator) and some additional locks (lock-D) for taking care of page
>>>+table races. Page fault mode should not need to ever manipulate the vm lists,
>>>+so won't ever need lock-C.
>>>+
>>>+VM_BIND LRU handling
>>>+---------------------
>>>+We need to ensure VM_BIND mapped objects are properly LRU tagged to avoid
>>>+performance degradation. We will also need support for bulk LRU movement of
>>>+VM_BIND objects to avoid additional latencies in execbuff path.
>>>+
>>>+The page table pages are similar to VM_BIND mapped objects (See
>>>+`Evictable page table allocations`_) and are maintained per VM and need to
>>>+be pinned in memory when VM is made active (ie., upon an execbuff call with
>>>+that VM). So, bulk LRU movement of page table pages is also needed.
>>>+
>>>+The i915 shrinker LRU has stopped being an LRU. So, it should also be moved
>>>+over to the ttm LRU in some fashion to make sure we once again have a reasonable
>>>+and consistent memory aging and reclaim architecture.
>>>+
>>>+VM_BIND dma_resv usage
>>>+-----------------------
>>>+Fences need to be added to all VM_BIND mapped objects. During each execbuff
>>>+submission, they are added with DMA_RESV_USAGE_BOOKKEEP usage to prevent
>>>+over sync (See enum dma_resv_usage). One can override it with either
>>>+DMA_RESV_USAGE_READ or DMA_RESV_USAGE_WRITE usage during object dependency
>>>+setting (either through explicit or implicit mechanism).
>>>+
>>>+When vm_bind is called for a non-private object while the VM is already
>>>+active, the fences need to be copied from VM's shared dma-resv object
>>>+(common to all private objects of the VM) to this non-private object.
>>>+If this results in performance degradation, then some optimization will
>>>+be needed here. This is not a problem for VM's private objects as they use
>>>+shared dma-resv object which is always updated on each execbuff submission.
>>>+
>>>+Also, in VM_BIND mode, use dma-resv apis for determining object activeness
>>>+(See dma_resv_test_signaled() and dma_resv_wait_timeout()) and do not use the
>>>+older i915_vma active reference tracking which is deprecated. This should be
>>>+easier to get it working with the current TTM backend. We can remove the
>>>+i915_vma active reference tracking fully while supporting TTM backend for igfx.
>>>+
>>>+Evictable page table allocations
>>>+---------------------------------
>>>+Make pagetable allocations evictable and manage them similar to VM_BIND
>>>+mapped objects. Page table pages are similar to persistent mappings of a
>>>+VM (the differences being that the page table pages will not have an i915_vma
>>>+structure and that, after swapping pages back in, the parent page link needs
>>>+to be updated).
>>>+
>>>+Mesa use case
>>>+--------------
>>>+VM_BIND can potentially reduce the CPU overhead in Mesa (both Vulkan and Iris),
>>>+hence improving performance of CPU-bound applications. It also allows us to
>>>+implement Vulkan's Sparse Resources. With increasing GPU hardware performance,
>>>+reducing CPU overhead becomes more impactful.
>>>+
>>>+
>>>+VM_BIND Compute support
>>>+========================
>>>+
>>>+User/Memory Fence
>>>+------------------
>>>+The idea is to take a user specified virtual address and install an interrupt
>>>+handler to wake up the current task when the memory location passes the user
>>>+supplied filter. A user/memory fence is an <address, value> pair. To signal the
>>>+user fence, the specified value will be written at the specified virtual address
>>>+and the waiting process woken up. The user can wait on a user fence with the
>>>+gem_wait_user_fence ioctl.
>>>+
>>>+It also allows the user to emit their own MI_FLUSH/PIPE_CONTROL notify
>>>+interrupt within their batches after updating the value to have sub-batch
>>>+precision on the wakeup. Each batch can signal a user fence to indicate
>>>+the completion of the next-level batch. The completion of the very first level
>>>+batch needs to be signaled by the command streamer. The user must provide the
>>>+user/memory fence for this via the DRM_I915_GEM_EXECBUFFER_EXT_USER_FENCE
>>>+extension of execbuff ioctl, so that KMD can setup the command streamer to
>>>+signal it.
>>>+
>>>+User/Memory fence can also be supplied to the kernel driver to signal/wake up
>>>+the user process after completion of an asynchronous operation.
>>>+
>>>+When the VM_BIND ioctl is provided with a user/memory fence via the
>>>+I915_VM_BIND_EXT_USER_FENCE extension, it will be signaled upon the completion
>>>+of binding of that mapping. All async binds/unbinds are serialized, hence
>>>+signaling of the user/memory fence also indicates the completion of all previous
>>>+binds/unbinds.
>>>+
>>>+This feature will be derived from the below original work:
>>>+https://patchwork.freedesktop.org/patch/349417/
>>>+
>>>+Long running Compute contexts
>>>+------------------------------
>>>+Usage of dma-fence expects that they complete in a reasonable amount of time.
>>>+Compute on the other hand can be long running. Hence it is appropriate for
>>>+compute to use user/memory fence and dma-fence usage will be limited to
>>>+in-kernel consumption only. This requires an execbuff uapi extension to pass
>>>+in a user fence (See struct drm_i915_vm_bind_ext_user_fence). Compute must opt-in
>>>+for this mechanism with I915_CONTEXT_CREATE_FLAGS_LONG_RUNNING flag during
>>>+context creation. The dma-fence based user interfaces like gem_wait ioctl and
>>>+execbuff out fence are not allowed on long running contexts. Implicit sync is
>>>+not valid either, and is anyway not supported in VM_BIND mode.
>>>+
>>>+Where GPU page faults are not available, the kernel driver, upon buffer
>>>+invalidation, will initiate a suspend (preemption) of the long running context
>>>+with a dma-fence attached to it. Upon completion of that suspend fence, it will
>>>+finish the invalidation, revalidate the BO and then resume the compute context.
>>>+This is done by having a per-context preempt fence (also called suspend fence)
>>>+proxying as the i915_request fence. This suspend fence is enabled when someone
>>>+tries to wait on it, which then triggers the context preemption.
>>>+
>>>+As this support for context suspension using a preempt fence and the resume work
>>>+for the compute mode contexts can be tricky to get right, it is better to
>>>+add this support in the drm scheduler so that multiple drivers can make use of it.
>>>+That means, it will have a dependency on i915 drm scheduler conversion with GuC
>>>+scheduler backend. This should be fine, as the plan is to support compute mode
>>>+contexts only with GuC scheduler backend (at least initially). This is much
>>>+easier to support with VM_BIND mode compared to the current heavier execbuff
>>>+path resource attachment.
>>>+
>>>+Low Latency Submission
>>>+-----------------------
>>>+Allows compute UMD to directly submit GPU jobs instead of through execbuff
>>>+ioctl. This is made possible by VM_BIND not being synchronized against
>>>+execbuff. VM_BIND allows bind/unbind of mappings required for the directly
>>>+submitted jobs.
>>>+
>>>+Other VM_BIND use cases
>>>+========================
>>>+
>>>+Debugger
>>>+---------
>>>+With the debug event interface, a user space process (the debugger) is able to
>>>+keep track of and act upon resources created by another process (the debuggee)
>>>+and attached to the GPU via the vm_bind interface.
>>>+
>>>+GPU page faults
>>>+----------------
>>>+GPU page faults, when supported (in future), will only be supported in the
>>>+VM_BIND mode. While both the older execbuff mode and the newer VM_BIND mode of
>>>+binding will require using dma-fence to ensure residency, the GPU page fault
>>>+mode, when supported, will not use any dma-fence as residency is purely managed
>>>+by installing and removing/invalidating page table entries.
>>>+
>>>+Page level hints settings
>>>+--------------------------
>>>+VM_BIND allows hints to be set per mapping instead of per BO.
>>>+Possible hints include read-only mapping, placement and atomicity.
>>>+Sub-BO level placement hint will be even more relevant with
>>>+upcoming GPU on-demand page fault support.
>>>+
>>>+Page level Cache/CLOS settings
>>>+-------------------------------
>>>+VM_BIND allows cache/CLOS settings per mapping instead of per BO.
>>>+
>>>+Shared Virtual Memory (SVM) support
>>>+------------------------------------
>>>+VM_BIND interface can be used to map system memory directly (without gem BO
>>>+abstraction) using the HMM interface. SVM is only supported with GPU page
>>>+faults enabled.
>>>+
>>>+
>>>+Broader i915 cleanups
>>>+=====================
>>>+Supporting this whole new vm_bind mode of binding, which comes with its own
>>>+use cases and locking requirements, requires proper integration with the
>>>+existing i915 driver. This calls for some broader i915 driver
>>>+cleanups/simplifications for maintainability of the driver going forward.
>>>+Here are a few things that have been identified and are being looked into.
>>>+
>>>+- Remove the vma lookup cache (eb->gem_context->handles_vma). The VM_BIND
>>>+  feature does not use it, and the complexity it brings in is probably more
>>>+  than the performance advantage we get in the legacy execbuff case.
>>>+- Remove vma->open_count counting
>>>+- Remove i915_vma active reference tracking. The VM_BIND feature will not be
>>>+  using it. Instead, use the underlying BO's dma-resv fence list to determine
>>>+  if an i915_vma is active or not.
>>>+
>>>+
>>>+VM_BIND UAPI
>>>+=============
>>>+
>>>+.. kernel-doc:: Documentation/gpu/rfc/i915_vm_bind.h
>>>diff --git a/Documentation/gpu/rfc/index.rst b/Documentation/gpu/rfc/index.rst
>>>index 91e93a705230..7d10c36b268d 100644
>>>--- a/Documentation/gpu/rfc/index.rst
>>>+++ b/Documentation/gpu/rfc/index.rst
>>>@@ -23,3 +23,7 @@ host such documentation:
>>> .. toctree::
>>>
>>>     i915_scheduler.rst
>>>+
>>>+.. toctree::
>>>+
>>>+    i915_vm_bind.rst
>>

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [Intel-gfx] [RFC v3 1/3] drm/doc/rfc: VM_BIND feature design document
@ 2022-05-23 19:08         ` Niranjana Vishwanathapura
  0 siblings, 0 replies; 121+ messages in thread
From: Niranjana Vishwanathapura @ 2022-05-23 19:08 UTC (permalink / raw)
  To: Zanoni, Paulo R
  Cc: Kenneth Graunke, intel-gfx, Wilson, Chris P, Hellstrom, Thomas,
	dri-devel, Vetter, Daniel, christian.koenig

On Mon, May 23, 2022 at 12:05:05PM -0700, Niranjana Vishwanathapura wrote:
>On Thu, May 19, 2022 at 03:52:01PM -0700, Zanoni, Paulo R wrote:
>>On Tue, 2022-05-17 at 11:32 -0700, Niranjana Vishwanathapura wrote:
>>>VM_BIND design document with description of intended use cases.
>>>
>>>v2: Add more documentation and format as per review comments
>>>    from Daniel.
>>>
>>>Signed-off-by: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>
>>>---
>>>
>>>diff --git a/Documentation/gpu/rfc/i915_vm_bind.rst b/Documentation/gpu/rfc/i915_vm_bind.rst
>>>new file mode 100644
>>>index 000000000000..f1be560d313c
>>>--- /dev/null
>>>+++ b/Documentation/gpu/rfc/i915_vm_bind.rst
>>>@@ -0,0 +1,304 @@
>>>+==========================================
>>>+I915 VM_BIND feature design and use cases
>>>+==========================================
>>>+
>>>+VM_BIND feature
>>>+================
>>>+DRM_I915_GEM_VM_BIND/UNBIND ioctls allow UMDs to bind/unbind GEM buffer
>>>+objects (BOs) or sections of BOs at specified GPU virtual addresses on a
>>>+specified address space (VM). These mappings (also referred to as persistent
>>>+mappings) will be persistent across multiple GPU submissions (execbuff calls)
>>>+issued by the UMD, without the user having to provide a list of all required
>>>+mappings during each submission (as required by the older execbuff mode).
>>>+
>>>+VM_BIND/UNBIND ioctls will support 'in' and 'out' fences to allow userspace
>>>+to specify how the binding/unbinding should sync with other operations
>>>+like the GPU job submission. These fences will be timeline 'drm_syncobj's
>>>+for non-Compute contexts (See struct drm_i915_vm_bind_ext_timeline_fences).
>>>+For Compute contexts, they will be user/memory fences (See struct
>>>+drm_i915_vm_bind_ext_user_fence).
>>>+
>>>+VM_BIND feature is advertised to user via I915_PARAM_HAS_VM_BIND.
>>>+User has to opt-in for VM_BIND mode of binding for an address space (VM)
>>>+during VM creation time via I915_VM_CREATE_FLAGS_USE_VM_BIND extension.
>>>+
>>>+VM_BIND/UNBIND ioctl will immediately start binding/unbinding the mapping in an
>>>+async worker. The binding and unbinding will work like a special GPU engine.
>>>+The binding and unbinding operations are serialized and will wait on specified
>>>+input fences before the operation and will signal the output fences upon the
>>>+completion of the operation. Due to serialization, completion of an operation
>>>+will also indicate that all previous operations are also complete.
>>>+
>>>+VM_BIND features include:
>>>+
>>>+* Multiple Virtual Address (VA) mappings can map to the same physical pages
>>>+  of an object (aliasing).
>>>+* VA mapping can map to a partial section of the BO (partial binding).
>>>+* Support capture of persistent mappings in the dump upon GPU error.
>>>+* TLB is flushed upon unbind completion. Batching of TLB flushes in some
>>>+  use cases will be helpful.
>>>+* Asynchronous vm_bind and vm_unbind support with 'in' and 'out' fences.
>>>+* Support for userptr gem objects (no special uapi is required for this).
>>>+
>>>+Execbuff ioctl in VM_BIND mode
>>>+-------------------------------
>>>+The execbuff ioctl handling in VM_BIND mode differs significantly from the
>>>+older method. A VM in VM_BIND mode will not support older execbuff mode of
>>>+binding. In VM_BIND mode, execbuff ioctl will not accept any execlist. Hence,
>>>+no support for implicit sync. It is expected that the below work will be able
>>>+to support requirements of object dependency setting in all use cases:
>>>+
>>>+"dma-buf: Add an API for exporting sync files"
>>>+(https://lwn.net/Articles/859290/)
>>
>>I would really like to have more details here. The link provided points
>>to new ioctls and we're not very familiar with those yet, so I think
>>you should really clarify the interaction between the new additions
>>here. Having some sample code would be really nice too.
>>
>>For Mesa at least (and I believe for the other drivers too) we always
>>have a few exported buffers in every execbuf call, and we rely on the
>>implicit synchronization provided by execbuf to make sure everything
>>works. The execbuf ioctl also has some code to flush caches during
>>implicit synchronization AFAIR, so I would guess we rely on it too and
>>whatever else the Kernel does. Is that covered by the new ioctls?
>>
>>In addition, as far as I remember, one of the big improvements of
>>vm_bind was that it would help reduce ioctl latency and cpu overhead.
>>But if making execbuf faster comes at the cost of requiring additional
>>ioctl calls for implicit synchronization, which is required on every
>>execbuf call, then I wonder if we'll even get any faster at all.
>>Comparing old execbuf vs plain new execbuf without the new required
>>ioctls won't make sense.
>>
>>But maybe I'm wrong and we won't need to call these new ioctls around
>>every single execbuf ioctl we submit? Again, more clarification and
>>some code examples here would be really nice. This is a big change on
>>an important part of the API, we should clarify the new expected usage.
>>
>
>Thanks Paulo for the comments.
>
>In VM_BIND mode, the only reason we would need execlist support in the
>execbuff path is for implicit synchronization. And AFAIK, this work
>from Jason is expected to replace implicit synchronization with new ioctls.
>Hence, VM_BIND mode will not be needing execlist support at all.
>
>Based on comments from Daniel and my offline sync with Jason, this
>new mechanism from Jason is expected to work for vl. For gl, there is a
>question of whether it will be performant or not. But it is worth trying
>that first. If it is not performant for gl, only then can we consider
>adding implicit sync support back for VM_BIND mode.
>
>Daniel, Jason, Ken, any thoughts you can add here?

CC'ing Ken.

>
>>>+
>>>+This also means, we need an execbuff extension to pass in the batch
>>>+buffer addresses (See struct drm_i915_gem_execbuffer_ext_batch_addresses).
>>>+
>>>+If at all execlist support in execbuff ioctl is deemed necessary for
>>>+implicit sync in certain use cases, then support can be added later.
>>
>>IMHO we really need to sort this and check all the assumptions before
>>we commit to any interface. Again, implicit synchronization is
>>something we rely on during *every* execbuf ioctl for most workloads.
>>
>
>Daniel's earlier feedback was that it is worth Mesa trying this new
>mechanism for gl and seeing if it works. We want to avoid adding execlist
>support for implicit sync in vm_bind mode from the beginning if it is
>going to be deemed unnecessary.
>
>>
>>>+In VM_BIND mode, VA allocation is completely managed by the user instead of
>>>+the i915 driver. Hence, VA assignment and eviction are not applicable in
>>>+VM_BIND mode. Also, for determining object activeness, VM_BIND mode will not
>>>+be using the i915_vma active reference tracking. It will instead use dma-resv
>>>+object for that (See `VM_BIND dma_resv usage`_).
>>>+
>>>+So, a lot of existing code in the execbuff path like relocations, VA evictions,
>>>+vma lookup table, implicit sync, vma active reference tracking etc., is not
>>>+applicable in VM_BIND mode. Hence, the execbuff path needs to be cleaned up
>>>+by clearly separating out the functionalities where the VM_BIND mode differs
>>>+from older method and they should be moved to separate files.
>>
>>I seem to recall some conversations where we were told a bunch of
>>ioctls would stop working or make no sense to call when using vm_bind.
>>Can we please get a complete list of those? Bonus points if the Kernel
>>starts telling us we just called something that makes no sense.
>>
>
>Which ioctls are you talking about here?
>We do not support GEM_WAIT ioctls, but that is only for compute mode (which is
>already documented in this patch).
>
>>>+
>>>+VM_PRIVATE objects
>>>+-------------------
>>>+By default, BOs can be mapped on multiple VMs and can also be dma-buf
>>>+exported. Hence these BOs are referred to as Shared BOs.
>>>+During each execbuff submission, the request fence must be added to the
>>>+dma-resv fence list of all shared BOs mapped on the VM.
>>>+
>>>+VM_BIND feature introduces an optimization where user can create BO which
>>>+is private to a specified VM via I915_GEM_CREATE_EXT_VM_PRIVATE flag during
>>>+BO creation. Unlike Shared BOs, these VM private BOs can only be mapped on
>>>+the VM they are private to and can't be dma-buf exported.
>>>+All private BOs of a VM share the dma-resv object. Hence during each execbuff
>>>+submission, they need only one dma-resv fence list updated. Thus, the fast
>>>+path (where required mappings are already bound) submission latency is O(1)
>>>+w.r.t the number of VM private BOs.
>>
>>I know we already discussed this, but just to document it publicly: the
>>ideal case for user space would be that every BO is created as private
>>but then we'd have an ioctl to convert it to non-private (without the
>>need to have a non-private->private interface).
>>
>>An explanation on why we can't have an ioctl to mark as exported a
>>buffer that was previously vm_private would be really appreciated.
>>
>
>Ok, I can add some notes on that.
>The reason is that this requires changing the dma-resv object of the gem
>object, and hence the object locking as well. This will add complications,
>as we have to sync with any pending operations. It might be easier for
>UMDs to do it themselves by copying the object contents to a new object.
>
>Niranjana
>
>>Thanks,
>>Paulo
>>
>>
>>>+
>>>+VM_BIND locking hierarchy
>>>+-------------------------
>>>+The locking design here supports the older (execlist based) execbuff mode, the
>>>+newer VM_BIND mode, the VM_BIND mode with GPU page faults and possible future
>>>+system allocator support (See `Shared Virtual Memory (SVM) support`_).
>>>+The older execbuff mode and the newer VM_BIND mode without page faults manage
>>>+residency of backing storage using dma_fence. The VM_BIND mode with page faults
>>>+and the system allocator support do not use any dma_fence at all.
>>>+
>>>+VM_BIND locking order is as below.
>>>+
>>>+1) Lock-A: A vm_bind mutex will protect vm_bind lists. This lock is taken in
>>>+   vm_bind/vm_unbind ioctl calls, in the execbuff path and while releasing the
>>>+   mapping.
>>>+
>>>+   In future, when GPU page faults are supported, we can potentially use a
>>>+   rwsem instead, so that multiple page fault handlers can take the read side
>>>+   lock to lookup the mapping and hence can run in parallel.
>>>+   The older execbuff mode of binding does not need this lock.
>>>+
>>>+2) Lock-B: The object's dma-resv lock will protect i915_vma state and needs to
>>>+   be held while binding/unbinding a vma in the async worker and while updating
>>>+   dma-resv fence list of an object. Note that private BOs of a VM will all
>>>+   share a dma-resv object.
>>>+
>>>+   The future system allocator support will use the HMM prescribed locking
>>>+   instead.
>>>+
>>>+3) Lock-C: Spinlock/s to protect some of the VM's lists like the list of
>>>+   invalidated vmas (due to eviction and userptr invalidation) etc.
>>>+
>>>+When GPU page faults are supported, the execbuff path does not take any of these
>>>+locks. There we will simply smash the new batch buffer address into the ring and
>>>+then tell the scheduler to run that. The lock taking only happens from the page
>>>+fault handler, where we take lock-A in read mode, whichever lock-B we need to
>>>+find the backing storage (dma_resv lock for gem objects, and hmm/core mm for
>>>+system allocator) and some additional locks (lock-D) for taking care of page
>>>+table races. Page fault mode should not need to ever manipulate the vm lists,
>>>+so won't ever need lock-C.
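As a rough kernel-side sketch of that nesting on the vm_bind ioctl path
(all names here are illustrative, not existing i915 code):

	mutex_lock(&vm->vm_bind_lock);        /* Lock-A: vm_bind lists */
	dma_resv_lock(obj->base.resv, NULL);  /* Lock-B: shared resv for
					       * VM-private BOs */
	/* ... bind/unbind the vma, update i915_vma state ... */
	spin_lock(&vm->invalidated_lock);     /* Lock-C: VM's vma lists */
	/* ... move the vma on/off the invalidated list ... */
	spin_unlock(&vm->invalidated_lock);
	dma_resv_unlock(obj->base.resv);
	mutex_unlock(&vm->vm_bind_lock);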
>>>+
>>>+VM_BIND LRU handling
>>>+---------------------
>>>+We need to ensure VM_BIND mapped objects are properly LRU tagged to avoid
>>>+performance degradation. We will also need support for bulk LRU movement of
>>>+VM_BIND objects to avoid additional latencies in execbuff path.
>>>+
>>>+The page table pages are similar to VM_BIND mapped objects (See
>>>+`Evictable page table allocations`_) and are maintained per VM and need to
>>>+be pinned in memory when the VM is made active (ie., upon an execbuff call with
>>>+that VM). So, bulk LRU movement of page table pages is also needed.
>>>+
>>>+The i915 shrinker LRU has stopped being an LRU. So, it should also be moved
>>>+over to the ttm LRU in some fashion to make sure we once again have a reasonable
>>>+and consistent memory aging and reclaim architecture.
>>>+
>>>+VM_BIND dma_resv usage
>>>+-----------------------
>>>+Fences need to be added to all VM_BIND mapped objects. During each execbuff
>>>+submission, they are added with DMA_RESV_USAGE_BOOKKEEP usage to prevent
>>>+over sync (See enum dma_resv_usage). One can override it with either
>>>+DMA_RESV_USAGE_READ or DMA_RESV_USAGE_WRITE usage during object dependency
>>>+setting (either through explicit or implicit mechanism).
>>>+
>>>+When vm_bind is called for a non-private object while the VM is already
>>>+active, the fences need to be copied from the VM's shared dma-resv object
>>>+(common to all private objects of the VM) to this non-private object.
>>>+If this results in performance degradation, then some optimization will
>>>+be needed here. This is not a problem for the VM's private objects as they use
>>>+the shared dma-resv object which is always updated on each execbuff submission.
>>>+
>>>+Also, in VM_BIND mode, use the dma-resv APIs for determining object activeness
>>>+(See dma_resv_test_signaled() and dma_resv_wait_timeout()) and do not use the
>>>+older i915_vma active reference tracking, which is deprecated. This should be
>>>+easier to get working with the current TTM backend. We can remove the
>>>+i915_vma active reference tracking fully while supporting TTM backend for igfx.
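The fast path fence update could then look like this kernel-side sketch
(illustrative; vm->root_obj as holder of the shared dma-resv is an
assumed name, not existing code):

	/* One update covers all VM-private BOs, since they share this resv. */
	dma_resv_lock(vm->root_obj->base.resv, NULL);
	err = dma_resv_reserve_fences(vm->root_obj->base.resv, 1);
	if (!err)
		dma_resv_add_fence(vm->root_obj->base.resv, &rq->fence,
				   DMA_RESV_USAGE_BOOKKEEP);
	dma_resv_unlock(vm->root_obj->base.resv);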
>>>+
>>>+Evictable page table allocations
>>>+---------------------------------
>>>+Make pagetable allocations evictable and manage them similar to VM_BIND
>>>+mapped objects. Page table pages are similar to persistent mappings of a
>>>+VM (difference here are that the page table pages will not have an i915_vma
>>>+structure and after swapping pages back in, parent page link needs to be
>>>+updated).
>>>+
>>>+Mesa use case
>>>+--------------
>>>+VM_BIND can potentially reduce the CPU overhead in Mesa (both Vulkan and Iris),
>>>+hence improving performance of CPU-bound applications. It also allows us to
>>>+implement Vulkan's Sparse Resources. With increasing GPU hardware performance,
>>>+reducing CPU overhead becomes more impactful.
>>>+
>>>+
>>>+VM_BIND Compute support
>>>+========================
>>>+
>>>+User/Memory Fence
>>>+------------------
>>>+The idea is to take a user specified virtual address and install an interrupt
>>>+handler to wake up the current task when the memory location passes the user
>>>+supplied filter. A User/Memory fence is an <address, value> pair. To signal the
>>>+user fence, the specified value will be written at the specified virtual address
>>>+and the waiting process woken up. The user can wait on a user fence with the
>>>+gem_wait_user_fence ioctl.
>>>+
>>>+It also allows the user to emit their own MI_FLUSH/PIPE_CONTROL notify
>>>+interrupt within their batches after updating the value to have sub-batch
>>>+precision on the wakeup. Each batch can signal a user fence to indicate
>>>+the completion of the next level batch. The completion of the very first level
>>>+batch needs to be signaled by the command streamer. The user must provide the
>>>+user/memory fence for this via the DRM_I915_GEM_EXECBUFFER_EXT_USER_FENCE
>>>+extension of execbuff ioctl, so that KMD can setup the command streamer to
>>>+signal it.
>>>+
>>>+User/Memory fence can also be supplied to the kernel driver to signal/wake up
>>>+the user process after completion of an asynchronous operation.
>>>+
>>>+When the VM_BIND ioctl is provided with a user/memory fence via the
>>>+I915_VM_BIND_EXT_USER_FENCE extension, it will be signaled upon the completion
>>>+of binding of that mapping. All async binds/unbinds are serialized, hence
>>>+signaling of the user/memory fence also indicates the completion of all previous
>>>+binds/unbinds.
>>>+
>>>+This feature will be derived from the below original work:
>>>+https://patchwork.freedesktop.org/patch/349417/
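A userspace wait for such a fence could look like the sketch below,
using the drm_i915_gem_wait_user_fence ioctl proposed in patch 3 of this
series (error handling elided):

#include <sys/ioctl.h>
#include <drm/i915_drm.h>

static int wait_user_fence_eq(int fd, __u64 addr, __u64 value, __u32 ctx_id)
{
	struct drm_i915_gem_wait_user_fence wait = {
		.addr = addr,                  /* qword aligned fence address */
		.ctx_id = ctx_id,              /* context expected to signal */
		.op = I915_UFENCE_WAIT_EQ,
		.value = value,
		.mask = I915_UFENCE_WAIT_U64,  /* compare the full qword */
		.timeout = 1000000000,         /* 1s, relative to now */
	};
	return ioctl(fd, DRM_IOCTL_I915_GEM_WAIT_USER_FENCE, &wait);
}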
>>>+
>>>+Long running Compute contexts
>>>+------------------------------
>>>+Usage of dma-fence expects that they complete in a reasonable amount of time.
>>>+Compute, on the other hand, can be long running. Hence it is appropriate for
>>>+compute to use user/memory fences, and dma-fence usage will be limited to
>>>+in-kernel consumption only. This requires an execbuff uapi extension to pass
>>>+in user fence (See struct drm_i915_vm_bind_ext_user_fence). Compute must opt-in
>>>+for this mechanism with I915_CONTEXT_CREATE_FLAGS_LONG_RUNNING flag during
>>>+context creation. The dma-fence based user interfaces like gem_wait ioctl and
>>>+execbuff out fence are not allowed on long running contexts. Implicit sync is
>>>+not valid as well and is anyway not supported in VM_BIND mode.
>>>+
>>>+Where GPU page faults are not available, the kernel driver, upon buffer
>>>+invalidation, will initiate a suspend (preemption) of the long running context
>>>+with a dma-fence attached to it. Upon completion of that suspend fence, it will
>>>+finish the invalidation, revalidate the BO and resume the compute context. This is
>>>+done by having a per-context preempt fence (also called suspend fence) proxying
>>>+as i915_request fence. This suspend fence is enabled when someone tries to wait
>>>+on it, which then triggers the context preemption.
>>>+
>>>+As this support for context suspension using a preempt fence and the resume work
>>>+for the compute mode contexts can be tricky to get right, it is better to
>>>+add this support in drm scheduler so that multiple drivers can make use of it.
>>>+That means, it will have a dependency on i915 drm scheduler conversion with GuC
>>>+scheduler backend. This should be fine, as the plan is to support compute mode
>>>+contexts only with GuC scheduler backend (at least initially). This is much
>>>+easier to support with VM_BIND mode compared to the current heavier execbuff
>>>+path resource attachment.
>>>+
>>>+Low Latency Submission
>>>+-----------------------
>>>+Allows the compute UMD to directly submit GPU jobs instead of going through
>>>+the execbuff ioctl. This is made possible by VM_BIND not being synchronized
>>>+against execbuff. VM_BIND allows bind/unbind of the mappings required for the
>>>+directly submitted jobs.
>>>+
>>>+Other VM_BIND use cases
>>>+========================
>>>+
>>>+Debugger
>>>+---------
>>>+With the debug event interface, a user space process (the debugger) is able to
>>>+keep track of and act upon resources created by another process (the debuggee)
>>>+and attached to the GPU via the vm_bind interface.
>>>+
>>>+GPU page faults
>>>+----------------
>>>+GPU page faults, when supported (in future), will only be supported in the
>>>+VM_BIND mode. While both the older execbuff mode and the newer VM_BIND mode of
>>>+binding will require using dma-fence to ensure residency, the GPU page faults
>>>+mode when supported, will not use any dma-fence as residency is purely managed
>>>+by installing and removing/invalidating page table entries.
>>>+
>>>+Page level hints settings
>>>+--------------------------
>>>+VM_BIND allows any hints setting per mapping instead of per BO.
>>>+Possible hints include read-only mapping, placement and atomicity.
>>>+Sub-BO level placement hint will be even more relevant with
>>>+upcoming GPU on-demand page fault support.
>>>+
>>>+Page level Cache/CLOS settings
>>>+-------------------------------
>>>+VM_BIND allows cache/CLOS settings per mapping instead of per BO.
>>>+
>>>+Shared Virtual Memory (SVM) support
>>>+------------------------------------
>>>+VM_BIND interface can be used to map system memory directly (without gem BO
>>>+abstraction) using the HMM interface. SVM is only supported with GPU page
>>>+faults enabled.
>>>+
>>>+
>>>+Broader i915 cleanups
>>>+=====================
>>>+Supporting this whole new vm_bind mode of binding which comes with its own
>>>+use cases to support and the locking requirements requires proper integration
>>>+with the existing i915 driver. This calls for some broader i915 driver
>>>+cleanups/simplifications for maintainability of the driver going forward.
>>>+Here are a few things identified that are being looked into.
>>>+
>>>+- Remove vma lookup cache (eb->gem_context->handles_vma). The VM_BIND feature
>>>+  does not use it, and the complexity it brings in is probably more than the
>>>+  performance advantage we get in the legacy execbuff case.
>>>+- Remove vma->open_count counting
>>>+- Remove i915_vma active reference tracking. The VM_BIND feature will not be
>>>+  using it. Instead, use the underlying BO's dma-resv fence list to determine
>>>+  if an i915_vma is active or not.
>>>+
>>>+
>>>+VM_BIND UAPI
>>>+=============
>>>+
>>>+.. kernel-doc:: Documentation/gpu/rfc/i915_vm_bind.h
>>>diff --git a/Documentation/gpu/rfc/index.rst b/Documentation/gpu/rfc/index.rst
>>>index 91e93a705230..7d10c36b268d 100644
>>>--- a/Documentation/gpu/rfc/index.rst
>>>+++ b/Documentation/gpu/rfc/index.rst
>>>@@ -23,3 +23,7 @@ host such documentation:
>>> .. toctree::
>>>
>>>     i915_scheduler.rst
>>>+
>>>+.. toctree::
>>>+
>>>+    i915_vm_bind.rst
>>

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [Intel-gfx] [RFC v3 3/3] drm/doc/rfc: VM_BIND uapi definition
  2022-05-19 23:07   ` [Intel-gfx] " Zanoni, Paulo R
@ 2022-05-23 19:19     ` Niranjana Vishwanathapura
  2022-06-01  9:02       ` Dave Airlie
  0 siblings, 1 reply; 121+ messages in thread
From: Niranjana Vishwanathapura @ 2022-05-23 19:19 UTC (permalink / raw)
  To: Zanoni, Paulo R
  Cc: intel-gfx, dri-devel, Hellstrom, Thomas, Wilson, Chris P, Vetter,
	Daniel, christian.koenig

On Thu, May 19, 2022 at 04:07:30PM -0700, Zanoni, Paulo R wrote:
>On Tue, 2022-05-17 at 11:32 -0700, Niranjana Vishwanathapura wrote:
>> VM_BIND and related uapi definitions
>>
>> v2: Ensure proper kernel-doc formatting with cross references.
>>     Also add new uapi and documentation as per review comments
>>     from Daniel.
>>
>> Signed-off-by: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>
>> ---
>>  Documentation/gpu/rfc/i915_vm_bind.h | 399 +++++++++++++++++++++++++++
>>  1 file changed, 399 insertions(+)
>>  create mode 100644 Documentation/gpu/rfc/i915_vm_bind.h
>>
>> diff --git a/Documentation/gpu/rfc/i915_vm_bind.h b/Documentation/gpu/rfc/i915_vm_bind.h
>> new file mode 100644
>> index 000000000000..589c0a009107
>> --- /dev/null
>> +++ b/Documentation/gpu/rfc/i915_vm_bind.h
>> @@ -0,0 +1,399 @@
>> +/* SPDX-License-Identifier: MIT */
>> +/*
>> + * Copyright © 2022 Intel Corporation
>> + */
>> +
>> +/**
>> + * DOC: I915_PARAM_HAS_VM_BIND
>> + *
>> + * VM_BIND feature availability.
>> + * See typedef drm_i915_getparam_t param.
>> + */
>> +#define I915_PARAM_HAS_VM_BIND               57
>> +
>> +/**
>> + * DOC: I915_VM_CREATE_FLAGS_USE_VM_BIND
>> + *
>> + * Flag to opt-in for VM_BIND mode of binding during VM creation.
>> + * See struct drm_i915_gem_vm_control flags.
>> + *
>> + * A VM in VM_BIND mode will not support the older execbuff mode of binding.
>> + * In VM_BIND mode, execbuff ioctl will not accept any execlist (ie., the
>> + * &drm_i915_gem_execbuffer2.buffer_count must be 0).
>> + * Also, &drm_i915_gem_execbuffer2.batch_start_offset and
>> + * &drm_i915_gem_execbuffer2.batch_len must be 0.
>> + * DRM_I915_GEM_EXECBUFFER_EXT_BATCH_ADDRESSES extension must be provided
>> + * to pass in the batch buffer addresses.
>> + *
>> + * Additionally, I915_EXEC_NO_RELOC, I915_EXEC_HANDLE_LUT and
>> + * I915_EXEC_BATCH_FIRST of &drm_i915_gem_execbuffer2.flags must be 0
>> + * (not used) in VM_BIND mode. I915_EXEC_USE_EXTENSIONS flag must always be
>> + * set (See struct drm_i915_gem_execbuffer_ext_batch_addresses).
>> + * The buffers_ptr, buffer_count, batch_start_offset and batch_len fields
>> + * of struct drm_i915_gem_execbuffer2 are also not used and must be 0.
>> + */
>
>From that description, it seems we have:
>
>struct drm_i915_gem_execbuffer2 {
>        __u64 buffers_ptr;              -> must be 0 (new)
>        __u32 buffer_count;             -> must be 0 (new)
>        __u32 batch_start_offset;       -> must be 0 (new)
>        __u32 batch_len;                -> must be 0 (new)
>        __u32 DR1;                      -> must be 0 (old)
>        __u32 DR4;                      -> must be 0 (old)
>        __u32 num_cliprects; (fences)   -> must be 0 since using extensions
>        __u64 cliprects_ptr; (fences, extensions) -> contains an actual pointer!
>        __u64 flags;                    -> some flags must be 0 (new)
>        __u64 rsvd1; (context info)     -> repurposed field (old)
>        __u64 rsvd2;                    -> unused
>};
>
>Based on that, why can't we just get drm_i915_gem_execbuffer3 instead
>of adding even more complexity to an already abused interface? While
>the Vulkan-like extension thing is really nice, I don't think what
>we're doing here is extending the ioctl usage, we're completely
>changing how the base struct should be interpreted based on how the VM
>was created (which is an entirely different ioctl).
>
>From Rusty Russell's API Design grading, drm_i915_gem_execbuffer2 is
>already at -6 without these changes. I think after vm_bind we'll need
>to create a -11 entry just to deal with this ioctl.
>

The only change here is removing the execlist support for VM_BIND
mode (other than natural extensions).
Adding a new execbuffer3 was considered, but I think we need to be careful
with that, as it goes beyond the VM_BIND support, including any future
requirements (as we don't want an execbuffer4 after VM_BIND).

Niranjana

>
>+#define I915_VM_CREATE_FLAGS_USE_VM_BIND       (1 << 0)
>+
>+/**
>+ * DOC: I915_CONTEXT_CREATE_FLAGS_LONG_RUNNING
>+ *
>+ * Flag to declare context as long running.
>+ * See struct drm_i915_gem_context_create_ext flags.
>+ *
>+ * Usage of dma-fence expects that they complete in a reasonable amount of time.
>+ * Compute on the other hand can be long running. Hence it is not appropriate
>+ * for compute contexts to export request completion dma-fence to user.
>+ * The dma-fence usage will be limited to in-kernel consumption only.
>+ * Compute contexts need to use user/memory fence.
>+ *
>+ * So, long running contexts do not support output fences. Hence,
>+ * I915_EXEC_FENCE_OUT (See &drm_i915_gem_execbuffer2.flags and
>+ * I915_EXEC_FENCE_SIGNAL (See &drm_i915_gem_exec_fence.flags) are expected
>+ * to be not used.
>+ *
>+ * DRM_I915_GEM_WAIT ioctl call is also not supported for objects mapped
>+ * to long running contexts.
>+ */
>+#define I915_CONTEXT_CREATE_FLAGS_LONG_RUNNING   (1u << 2)
>+
>+/* VM_BIND related ioctls */
>+#define DRM_I915_GEM_VM_BIND           0x3d
>+#define DRM_I915_GEM_VM_UNBIND         0x3e
>+#define DRM_I915_GEM_WAIT_USER_FENCE   0x3f
>+
>+#define DRM_IOCTL_I915_GEM_VM_BIND             DRM_IOWR(DRM_COMMAND_BASE + DRM_I915_GEM_VM_BIND, struct drm_i915_gem_vm_bind)
>+#define DRM_IOCTL_I915_GEM_VM_UNBIND           DRM_IOWR(DRM_COMMAND_BASE + DRM_I915_GEM_VM_UNBIND, struct drm_i915_gem_vm_bind)
>+#define DRM_IOCTL_I915_GEM_WAIT_USER_FENCE     DRM_IOWR(DRM_COMMAND_BASE + DRM_I915_GEM_WAIT_USER_FENCE, struct drm_i915_gem_wait_user_fence)
>+
>+/**
>+ * struct drm_i915_gem_vm_bind - VA to object mapping to bind.
>+ *
>+ * This structure is passed to VM_BIND ioctl and specifies the mapping of GPU
>+ * virtual address (VA) range to the section of an object that should be bound
>+ * in the device page table of the specified address space (VM).
>+ * The VA range specified must be unique (ie., not currently bound) and can
>+ * be mapped to whole object or a section of the object (partial binding).
>+ * Multiple VA mappings can be created to the same section of the object
>+ * (aliasing).
>+ */
>+struct drm_i915_gem_vm_bind {
>+       /** @vm_id: VM (address space) id to bind */
>+       __u32 vm_id;
>+
>+       /** @handle: Object handle */
>+       __u32 handle;
>+
>+       /** @start: Virtual Address start to bind */
>+       __u64 start;
>+
>+       /** @offset: Offset in object to bind */
>+       __u64 offset;
>+
>+       /** @length: Length of mapping to bind */
>+       __u64 length;
>+
>+       /**
>+        * @flags: Supported flags are,
>+        *
>+        * I915_GEM_VM_BIND_READONLY:
>+        * Mapping is read-only.
>+        *
>+        * I915_GEM_VM_BIND_CAPTURE:
>+        * Capture this mapping in the dump upon GPU error.
>+        */
>+       __u64 flags;
>+#define I915_GEM_VM_BIND_READONLY    (1 << 0)
>+#define I915_GEM_VM_BIND_CAPTURE     (1 << 1)
>+
>+       /** @extensions: 0-terminated chain of extensions for this mapping. */
>+       __u64 extensions;
>+};
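A usage sketch (error handling and includes elided), binding a whole BO
at a UMD-chosen VA:

static int bind_whole_bo(int fd, __u32 vm_id, __u32 handle,
			 __u64 gpu_va, __u64 size)
{
	struct drm_i915_gem_vm_bind bind = {
		.vm_id = vm_id,
		.handle = handle,
		.start = gpu_va,    /* VA is managed entirely by the UMD */
		.offset = 0,        /* map from the start of the object */
		.length = size,     /* whole-object (non-partial) binding */
	};
	return ioctl(fd, DRM_IOCTL_I915_GEM_VM_BIND, &bind);
}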
>+
>+/**
>+ * struct drm_i915_gem_vm_unbind - VA to object mapping to unbind.
>+ *
>+ * This structure is passed to VM_UNBIND ioctl and specifies the GPU virtual
>+ * address (VA) range that should be unbound from the device page table of the
>+ * specified address space (VM). The specified VA range must match one of the
>+ * mappings created with the VM_BIND ioctl. TLB is flushed upon unbind
>+ * completion.
>+ */
>+struct drm_i915_gem_vm_unbind {
>+       /** @vm_id: VM (address space) id to bind */
>+       __u32 vm_id;
>+
>+       /** @rsvd: Reserved for future use; must be zero. */
>+       __u32 rsvd;
>+
>+       /** @start: Virtual Address start to unbind */
>+       __u64 start;
>+
>+       /** @length: Length of mapping to unbind */
>+       __u64 length;
>+
>+       /** @flags: reserved for future usage, currently MBZ */
>+       __u64 flags;
>+
>+       /** @extensions: 0-terminated chain of extensions for this mapping. */
>+       __u64 extensions;
>+};
>+
>+/**
>+ * struct drm_i915_vm_bind_fence - An input or output fence for the vm_bind
>+ * or the vm_unbind work.
>+ *
>+ * The vm_bind or vm_unbind async worker will wait for the input fence to signal
>+ * before starting the binding or unbinding.
>+ *
>+ * The vm_bind or vm_unbind async worker will signal the returned output fence
>+ * after the completion of binding or unbinding.
>+ */
>+struct drm_i915_vm_bind_fence {
>+       /** @handle: User's handle for a drm_syncobj to wait on or signal. */
>+       __u32 handle;
>+
>+       /**
>+        * @flags: Supported flags are,
>+        *
>+        * I915_VM_BIND_FENCE_WAIT:
>+        * Wait for the input fence before binding/unbinding
>+        *
>+        * I915_VM_BIND_FENCE_SIGNAL:
>+        * Return bind/unbind completion fence as output
>+        */
>+       __u32 flags;
>+#define I915_VM_BIND_FENCE_WAIT            (1<<0)
>+#define I915_VM_BIND_FENCE_SIGNAL          (1<<1)
>+#define __I915_VM_BIND_FENCE_UNKNOWN_FLAGS (-(I915_VM_BIND_FENCE_SIGNAL << 1))
>+};
>+
>+/**
>+ * struct drm_i915_vm_bind_ext_timeline_fences - Timeline fences for vm_bind
>+ * and vm_unbind.
>+ *
>+ * This structure describes an array of timeline drm_syncobj and associated
>+ * points for timeline variants of drm_syncobj. These timeline 'drm_syncobj's
>+ * can be input or output fences (See struct drm_i915_vm_bind_fence).
>+ */
>+struct drm_i915_vm_bind_ext_timeline_fences {
>+#define I915_VM_BIND_EXT_TIMELINE_FENCES       0
>+       /** @base: Extension link. See struct i915_user_extension. */
>+       struct i915_user_extension base;
>+
>+       /**
>+        * @fence_count: Number of elements in the @handles_ptr & @value_ptr
>+        * arrays.
>+        */
>+       __u64 fence_count;
>+
>+       /**
>+        * @handles_ptr: Pointer to an array of struct drm_i915_vm_bind_fence
>+        * of length @fence_count.
>+        */
>+       __u64 handles_ptr;
>+
>+       /**
>+        * @values_ptr: Pointer to an array of u64 values of length
>+        * @fence_count.
>+        * Values must be 0 for a binary drm_syncobj. A value of 0 for a
>+        * timeline drm_syncobj is invalid as it turns a drm_syncobj into a
>+        * binary one.
>+        */
>+       __u64 values_ptr;
>+};
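A sketch of requesting a bind-completion out-fence via this extension,
chained onto the drm_i915_gem_vm_bind struct above (syncobj_handle is
assumed to come from drmSyncobjCreate(); includes and error handling
elided):

	struct drm_i915_vm_bind_fence out_fence = {
		.handle = syncobj_handle,
		.flags = I915_VM_BIND_FENCE_SIGNAL, /* out-fence */
	};
	__u64 point = 0; /* 0 => binary syncobj */
	struct drm_i915_vm_bind_ext_timeline_fences fences = {
		.base.name = I915_VM_BIND_EXT_TIMELINE_FENCES,
		.fence_count = 1,
		.handles_ptr = (__u64)(uintptr_t)&out_fence,
		.values_ptr = (__u64)(uintptr_t)&point,
	};

	bind.extensions = (__u64)(uintptr_t)&fences;
	ioctl(fd, DRM_IOCTL_I915_GEM_VM_BIND, &bind);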
>+
>+/**
>+ * struct drm_i915_vm_bind_user_fence - An input or output user fence for the
>+ * vm_bind or the vm_unbind work.
>+ *
>+ * The vm_bind or vm_unbind async worker will wait for the input fence (value at
>+ * @addr to become equal to @val) before starting the binding or unbinding.
>+ *
>+ * The vm_bind or vm_unbind async worker will signal the output fence after
>+ * the completion of binding or unbinding by writing @val to memory location at
>+ * @addr
>+ */
>+struct drm_i915_vm_bind_user_fence {
>+       /** @addr: User/Memory fence qword aligned process virtual address */
>+       __u64 addr;
>+
>+       /** @val: User/Memory fence value to be written after bind completion */
>+       __u64 val;
>+
>+       /**
>+        * @flags: Supported flags are,
>+        *
>+        * I915_VM_BIND_USER_FENCE_WAIT:
>+        * Wait for the input fence before binding/unbinding
>+        *
>+        * I915_VM_BIND_USER_FENCE_SIGNAL:
>+        * Return bind/unbind completion fence as output
>+        */
>+       __u32 flags;
>+#define I915_VM_BIND_USER_FENCE_WAIT            (1<<0)
>+#define I915_VM_BIND_USER_FENCE_SIGNAL          (1<<1)
>+#define __I915_VM_BIND_USER_FENCE_UNKNOWN_FLAGS \
>+       (-(I915_VM_BIND_USER_FENCE_SIGNAL << 1))
>+};
>+
>+/**
>+ * struct drm_i915_vm_bind_ext_user_fence - User/memory fences for vm_bind
>+ * and vm_unbind.
>+ *
>+ * These user fences can be input or output fences
>+ * (See struct drm_i915_vm_bind_user_fence).
>+ */
>+struct drm_i915_vm_bind_ext_user_fence {
>+#define I915_VM_BIND_EXT_USER_FENCES   1
>+       /** @base: Extension link. See struct i915_user_extension. */
>+       struct i915_user_extension base;
>+
>+       /** @fence_count: Number of elements in the @user_fence_ptr array. */
>+       __u64 fence_count;
>+
>+       /**
>+        * @user_fence_ptr: Pointer to an array of
>+        * struct drm_i915_vm_bind_user_fence of length @fence_count.
>+        */
>+       __u64 user_fence_ptr;
>+};
>+
>+/**
>+ * struct drm_i915_gem_execbuffer_ext_batch_addresses - Array of batch buffer
>+ * gpu virtual addresses.
>+ *
>+ * In the execbuff ioctl (See struct drm_i915_gem_execbuffer2), this extension
>+ * must always be appended in the VM_BIND mode and it will be an error to
>+ * append this extension in older non-VM_BIND mode.
>+ */
>+struct drm_i915_gem_execbuffer_ext_batch_addresses {
>+#define DRM_I915_GEM_EXECBUFFER_EXT_BATCH_ADDRESSES    1
>+       /** @base: Extension link. See struct i915_user_extension. */
>+       struct i915_user_extension base;
>+
>+       /** @count: Number of addresses in the addr array. */
>+       __u32 count;
>+
>+       /** @addr: An array of batch gpu virtual addresses. */
>+       __u64 addr[0];
>+};
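Tying it together, a VM_BIND mode submission would then look roughly
like the sketch below; the local wrapper struct mirrors the extension
above to carry a single address, and ctx_id names a context whose VM was
created with I915_VM_CREATE_FLAGS_USE_VM_BIND (includes and error
handling elided):

static int submit_vm_bind(int fd, __u32 ctx_id, __u64 batch_gpu_va)
{
	struct {
		struct i915_user_extension base;
		__u32 count;
		__u64 addr[1]; /* one batch address */
	} batch_ext = {
		.base.name = DRM_I915_GEM_EXECBUFFER_EXT_BATCH_ADDRESSES,
		.count = 1,
		.addr = { batch_gpu_va }, /* VA previously bound via VM_BIND */
	};
	struct drm_i915_gem_execbuffer2 execbuf = {
		.flags = I915_EXEC_USE_EXTENSIONS, /* no execlist, no relocs */
		.cliprects_ptr = (__u64)(uintptr_t)&batch_ext,
		.rsvd1 = ctx_id, /* context id, as with execbuf2 today */
	};
	return ioctl(fd, DRM_IOCTL_I915_GEM_EXECBUFFER2, &execbuf);
}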
>+
>+/**
>+ * struct drm_i915_gem_execbuffer_ext_user_fence - First level batch completion
>+ * signaling extension.
>+ *
>+ * This extension allows the user to attach a user fence (@addr, @value pair) to
>+ * an execbuf, to be signaled by the command streamer after the completion of the
>+ * first level batch, by writing the @value at the specified @addr and triggering an
>+ * interrupt.
>+ * The user can either poll for this user fence to signal or wait on it with the
>+ * i915_gem_wait_user_fence ioctl.
>+ * This is very useful for long running contexts, where waiting on a dma-fence
>+ * by the user (like the i915_gem_wait ioctl) is not supported.
>+ */
>+struct drm_i915_gem_execbuffer_ext_user_fence {
>+#define DRM_I915_GEM_EXECBUFFER_EXT_USER_FENCE         2
>+       /** @base: Extension link. See struct i915_user_extension. */
>+       struct i915_user_extension base;
>+
>+       /**
>+        * @addr: User/Memory fence qword aligned GPU virtual address.
>+        *
>+        * Address has to be a valid GPU virtual address at the time of
>+        * first level batch completion.
>+        */
>+       __u64 addr;
>+
>+       /**
>+        * @value: User/Memory fence Value to be written to above address
>+        * after first level batch completes.
>+        */
>+       __u64 value;
>+
>+       /** @rsvd: Reserved for future extensions, MBZ */
>+       __u64 rsvd;
>+};
>+
>+/**
>+ * struct drm_i915_gem_create_ext_vm_private - Extension to make the object
>+ * private to the specified VM.
>+ *
>+ * See struct drm_i915_gem_create_ext.
>+ */
>+struct drm_i915_gem_create_ext_vm_private {
>+#define I915_GEM_CREATE_EXT_VM_PRIVATE         2
>+       /** @base: Extension link. See struct i915_user_extension. */
>+       struct i915_user_extension base;
>+
>+       /** @vm_id: Id of the VM to which the object is private */
>+       __u32 vm_id;
>+};
>+
>+/**
>+ * struct drm_i915_gem_wait_user_fence - Wait on user/memory fence.
>+ *
>+ * User/Memory fence can be woken up either by:
>+ *
>+ * 1. GPU context indicated by @ctx_id, or,
>+ * 2. Kernel driver async worker upon I915_UFENCE_WAIT_SOFT.
>+ *    @ctx_id is ignored when this flag is set.
>+ *
>+ * Wakeup condition is,
>+ * ``((*addr & mask) op (value & mask))``
>+ *
>+ * See :ref:`Documentation/driver-api/dma-buf.rst <indefinite_dma_fences>`
>+ */
>+struct drm_i915_gem_wait_user_fence {
>+       /** @extensions: Zero-terminated chain of extensions. */
>+       __u64 extensions;
>+
>+       /** @addr: User/Memory fence address */
>+       __u64 addr;
>+
>+       /** @ctx_id: Id of the Context which will signal the fence. */
>+       __u32 ctx_id;
>+
>+       /** @op: Wakeup condition operator */
>+       __u16 op;
>+#define I915_UFENCE_WAIT_EQ      0
>+#define I915_UFENCE_WAIT_NEQ     1
>+#define I915_UFENCE_WAIT_GT      2
>+#define I915_UFENCE_WAIT_GTE     3
>+#define I915_UFENCE_WAIT_LT      4
>+#define I915_UFENCE_WAIT_LTE     5
>+#define I915_UFENCE_WAIT_BEFORE  6
>+#define I915_UFENCE_WAIT_AFTER   7
>+
>+       /**
>+        * @flags: Supported flags are,
>+        *
>+        * I915_UFENCE_WAIT_SOFT:
>+        *
>+        * To be woken up by i915 driver async worker (not by GPU).
>+        *
>+        * I915_UFENCE_WAIT_ABSTIME:
>+        *
>+        * Wait timeout specified as absolute time.
>+        */
>+       __u16 flags;
>+#define I915_UFENCE_WAIT_SOFT    0x1
>+#define I915_UFENCE_WAIT_ABSTIME 0x2
>+
>+       /** @value: Wakeup value */
>+       __u64 value;
>+
>+       /** @mask: Wakeup mask */
>+       __u64 mask;
>+#define I915_UFENCE_WAIT_U8     0xffu
>+#define I915_UFENCE_WAIT_U16    0xffffu
>+#define I915_UFENCE_WAIT_U32    0xfffffffful
>+#define I915_UFENCE_WAIT_U64    0xffffffffffffffffull
>+
>+       /**
>+        * @timeout: Wait timeout in nanoseconds.
>+        *
>+        * If I915_UFENCE_WAIT_ABSTIME flag is set, then the timeout is the
>+        * absolute time in nsec.
>+        */
>+       __s64 timeout;
>+};
>

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [Intel-gfx] [RFC v3 1/3] drm/doc/rfc: VM_BIND feature design document
  2022-05-19 22:52     ` [Intel-gfx] " Zanoni, Paulo R
  (?)
  (?)
@ 2022-05-24 10:08     ` Lionel Landwerlin
  -1 siblings, 0 replies; 121+ messages in thread
From: Lionel Landwerlin @ 2022-05-24 10:08 UTC (permalink / raw)
  To: Zanoni, Paulo R, dri-devel, Vetter, Daniel, Vishwanathapura,
	Niranjana, intel-gfx
  Cc: Hellstrom, Thomas, christian.koenig, Wilson, Chris P

On 20/05/2022 01:52, Zanoni, Paulo R wrote:
> On Tue, 2022-05-17 at 11:32 -0700, Niranjana Vishwanathapura wrote:
>> VM_BIND design document with description of intended use cases.
>>
>> v2: Add more documentation and format as per review comments
>>      from Daniel.
>>
>> Signed-off-by: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>
>> ---
>>
>> diff --git a/Documentation/gpu/rfc/i915_vm_bind.rst b/Documentation/gpu/rfc/i915_vm_bind.rst
>> new file mode 100644
>> index 000000000000..f1be560d313c
>> --- /dev/null
>> +++ b/Documentation/gpu/rfc/i915_vm_bind.rst
>> @@ -0,0 +1,304 @@
>> +==========================================
>> +I915 VM_BIND feature design and use cases
>> +==========================================
>> +
>> +VM_BIND feature
>> +================
>> +DRM_I915_GEM_VM_BIND/UNBIND ioctls allow the UMD to bind/unbind GEM buffer
>> +objects (BOs) or sections of a BO at specified GPU virtual addresses on a
>> +specified address space (VM). These mappings (also referred to as persistent
>> +mappings) will be persistent across multiple GPU submissions (execbuff calls)
>> +issued by the UMD, without the user having to provide a list of all required
>> +mappings during each submission (as required by the older execbuff mode).
>> +
>> +VM_BIND/UNBIND ioctls will support 'in' and 'out' fences to allow userspace
>> +to specify how the binding/unbinding should sync with other operations
>> +like the GPU job submission. These fences will be timeline 'drm_syncobj's
>> +for non-Compute contexts (See struct drm_i915_vm_bind_ext_timeline_fences).
>> +For Compute contexts, they will be user/memory fences (See struct
>> +drm_i915_vm_bind_ext_user_fence).
>> +
>> +VM_BIND feature is advertised to user via I915_PARAM_HAS_VM_BIND.
>> +The user has to opt in to the VM_BIND mode of binding for an address space (VM)
>> +at VM creation time via the I915_VM_CREATE_FLAGS_USE_VM_BIND flag.
>> +
>> +VM_BIND/UNBIND ioctl will immediately start binding/unbinding the mapping in an
>> +async worker. The binding and unbinding will work like a special GPU engine.
>> +The binding and unbinding operations are serialized and will wait on specified
>> +input fences before the operation and will signal the output fences upon the
>> +completion of the operation. Due to serialization, completion of an operation
>> +also indicates that all previous operations are complete.
>> +
>> +VM_BIND features include:
>> +
>> +* Multiple Virtual Address (VA) mappings can map to the same physical pages
>> +  of an object (aliasing).
>> +* VA mapping can map to a partial section of the BO (partial binding).
>> +* Support capture of persistent mappings in the dump upon GPU error.
>> +* TLB is flushed upon unbind completion. Batching of TLB flushes in some
>> +  use cases will be helpful.
>> +* Asynchronous vm_bind and vm_unbind support with 'in' and 'out' fences.
>> +* Support for userptr gem objects (no special uapi is required for this).
>> +
>> +Execbuff ioctl in VM_BIND mode
>> +-------------------------------
>> +The execbuff ioctl handling in VM_BIND mode differs significantly from the
>> +older method. A VM in VM_BIND mode will not support the older execbuff mode of
>> +binding. In VM_BIND mode, the execbuff ioctl will not accept any execlist; hence,
>> +there is no support for implicit sync. It is expected that the below work will
>> +be able to support the requirements of object dependency setting in all use cases:
>> +
>> +"dma-buf: Add an API for exporting sync files"
>> +(https://lwn.net/Articles/859290/)
> I would really like to have more details here. The link provided points
> to new ioctls and we're not very familiar with those yet, so I think
> you should really clarify the interaction between the new additions
> here. Having some sample code would be really nice too.
>
> For Mesa at least (and I believe for the other drivers too) we always
> have a few exported buffers in every execbuf call, and we rely on the
> implicit synchronization provided by execbuf to make sure everything
> works. The execbuf ioctl also has some code to flush caches during
> implicit synchronization AFAIR, so I would guess we rely on it too and
> whatever else the Kernel does. Is that covered by the new ioctls?
>
> In addition, as far as I remember, one of the big improvements of
> vm_bind was that it would help reduce ioctl latency and cpu overhead.
> But if making execbuf faster comes at the cost of requiring additional
> ioctls calls for implicit synchronization, which is required on ever
> execbuf call, then I wonder if we'll even get any faster at all.
> Comparing old execbuf vs plain new execbuf without the new required
> ioctls won't make sense.
> But maybe I'm wrong and we won't need to call these new ioctls around
> every single execbuf ioctl we submit? Again, more clarification and
> some code examples here would be really nice. This is a big change on
> an important part of the API, we should clarify the new expected usage.


Hey Paulo,


I think in the case of X11/Wayland, we'll be doing 1 or 2 extra ioctls 
per frame which seems pretty reasonable.

Essentially we need to set the dependencies on the buffer we're going to
tell the display engine (gnome-shell/kde/bare-display-hw) to use.


In the Vulkan case, we're trading building execbuffer lists of 
potentially thousands of buffers for every single submission versus 1 or 
2 ioctls for a single item when doing vkQueuePresent() (which happens 
less often than we do execbuffer ioctls).

That seems like a good trade off and doesn't look like a lot more work 
than explicit fencing where we would have to send associated fences.
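For reference, a sketch of what those per-frame ioctls would look like
with the dma-buf sync-file uapi from the series linked above (names per
that proposal; dmabuf_fd and render_sync_file_fd are assumed to exist;
includes and error handling elided):

	/* Attach our rendering's sync_file to the dma-buf, so implicit-sync
	 * consumers (e.g. the compositor) wait for rendering to finish. */
	struct dma_buf_import_sync_file import_args = {
		.flags = DMA_BUF_SYNC_WRITE,
		.fd = render_sync_file_fd, /* e.g. an execbuf out-fence */
	};
	ioctl(dmabuf_fd, DMA_BUF_IOCTL_IMPORT_SYNC_FILE, &import_args);

	/* Conversely, fetch the buffer's current fences as a sync_file to
	 * wait on (or pass as an execbuf in-fence) before reusing it. */
	struct dma_buf_export_sync_file export_args = {
		.flags = DMA_BUF_SYNC_READ,
	};
	ioctl(dmabuf_fd, DMA_BUF_IOCTL_EXPORT_SYNC_FILE, &export_args);
	/* export_args.fd now holds the sync_file */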


Here is the Mesa MR associated with this : 
https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/4037


-Lionel


>
>> +
>> +This also means, we need an execbuff extension to pass in the batch
>> +buffer addresses (See struct drm_i915_gem_execbuffer_ext_batch_addresses).
>> +
>> +If execlist support in the execbuff ioctl is deemed necessary at all for
>> +implicit sync in certain use cases, then support can be added later.
> IMHO we really need to sort this and check all the assumptions before
> we commit to any interface. Again, implicit synchronization is
> something we rely on during *every* execbuf ioctl for most workloads.
>
>
>> +In VM_BIND mode, VA allocation is completely managed by the user instead of
>> +the i915 driver. Hence, VA assignment and eviction are not applicable in
>> +VM_BIND mode. Also, for determining object activeness, VM_BIND mode will not
>> +be using the i915_vma active reference tracking. It will instead use dma-resv
>> +object for that (See `VM_BIND dma_resv usage`_).
>> +
>> +So, a lot of existing code in the execbuff path like relocations, VA evictions,
>> +vma lookup table, implicit sync, vma active reference tracking etc., is not
>> +applicable in VM_BIND mode. Hence, the execbuff path needs to be cleaned up
>> +by clearly separating out the functionalities where the VM_BIND mode differs
>> +from older method and they should be moved to separate files.
> I seem to recall some conversations where we were told a bunch of
> ioctls would stop working or make no sense to call when using vm_bind.
> Can we please get a complete list of those? Bonus points if the Kernel
> starts telling us we just called something that makes no sense.
>
>> +
>> +VM_PRIVATE objects
>> +-------------------
>> +By default, BOs can be mapped on multiple VMs and can also be dma-buf
>> +exported. Hence these BOs are referred to as Shared BOs.
>> +During each execbuff submission, the request fence must be added to the
>> +dma-resv fence list of all shared BOs mapped on the VM.
>> +
>> +The VM_BIND feature introduces an optimization where the user can create a
>> +BO which is private to a specified VM via the I915_GEM_CREATE_EXT_VM_PRIVATE
>> +flag during BO creation. Unlike shared BOs, these VM private BOs can only be
>> +mapped on the VM they are private to and can't be dma-buf exported.
>> +All private BOs of a VM share the dma-resv object. Hence during each execbuff
>> +submission, they need only one dma-resv fence list updated. Thus, the fast
>> +path (where required mappings are already bound) submission latency is O(1)
>> +w.r.t the number of VM private BOs.
> I know we already discussed this, but just to document it publicly: the
> ideal case for user space would be that every BO is created as private
> but then we'd have an ioctl to convert it to non-private (without the
> need to have a non-private->private interface).
>
> An explanation on why we can't have an ioctl to mark as exported a
> buffer that was previously vm_private would be really appreciated.
>
> Thanks,
> Paulo
>
>
>> [snip]



^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [Intel-gfx] [RFC v3 3/3] drm/doc/rfc: VM_BIND uapi definition
  2022-05-23 19:19     ` Niranjana Vishwanathapura
@ 2022-06-01  9:02       ` Dave Airlie
  2022-06-01  9:27           ` Daniel Vetter
  0 siblings, 1 reply; 121+ messages in thread
From: Dave Airlie @ 2022-06-01  9:02 UTC (permalink / raw)
  To: Niranjana Vishwanathapura
  Cc: Zanoni, Paulo R, intel-gfx, dri-devel, Hellstrom, Thomas, Wilson,
	Chris P, Vetter, Daniel, christian.koenig

On Tue, 24 May 2022 at 05:20, Niranjana Vishwanathapura
<niranjana.vishwanathapura@intel.com> wrote:
>
> On Thu, May 19, 2022 at 04:07:30PM -0700, Zanoni, Paulo R wrote:
> >On Tue, 2022-05-17 at 11:32 -0700, Niranjana Vishwanathapura wrote:
> >> VM_BIND and related uapi definitions
> >>
> >> v2: Ensure proper kernel-doc formatting with cross references.
> >>     Also add new uapi and documentation as per review comments
> >>     from Daniel.
> >>
> >> Signed-off-by: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>
> >> ---
> >>  Documentation/gpu/rfc/i915_vm_bind.h | 399 +++++++++++++++++++++++++++
> >>  1 file changed, 399 insertions(+)
> >>  create mode 100644 Documentation/gpu/rfc/i915_vm_bind.h
> >>
> >> diff --git a/Documentation/gpu/rfc/i915_vm_bind.h b/Documentation/gpu/rfc/i915_vm_bind.h
> >> new file mode 100644
> >> index 000000000000..589c0a009107
> >> --- /dev/null
> >> +++ b/Documentation/gpu/rfc/i915_vm_bind.h
> >> @@ -0,0 +1,399 @@
> >> +/* SPDX-License-Identifier: MIT */
> >> +/*
> >> + * Copyright © 2022 Intel Corporation
> >> + */
> >> +
> >> +/**
> >> + * DOC: I915_PARAM_HAS_VM_BIND
> >> + *
> >> + * VM_BIND feature availability.
> >> + * See typedef drm_i915_getparam_t param.
> >> + */
> >> +#define I915_PARAM_HAS_VM_BIND               57
> >> +
> >> +/**
> >> + * DOC: I915_VM_CREATE_FLAGS_USE_VM_BIND
> >> + *
> >> + * Flag to opt-in for VM_BIND mode of binding during VM creation.
> >> + * See struct drm_i915_gem_vm_control flags.
> >> + *
> >> + * A VM in VM_BIND mode will not support the older execbuff mode of binding.
> >> + * In VM_BIND mode, execbuff ioctl will not accept any execlist (ie., the
> >> + * &drm_i915_gem_execbuffer2.buffer_count must be 0).
> >> + * Also, &drm_i915_gem_execbuffer2.batch_start_offset and
> >> + * &drm_i915_gem_execbuffer2.batch_len must be 0.
> >> + * DRM_I915_GEM_EXECBUFFER_EXT_BATCH_ADDRESSES extension must be provided
> >> + * to pass in the batch buffer addresses.
> >> + *
> >> + * Additionally, I915_EXEC_NO_RELOC, I915_EXEC_HANDLE_LUT and
> >> + * I915_EXEC_BATCH_FIRST of &drm_i915_gem_execbuffer2.flags must be 0
> >> + * (not used) in VM_BIND mode. I915_EXEC_USE_EXTENSIONS flag must always be
> >> + * set (See struct drm_i915_gem_execbuffer_ext_batch_addresses).
> >> + * The buffers_ptr, buffer_count, batch_start_offset and batch_len fields
> >> + * of struct drm_i915_gem_execbuffer2 are also not used and must be 0.
> >> + */
> >
> >From that description, it seems we have:
> >
> >struct drm_i915_gem_execbuffer2 {
> >        __u64 buffers_ptr;              -> must be 0 (new)
> >        __u32 buffer_count;             -> must be 0 (new)
> >        __u32 batch_start_offset;       -> must be 0 (new)
> >        __u32 batch_len;                -> must be 0 (new)
> >        __u32 DR1;                      -> must be 0 (old)
> >        __u32 DR4;                      -> must be 0 (old)
> >        __u32 num_cliprects; (fences)   -> must be 0 since using extensions
> >        __u64 cliprects_ptr; (fences, extensions) -> contains an actual pointer!
> >        __u64 flags;                    -> some flags must be 0 (new)
> >        __u64 rsvd1; (context info)     -> repurposed field (old)
> >        __u64 rsvd2;                    -> unused
> >};
> >
> >Based on that, why can't we just get drm_i915_gem_execbuffer3 instead
> >of adding even more complexity to an already abused interface? While
> >the Vulkan-like extension thing is really nice, I don't think what
> >we're doing here is extending the ioctl usage, we're completely
> >changing how the base struct should be interpreted based on how the VM
> >was created (which is an entirely different ioctl).
> >
> >From Rusty Russell's API Design grading, drm_i915_gem_execbuffer2 is
> >already at -6 without these changes. I think after vm_bind we'll need
> >to create a -11 entry just to deal with this ioctl.
> >
>
> The only change here is removing the execlist support for VM_BIND
> mode (other than natural extensions).
> Adding a new execbuffer3 was considered, but I think we need to be careful
> with that as that goes beyond the VM_BIND support, including any future
> requirements (as we don't want an execbuffer4 after VM_BIND).

Why not? it's not like adding extensions here is really that different
than adding new ioctls.

I definitely think this deserves an execbuffer3 without even
considering future requirements. Just to burn down the old
requirements and pointless fields.

Make execbuffer3 be vm bind only, no relocs, no legacy bits, leave the
older sw on execbuf2 for ever.

Dave.

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [Intel-gfx] [RFC v3 3/3] drm/doc/rfc: VM_BIND uapi definition
  2022-06-01  9:02       ` Dave Airlie
@ 2022-06-01  9:27           ` Daniel Vetter
  0 siblings, 0 replies; 121+ messages in thread
From: Daniel Vetter @ 2022-06-01  9:27 UTC (permalink / raw)
  To: Dave Airlie
  Cc: Zanoni, Paulo R, intel-gfx, dri-devel, Hellstrom, Thomas, Wilson,
	Chris P, Vetter, Daniel, Niranjana Vishwanathapura,
	christian.koenig

On Wed, 1 Jun 2022 at 11:03, Dave Airlie <airlied@gmail.com> wrote:
>
> On Tue, 24 May 2022 at 05:20, Niranjana Vishwanathapura
> <niranjana.vishwanathapura@intel.com> wrote:
> >
> > On Thu, May 19, 2022 at 04:07:30PM -0700, Zanoni, Paulo R wrote:
> > >On Tue, 2022-05-17 at 11:32 -0700, Niranjana Vishwanathapura wrote:
> > >> VM_BIND and related uapi definitions
> > >>
> > >> v2: Ensure proper kernel-doc formatting with cross references.
> > >>     Also add new uapi and documentation as per review comments
> > >>     from Daniel.
> > >>
> > >> Signed-off-by: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>
> > >> ---
> > >>  Documentation/gpu/rfc/i915_vm_bind.h | 399 +++++++++++++++++++++++++++
> > >>  1 file changed, 399 insertions(+)
> > >>  create mode 100644 Documentation/gpu/rfc/i915_vm_bind.h
> > >>
> > >> diff --git a/Documentation/gpu/rfc/i915_vm_bind.h b/Documentation/gpu/rfc/i915_vm_bind.h
> > >> new file mode 100644
> > >> index 000000000000..589c0a009107
> > >> --- /dev/null
> > >> +++ b/Documentation/gpu/rfc/i915_vm_bind.h
> > >> @@ -0,0 +1,399 @@
> > >> +/* SPDX-License-Identifier: MIT */
> > >> +/*
> > >> + * Copyright © 2022 Intel Corporation
> > >> + */
> > >> +
> > >> +/**
> > >> + * DOC: I915_PARAM_HAS_VM_BIND
> > >> + *
> > >> + * VM_BIND feature availability.
> > >> + * See typedef drm_i915_getparam_t param.
> > >> + */
> > >> +#define I915_PARAM_HAS_VM_BIND               57
> > >> +
> > >> +/**
> > >> + * DOC: I915_VM_CREATE_FLAGS_USE_VM_BIND
> > >> + *
> > >> + * Flag to opt-in for VM_BIND mode of binding during VM creation.
> > >> + * See struct drm_i915_gem_vm_control flags.
> > >> + *
> > >> + * A VM in VM_BIND mode will not support the older execbuff mode of binding.
> > >> + * In VM_BIND mode, execbuff ioctl will not accept any execlist (ie., the
> > >> + * &drm_i915_gem_execbuffer2.buffer_count must be 0).
> > >> + * Also, &drm_i915_gem_execbuffer2.batch_start_offset and
> > >> + * &drm_i915_gem_execbuffer2.batch_len must be 0.
> > >> + * DRM_I915_GEM_EXECBUFFER_EXT_BATCH_ADDRESSES extension must be provided
> > >> + * to pass in the batch buffer addresses.
> > >> + *
> > >> + * Additionally, I915_EXEC_NO_RELOC, I915_EXEC_HANDLE_LUT and
> > >> + * I915_EXEC_BATCH_FIRST of &drm_i915_gem_execbuffer2.flags must be 0
> > >> + * (not used) in VM_BIND mode. I915_EXEC_USE_EXTENSIONS flag must always be
> > >> + * set (See struct drm_i915_gem_execbuffer_ext_batch_addresses).
> > >> + * The buffers_ptr, buffer_count, batch_start_offset and batch_len fields
> > >> + * of struct drm_i915_gem_execbuffer2 are also not used and must be 0.
> > >> + */
> > >
> > >From that description, it seems we have:
> > >
> > >struct drm_i915_gem_execbuffer2 {
> > >        __u64 buffers_ptr;              -> must be 0 (new)
> > >        __u32 buffer_count;             -> must be 0 (new)
> > >        __u32 batch_start_offset;       -> must be 0 (new)
> > >        __u32 batch_len;                -> must be 0 (new)
> > >        __u32 DR1;                      -> must be 0 (old)
> > >        __u32 DR4;                      -> must be 0 (old)
> > >        __u32 num_cliprects; (fences)   -> must be 0 since using extensions
> > >        __u64 cliprects_ptr; (fences, extensions) -> contains an actual pointer!
> > >        __u64 flags;                    -> some flags must be 0 (new)
> > >        __u64 rsvd1; (context info)     -> repurposed field (old)
> > >        __u64 rsvd2;                    -> unused
> > >};
> > >
> > >Based on that, why can't we just get drm_i915_gem_execbuffer3 instead
> > >of adding even more complexity to an already abused interface? While
> > >the Vulkan-like extension thing is really nice, I don't think what
> > >we're doing here is extending the ioctl usage, we're completely
> > >changing how the base struct should be interpreted based on how the VM
> > >was created (which is an entirely different ioctl).
> > >
> > >From Rusty Russell's API Design grading, drm_i915_gem_execbuffer2 is
> > >already at -6 without these changes. I think after vm_bind we'll need
> > >to create a -11 entry just to deal with this ioctl.
> > >
> >
> > The only change here is removing the execlist support for VM_BIND
> > mode (other than natural extensions).
> > Adding a new execbuffer3 was considered, but I think we need to be careful
> > with that as that goes beyond the VM_BIND support, including any future
> > requirements (as we don't want an execbuffer4 after VM_BIND).
>
> Why not? It's not like adding extensions here is really that different
> than adding new ioctls.
>
> I definitely think this deserves an execbuffer3 without even
> considering future requirements. Just  to burn down the old
> requirements and pointless fields.
>
> Make execbuffer3 be vm bind only, no relocs, no legacy bits, leave the
> older sw on execbuf2 for ever.
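
For concreteness, a from-scratch execbuf3 could plausibly be as small as
the sketch below. To be clear, the field names here are invented purely
for illustration, this is not a committed uapi:

	struct drm_i915_gem_execbuffer3 {
		__u32 ctx_id;		/* context, instead of repurposing rsvd1 */
		__u32 engine_idx;	/* engine within the context */
		__u64 batch_address;	/* GPU VA of the batch, bound via VM_BIND */
		__u64 flags;		/* MBZ for now, room to grow */
		__u64 extensions;	/* chain for in/out fences and future needs */
	};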

I guess another point in favour of execbuf3 would be that it's less
midlayer. If we share the entry point then there's quite a few vfuncs
needed to cleanly split out the vm_bind paths from the legacy
reloc/softpin paths.

If we invert this and do execbuf3, then there's the existing ioctl
vfunc, and then we share code (where it even makes sense, probably
request setup/submit need to be shared, anything else is probably
cleaner to just copypaste) with the usual helper approach.

Also that would guarantee that really none of the old concepts like
i915_active on the vma or vma open counts and all that stuff leaks
into the new vm_bind execbuf.

Finally I also think that copypasting would make backporting easier,
or at least more flexible, since it should make it easier to have the
upstream vm_bind co-exist with all the other things we have, without the
huge amounts of conflicts (or at least with much fewer) that pushing a pile
of vfuncs into the existing code would cause.

So maybe we should do this?
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [Intel-gfx] [RFC v3 1/3] drm/doc/rfc: VM_BIND feature design document
  2022-05-17 18:32   ` Niranjana Vishwanathapura
  (?)
  (?)
@ 2022-06-01 14:25   ` Lionel Landwerlin
  2022-06-01 20:28       ` Matthew Brost
  2022-06-01 21:18       ` Matthew Brost
  -1 siblings, 2 replies; 121+ messages in thread
From: Lionel Landwerlin @ 2022-06-01 14:25 UTC (permalink / raw)
  To: Niranjana Vishwanathapura, intel-gfx, dri-devel, daniel.vetter
  Cc: thomas.hellstrom, christian.koenig, chris.p.wilson

On 17/05/2022 21:32, Niranjana Vishwanathapura wrote:
> +VM_BIND/UNBIND ioctl will immediately start binding/unbinding the mapping in an
> +async worker. The binding and unbinding will work like a special GPU engine.
> +The binding and unbinding operations are serialized and will wait on specified
> +input fences before the operation and will signal the output fences upon the
> +completion of the operation. Due to serialization, completion of an operation
> +will also indicate that all previous operations are also complete.

I guess we should avoid saying "will immediately start 
binding/unbinding" if there are fences involved.

And the fact that it's happening in an async worker seems to imply it's
not immediate.


I have a question on the behavior of the bind operation when no input
fence is provided. Let's say I do:

VM_BIND (out_fence=fence1)

VM_BIND (out_fence=fence2)

VM_BIND (out_fence=fence3)


In what order are the fences going to be signaled?

In the order of VM_BIND ioctls? Or out of order?

Because you wrote "serialized", I assume it's: in order.


One thing I didn't realize is that because we only get one "VM_BIND" 
engine, there is a disconnect from the Vulkan specification.

In Vulkan VM_BIND operations are serialized but per engine.

So you could have something like this :

VM_BIND (engine=rcs0, in_fence=fence1, out_fence=fence2)

VM_BIND (engine=ccs0, in_fence=fence3, out_fence=fence4)


fence1 is not signaled

fence3 is signaled

So the second VM_BIND will proceed before the first VM_BIND.


I guess we can deal with that scenario in userspace by doing the wait
ourselves in one thread per engine.
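
Roughly, something like this hypothetical loop per engine, where
sync_wait() is libsync and the queue plumbing and the VM_BIND ioctl name
are my assumptions, not the RFC's:

	/* one binder thread per engine */
	for (;;) {
		struct bind_op *op = queue_pop(&engine->bind_queue);

		/* do the wait ourselves in userspace... */
		sync_wait(op->in_fence_fd, -1);
		/* ...then submit the bind with no in-fence attached */
		drmIoctl(drm_fd, DRM_IOCTL_I915_GEM_VM_BIND, &op->bind);
	}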

But then it makes the VM_BIND input fences useless.


Daniel: what do you think? Should we rework this or just deal with wait
fences in userspace?


Sorry I noticed this late.


-Lionel



^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [Intel-gfx] [RFC v3 1/3] drm/doc/rfc: VM_BIND feature design document
  2022-06-01 14:25   ` Lionel Landwerlin
@ 2022-06-01 20:28       ` Matthew Brost
  2022-06-01 21:18       ` Matthew Brost
  1 sibling, 0 replies; 121+ messages in thread
From: Matthew Brost @ 2022-06-01 20:28 UTC (permalink / raw)
  To: Lionel Landwerlin
  Cc: intel-gfx, dri-devel, thomas.hellstrom, chris.p.wilson,
	daniel.vetter, Niranjana Vishwanathapura, christian.koenig

On Wed, Jun 01, 2022 at 05:25:49PM +0300, Lionel Landwerlin wrote:
> On 17/05/2022 21:32, Niranjana Vishwanathapura wrote:
> > +VM_BIND/UNBIND ioctl will immediately start binding/unbinding the mapping in an
> > +async worker. The binding and unbinding will work like a special GPU engine.
> > +The binding and unbinding operations are serialized and will wait on specified
> > +input fences before the operation and will signal the output fences upon the
> > +completion of the operation. Due to serialization, completion of an operation
> > +will also indicate that all previous operations are also complete.
> 
> I guess we should avoid saying "will immediately start binding/unbinding" if
> there are fences involved.
> 
> And the fact that it's happening in an async worker seems to imply it's not
> immediate.
> 
> 
> I have a question on the behavior of the bind operation when no input fence
> is provided. Let's say I do:
> 
> VM_BIND (out_fence=fence1)
> 
> VM_BIND (out_fence=fence2)
> 
> VM_BIND (out_fence=fence3)
> 
> 
> In what order are the fences going to be signaled?
> 
> In the order of VM_BIND ioctls? Or out of order?
> 
> Because you wrote "serialized", I assume it's: in order.
> 
> 
> One thing I didn't realize is that because we only get one "VM_BIND" engine,
> there is a disconnect from the Vulkan specification.
> 
> In Vulkan VM_BIND operations are serialized but per engine.
> 
> So you could have something like this :
> 
> VM_BIND (engine=rcs0, in_fence=fence1, out_fence=fence2)
> 
> VM_BIND (engine=ccs0, in_fence=fence3, out_fence=fence4)
> 
> 
> fence1 is not signaled
> 
> fence3 is signaled
> 
> So the second VM_BIND will proceed before the first VM_BIND.
> 
> 
> I guess we can deal with that scenario in userspace by doing the wait
> ourselves in one thread per engine.
> 
> But then it makes the VM_BIND input fences useless.
> 
> 
> Daniel: what do you think? Should we rework this or just deal with wait
> fences in userspace?
> 

My opinion is to rework this, but make the ordering via an engine param optional.

e.g. A VM can be configured so all binds are ordered within the VM

e.g. A VM can be configured so all binds accept an engine argument (in
the case of the i915, likely a gem context handle) and binds are
ordered with respect to that engine.

This gives UMDs options, as the latter likely consumes more KMD resources,
so if a UMD can live with binds being ordered within the VM it can use
the mode consuming fewer resources.
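
Something like the below, where the VM_BIND_ENGINES flag name is made up
here purely to illustrate the opt-in (drm_i915_gem_vm_control being the
existing VM create struct):

	struct drm_i915_gem_vm_control vm = {
		/* default: all binds ordered within the VM */
		.flags = I915_VM_CREATE_FLAGS_USE_VM_BIND,
	};

	/* hypothetical opt-in: binds take an engine/queue argument and
	 * are only ordered against other binds on that same queue */
	vm.flags |= I915_VM_CREATE_FLAGS_VM_BIND_ENGINES;

	drmIoctl(drm_fd, DRM_IOCTL_I915_GEM_VM_CREATE, &vm);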

Matt

> 
> Sorry I noticed this late.
> 
> 
> -Lionel
> 
> 

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [Intel-gfx] [RFC v3 1/3] drm/doc/rfc: VM_BIND feature design document
  2022-06-01 14:25   ` Lionel Landwerlin
@ 2022-06-01 21:18       ` Matthew Brost
  2022-06-01 21:18       ` Matthew Brost
  1 sibling, 0 replies; 121+ messages in thread
From: Matthew Brost @ 2022-06-01 21:18 UTC (permalink / raw)
  To: Lionel Landwerlin
  Cc: intel-gfx, dri-devel, thomas.hellstrom, chris.p.wilson,
	daniel.vetter, Niranjana Vishwanathapura, christian.koenig

On Wed, Jun 01, 2022 at 05:25:49PM +0300, Lionel Landwerlin wrote:
> On 17/05/2022 21:32, Niranjana Vishwanathapura wrote:
> > +VM_BIND/UNBIND ioctl will immediately start binding/unbinding the mapping in an
> > +async worker. The binding and unbinding will work like a special GPU engine.
> > +The binding and unbinding operations are serialized and will wait on specified
> > +input fences before the operation and will signal the output fences upon the
> > +completion of the operation. Due to serialization, completion of an operation
> > +will also indicate that all previous operations are also complete.
> 
> I guess we should avoid saying "will immediately start binding/unbinding" if
> there are fences involved.
> 
> And the fact that it's happening in an async worker seems to imply it's not
> immediate.
> 
> 
> I have a question on the behavior of the bind operation when no input fence
> is provided. Let's say I do:
> 
> VM_BIND (out_fence=fence1)
> 
> VM_BIND (out_fence=fence2)
> 
> VM_BIND (out_fence=fence3)
> 
> 
> In what order are the fences going to be signaled?
> 
> In the order of VM_BIND ioctls? Or out of order?
> 
> Because you wrote "serialized", I assume it's: in order.
> 
> 
> One thing I didn't realize is that because we only get one "VM_BIND" engine,
> there is a disconnect from the Vulkan specification.
> 
> In Vulkan VM_BIND operations are serialized but per engine.
> 
> So you could have something like this :
> 
> VM_BIND (engine=rcs0, in_fence=fence1, out_fence=fence2)
> 
> VM_BIND (engine=ccs0, in_fence=fence3, out_fence=fence4)
> 

Question - let's say this is done after the above operations:

EXEC (engine=ccs0, in_fence=NULL, out_fence=NULL)

Is the exec ordered with respect to the binds (i.e. would fence3 & fence4 be
signaled before the exec starts)?

Matt

> 
> fence1 is not signaled
> 
> fence3 is signaled
> 
> So the second VM_BIND will proceed before the first VM_BIND.
> 
> 
> I guess we can deal with that scenario in userspace by doing the wait
> ourselves in one thread per engine.
> 
> But then it makes the VM_BIND input fences useless.
> 
> 
> Daniel: what do you think? Should we rework this or just deal with wait
> fences in userspace?
> 
> 
> Sorry I noticed this late.
> 
> 
> -Lionel
> 
> 

^ permalink raw reply	[flat|nested] 121+ messages in thread

* RE: [RFC v3 1/3] drm/doc/rfc: VM_BIND feature design document
  2022-05-17 18:32   ` Niranjana Vishwanathapura
@ 2022-06-02  2:13     ` Zeng, Oak
  -1 siblings, 0 replies; 121+ messages in thread
From: Zeng, Oak @ 2022-06-02  2:13 UTC (permalink / raw)
  To: Vishwanathapura, Niranjana, intel-gfx, dri-devel, Vetter,  Daniel
  Cc: Brost, Matthew, Hellstrom, Thomas, Wilson, Chris P, jason,
	christian.koenig




> -----Original Message-----
> From: dri-devel <dri-devel-bounces@lists.freedesktop.org> On Behalf Of
> Niranjana Vishwanathapura
> Sent: May 17, 2022 2:32 PM
> To: intel-gfx@lists.freedesktop.org; dri-devel@lists.freedesktop.org; Vetter,
> Daniel <daniel.vetter@intel.com>
> Cc: Brost, Matthew <matthew.brost@intel.com>; Hellstrom, Thomas
> <thomas.hellstrom@intel.com>; jason@jlekstrand.net; Wilson, Chris P
> <chris.p.wilson@intel.com>; christian.koenig@amd.com
> Subject: [RFC v3 1/3] drm/doc/rfc: VM_BIND feature design document
> 
> VM_BIND design document with description of intended use cases.
> 
> v2: Add more documentation and format as per review comments
>     from Daniel.
> 
> Signed-off-by: Niranjana Vishwanathapura
> <niranjana.vishwanathapura@intel.com>
> ---
>  Documentation/driver-api/dma-buf.rst   |   2 +
>  Documentation/gpu/rfc/i915_vm_bind.rst | 304
> +++++++++++++++++++++++++
>  Documentation/gpu/rfc/index.rst        |   4 +
>  3 files changed, 310 insertions(+)
>  create mode 100644 Documentation/gpu/rfc/i915_vm_bind.rst
> 
> diff --git a/Documentation/driver-api/dma-buf.rst b/Documentation/driver-
> api/dma-buf.rst
> index 36a76cbe9095..64cb924ec5bb 100644
> --- a/Documentation/driver-api/dma-buf.rst
> +++ b/Documentation/driver-api/dma-buf.rst
> @@ -200,6 +200,8 @@ DMA Fence uABI/Sync File
>  .. kernel-doc:: include/linux/sync_file.h
>     :internal:
> 
> +.. _indefinite_dma_fences:
> +
>  Indefinite DMA Fences
>  ~~~~~~~~~~~~~~~~~~~~~
> 
> diff --git a/Documentation/gpu/rfc/i915_vm_bind.rst
> b/Documentation/gpu/rfc/i915_vm_bind.rst
> new file mode 100644
> index 000000000000..f1be560d313c
> --- /dev/null
> +++ b/Documentation/gpu/rfc/i915_vm_bind.rst
> @@ -0,0 +1,304 @@
> +==========================================
> +I915 VM_BIND feature design and use cases
> +==========================================
> +
> +VM_BIND feature
> +================
> +DRM_I915_GEM_VM_BIND/UNBIND ioctls allows UMD to bind/unbind GEM
> buffer
> +objects (BOs) or sections of BOs at specified GPU virtual addresses on a
> +specified address space (VM). These mappings (also referred to as persistent
> +mappings) will be persistent across multiple GPU submissions (execbuff calls)
> +issued by the UMD, without user having to provide a list of all required
> +mappings during each submission (as required by older execbuff mode).
> +
> +VM_BIND/UNBIND ioctls will support 'in' and 'out' fences to allow userspace
> +to specify how the binding/unbinding should sync with other operations
> +like the GPU job submission. These fences will be timeline 'drm_syncobj's
> +for non-Compute contexts (See struct
> drm_i915_vm_bind_ext_timeline_fences).
> +For Compute contexts, they will be user/memory fences (See struct
> +drm_i915_vm_bind_ext_user_fence).
> +
> +VM_BIND feature is advertised to user via I915_PARAM_HAS_VM_BIND.
> +User has to opt-in for VM_BIND mode of binding for an address space (VM)
> +during VM creation time via I915_VM_CREATE_FLAGS_USE_VM_BIND
> extension.
> +
> +VM_BIND/UNBIND ioctl will immediately start binding/unbinding the mapping in
> an
> +async worker. The binding and unbinding will work like a special GPU engine.
> +The binding and unbinding operations are serialized and will wait on specified
> +input fences before the operation and will signal the output fences upon the
> +completion of the operation. Due to serialization, completion of an operation
> +will also indicate that all previous operations are also complete.

Hi,

Is the user required to wait for the out fence to be signaled before submitting a GPU job that uses the vm_bind address?
Or is the user required to order the GPU job so that it runs only after the vm_bind out fence is signaled?

I think there could be different behavior on a non-faultable platform and a faultable platform: on a non-faultable
platform, the GPU job is required to be ordered after the vm_bind out fence signaling; on a faultable platform, there is
no such restriction, since the vm_bind can be finished in the fault handler.
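
For example, on a non-faultable platform I assume the expected flow is
something like below (borrowing the notation used earlier in this thread),
with the bind's out fence fed to the job as an in fence:

VM_BIND (vm, va=addr, bo, out_fence=fence1)

EXEC (engine=rcs0, batch=addr, in_fence=fence1)

While on a faultable platform the EXEC could presumably be submitted right
away, with the fault handler completing the bind on demand.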

Should we document this?

Regards,
Oak 


> +
> +VM_BIND features include:
> +
> +* Multiple Virtual Address (VA) mappings can map to the same physical pages
> +  of an object (aliasing).
> +* VA mapping can map to a partial section of the BO (partial binding).
> +* Support capture of persistent mappings in the dump upon GPU error.
> +* TLB is flushed upon unbind completion. Batching of TLB flushes in some
> +  use cases will be helpful.
> +* Asynchronous vm_bind and vm_unbind support with 'in' and 'out' fences.
> +* Support for userptr gem objects (no special uapi is required for this).
> +
> +Execbuff ioctl in VM_BIND mode
> +-------------------------------
> +The execbuff ioctl handling in VM_BIND mode differs significantly from the
> +older method. A VM in VM_BIND mode will not support older execbuff mode of
> +binding. In VM_BIND mode, execbuff ioctl will not accept any execlist. Hence,
> +no support for implicit sync. It is expected that the below work will be able
> +to support requirements of object dependency setting in all use cases:
> +
> +"dma-buf: Add an API for exporting sync files"
> +(https://lwn.net/Articles/859290/)
> +
> +This also means, we need an execbuff extension to pass in the batch
> +buffer addresses (See struct
> drm_i915_gem_execbuffer_ext_batch_addresses).
> +
> +If at all execlist support in execbuff ioctl is deemed necessary for
> +implicit sync in certain use cases, then support can be added later.
> +
> +In VM_BIND mode, VA allocation is completely managed by the user instead of
> +the i915 driver. Hence all VA assignment, eviction are not applicable in
> +VM_BIND mode. Also, for determining object activeness, VM_BIND mode will
> not
> +be using the i915_vma active reference tracking. It will instead use dma-resv
> +object for that (See `VM_BIND dma_resv usage`_).
> +
> +So, a lot of existing code in the execbuff path like relocations, VA evictions,
> +vma lookup table, implicit sync, vma active reference tracking etc., are not
> +applicable in VM_BIND mode. Hence, the execbuff path needs to be cleaned up
> +by clearly separating out the functionalities where the VM_BIND mode differs
> +from older method and they should be moved to separate files.
> +
> +VM_PRIVATE objects
> +-------------------
> +By default, BOs can be mapped on multiple VMs and can also be dma-buf
> +exported. Hence these BOs are referred to as Shared BOs.
> +During each execbuff submission, the request fence must be added to the
> +dma-resv fence list of all shared BOs mapped on the VM.
> +
> +VM_BIND feature introduces an optimization where user can create BO which
> +is private to a specified VM via I915_GEM_CREATE_EXT_VM_PRIVATE flag
> during
> +BO creation. Unlike Shared BOs, these VM private BOs can only be mapped on
> +the VM they are private to and can't be dma-buf exported.
> +All private BOs of a VM share the dma-resv object. Hence during each execbuff
> +submission, they need only one dma-resv fence list updated. Thus, the fast
> +path (where required mappings are already bound) submission latency is O(1)
> +w.r.t the number of VM private BOs.
> +
> +VM_BIND locking hierarchy
> +-------------------------
> +The locking design here supports the older (execlist based) execbuff mode, the
> +newer VM_BIND mode, the VM_BIND mode with GPU page faults and possible
> future
> +system allocator support (See `Shared Virtual Memory (SVM) support`_).
> +The older execbuff mode and the newer VM_BIND mode without page faults
> manages
> +residency of backing storage using dma_fence. The VM_BIND mode with page
> faults
> +and the system allocator support do not use any dma_fence at all.
> +
> +VM_BIND locking order is as below.
> +
> +1) Lock-A: A vm_bind mutex will protect vm_bind lists. This lock is taken in
> +   vm_bind/vm_unbind ioctl calls, in the execbuff path and while releasing the
> +   mapping.
> +
> +   In future, when GPU page faults are supported, we can potentially use a
> +   rwsem instead, so that multiple page fault handlers can take the read side
> +   lock to lookup the mapping and hence can run in parallel.
> +   The older execbuff mode of binding does not need this lock.
> +
> +2) Lock-B: The object's dma-resv lock will protect i915_vma state and needs to
> +   be held while binding/unbinding a vma in the async worker and while updating
> +   dma-resv fence list of an object. Note that private BOs of a VM will all
> +   share a dma-resv object.
> +
> +   The future system allocator support will use the HMM prescribed locking
> +   instead.
> +
> +3) Lock-C: Spinlock/s to protect some of the VM's lists like the list of
> +   invalidated vmas (due to eviction and userptr invalidation) etc.
> +
> +When GPU page faults are supported, the execbuff path does not take any of
> these
> +locks. There we will simply smash the new batch buffer address into the ring
> and
> +then tell the scheduler run that. The lock taking only happens from the page
> +fault handler, where we take lock-A in read mode, whichever lock-B we need to
> +find the backing storage (dma_resv lock for gem objects, and hmm/core mm for
> +system allocator) and some additional locks (lock-D) for taking care of page
> +table races. Page fault mode should not need to ever manipulate the vm lists,
> +so won't ever need lock-C.
> +
> +VM_BIND LRU handling
> +---------------------
> +We need to ensure VM_BIND mapped objects are properly LRU tagged to avoid
> +performance degradation. We will also need support for bulk LRU movement of
> +VM_BIND objects to avoid additional latencies in execbuff path.
> +
> +The page table pages are similar to VM_BIND mapped objects (See
> +`Evictable page table allocations`_) and are maintained per VM and needs to
> +be pinned in memory when VM is made active (ie., upon an execbuff call with
> +that VM). So, bulk LRU movement of page table pages is also needed.
> +
> +The i915 shrinker LRU has stopped being an LRU. So, it should also be moved
> +over to the ttm LRU in some fashion to make sure we once again have a
> reasonable
> +and consistent memory aging and reclaim architecture.
> +
> +VM_BIND dma_resv usage
> +-----------------------
> +Fences need to be added to all VM_BIND mapped objects. During each
> execbuff
> +submission, they are added with DMA_RESV_USAGE_BOOKKEEP usage to
> prevent
> +over sync (See enum dma_resv_usage). One can override it with either
> +DMA_RESV_USAGE_READ or DMA_RESV_USAGE_WRITE usage during object
> dependency
> +setting (either through explicit or implicit mechanism).
> +
> +When vm_bind is called for a non-private object while the VM is already
> +active, the fences need to be copied from VM's shared dma-resv object
> +(common to all private objects of the VM) to this non-private object.
> +If this results in performance degradation, then some optimization will
> +be needed here. This is not a problem for VM's private objects as they use
> +shared dma-resv object which is always updated on each execbuff submission.
> +
> +Also, in VM_BIND mode, use dma-resv apis for determining object activeness
> +(See dma_resv_test_signaled() and dma_resv_wait_timeout()) and do not use
> the
> +older i915_vma active reference tracking which is deprecated. This should be
> +easier to get it working with the current TTM backend. We can remove the
> +i915_vma active reference tracking fully while supporting TTM backend for igfx.
> +
> +Evictable page table allocations
> +---------------------------------
> +Make pagetable allocations evictable and manage them similar to VM_BIND
> +mapped objects. Page table pages are similar to persistent mappings of a
> +VM (difference here are that the page table pages will not have an i915_vma
> +structure and after swapping pages back in, parent page link needs to be
> +updated).
> +
> +Mesa use case
> +--------------
> +VM_BIND can potentially reduce the CPU overhead in Mesa (both Vulkan and
> Iris),
> +hence improving performance of CPU-bound applications. It also allows us to
> +implement Vulkan's Sparse Resources. With increasing GPU hardware
> performance,
> +reducing CPU overhead becomes more impactful.
> +
> +
> +VM_BIND Compute support
> +========================
> +
> +User/Memory Fence
> +------------------
> +The idea is to take a user specified virtual address and install an interrupt
> +handler to wake up the current task when the memory location passes the user
> +supplied filter. User/Memory fence is a <address, value> pair. To signal the
> +user fence, specified value will be written at the specified virtual address
> +and wakeup the waiting process. User can wait on a user fence with the
> +gem_wait_user_fence ioctl.
> +
> +It also allows the user to emit their own MI_FLUSH/PIPE_CONTROL notify
> +interrupt within their batches after updating the value to have sub-batch
> +precision on the wakeup. Each batch can signal a user fence to indicate
> +the completion of next level batch. The completion of very first level batch
> +needs to be signaled by the command streamer. The user must provide the
> +user/memory fence for this via the
> DRM_I915_GEM_EXECBUFFER_EXT_USER_FENCE
> +extension of execbuff ioctl, so that KMD can setup the command streamer to
> +signal it.
> +
> +User/Memory fence can also be supplied to the kernel driver to signal/wake up
> +the user process after completion of an asynchronous operation.
> +
> +When VM_BIND ioctl was provided with a user/memory fence via the
> +I915_VM_BIND_EXT_USER_FENCE extension, it will be signaled upon the
> completion
> +of binding of that mapping. All async binds/unbinds are serialized, hence
> +signaling of user/memory fence also indicate the completion of all previous
> +binds/unbinds.
> +
> +This feature will be derived from the below original work:
> +https://patchwork.freedesktop.org/patch/349417/
> +
> +Long running Compute contexts
> +------------------------------
> +Usage of dma-fence expects that they complete in reasonable amount of time.
> +Compute on the other hand can be long running. Hence it is appropriate for
> +compute to use user/memory fence and dma-fence usage will be limited to
> +in-kernel consumption only. This requires an execbuff uapi extension to pass
> +in user fence (See struct drm_i915_vm_bind_ext_user_fence). Compute must
> opt-in
> +for this mechanism with I915_CONTEXT_CREATE_FLAGS_LONG_RUNNING flag
> during
> +context creation. The dma-fence based user interfaces like gem_wait ioctl and
> +execbuff out fence are not allowed on long running contexts. Implicit sync is
> +not valid as well and is anyway not supported in VM_BIND mode.
> +
> +Where GPU page faults are not available, kernel driver upon buffer invalidation
> +will initiate a suspend (preemption) of long running context with a dma-fence
> +attached to it. And upon completion of that suspend fence, finish the
> +invalidation, revalidate the BO and then resume the compute context. This is
> +done by having a per-context preempt fence (also called suspend fence)
> proxying
> +as i915_request fence. This suspend fence is enabled when someone tries to
> wait
> +on it, which then triggers the context preemption.
> +
> +As this support for context suspension using a preempt fence and the resume
> work
> +for the compute mode contexts can get tricky to get it right, it is better to
> +add this support in drm scheduler so that multiple drivers can make use of it.
> +That means, it will have a dependency on i915 drm scheduler conversion with
> GuC
> +scheduler backend. This should be fine, as the plan is to support compute mode
> +contexts only with GuC scheduler backend (at least initially). This is much
> +easier to support with VM_BIND mode compared to the current heavier
> execbuff
> +path resource attachment.
> +
> +Low Latency Submission
> +-----------------------
> +Allows compute UMD to directly submit GPU jobs instead of through execbuff
> +ioctl. This is made possible by VM_BIND not being synchronized against
> +execbuff. VM_BIND allows bind/unbind of mappings required for the directly
> +submitted jobs.
> +
> +Other VM_BIND use cases
> +========================
> +
> +Debugger
> +---------
> +With debug event interface user space process (debugger) is able to keep track
> +of and act upon resources created by another process (debugged) and attached
> +to GPU via vm_bind interface.
> +
> +GPU page faults
> +----------------
> +GPU page faults when supported (in future), will only be supported in the
> +VM_BIND mode. While both the older execbuff mode and the newer VM_BIND
> mode of
> +binding will require using dma-fence to ensure residency, the GPU page faults
> +mode when supported, will not use any dma-fence as residency is purely
> managed
> +by installing and removing/invalidating page table entries.
> +
> +Page level hints settings
> +--------------------------
> +VM_BIND allows any hints setting per mapping instead of per BO.
> +Possible hints include read-only mapping, placement and atomicity.
> +Sub-BO level placement hint will be even more relevant with
> +upcoming GPU on-demand page fault support.
> +
> +Page level Cache/CLOS settings
> +-------------------------------
> +VM_BIND allows cache/CLOS settings per mapping instead of per BO.
> +
> +Shared Virtual Memory (SVM) support
> +------------------------------------
> +VM_BIND interface can be used to map system memory directly (without gem
> BO
> +abstraction) using the HMM interface. SVM is only supported with GPU page
> +faults enabled.
> +
> +
> +Broader i915 cleanups
> +=====================
> +Supporting this whole new vm_bind mode of binding which comes with its own
> +use cases to support and the locking requirements requires proper integration
> +with the existing i915 driver. This calls for some broader i915 driver
> +cleanups/simplifications for maintainability of the driver going forward.
> +Here are few things identified and are being looked into.
> +
> +- Remove vma lookup cache (eb->gem_context->handles_vma). VM_BIND
> feature
> +  does not use it and the complexity it brings in is probably more than the
> +  performance advantage we get in legacy execbuff case.
> +- Remove vma->open_count counting
> +- Remove i915_vma active reference tracking. VM_BIND feature will not be
> using
> +  it. Instead use underlying BO's dma-resv fence list to determine if a i915_vma
> +  is active or not.
> +
> +
> +VM_BIND UAPI
> +=============
> +
> +.. kernel-doc:: Documentation/gpu/rfc/i915_vm_bind.h
> diff --git a/Documentation/gpu/rfc/index.rst b/Documentation/gpu/rfc/index.rst
> index 91e93a705230..7d10c36b268d 100644
> --- a/Documentation/gpu/rfc/index.rst
> +++ b/Documentation/gpu/rfc/index.rst
> @@ -23,3 +23,7 @@ host such documentation:
>  .. toctree::
> 
>      i915_scheduler.rst
> +
> +.. toctree::
> +
> +    i915_vm_bind.rst
> --
> 2.21.0.rc0.32.g243a4c7e27


^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [Intel-gfx] [RFC v3 1/3] drm/doc/rfc: VM_BIND feature design document
@ 2022-06-02  2:13     ` Zeng, Oak
  0 siblings, 0 replies; 121+ messages in thread
From: Zeng, Oak @ 2022-06-02  2:13 UTC (permalink / raw)
  To: Vishwanathapura, Niranjana, intel-gfx, dri-devel, Vetter,  Daniel
  Cc: Hellstrom, Thomas, Wilson, Chris P, christian.koenig



Regards,
Oak

> -----Original Message-----
> From: dri-devel <dri-devel-bounces@lists.freedesktop.org> On Behalf Of
> Niranjana Vishwanathapura
> Sent: May 17, 2022 2:32 PM
> To: intel-gfx@lists.freedesktop.org; dri-devel@lists.freedesktop.org; Vetter,
> Daniel <daniel.vetter@intel.com>
> Cc: Brost, Matthew <matthew.brost@intel.com>; Hellstrom, Thomas
> <thomas.hellstrom@intel.com>; jason@jlekstrand.net; Wilson, Chris P
> <chris.p.wilson@intel.com>; christian.koenig@amd.com
> Subject: [RFC v3 1/3] drm/doc/rfc: VM_BIND feature design document
> 
> VM_BIND design document with description of intended use cases.
> 
> v2: Add more documentation and format as per review comments
>     from Daniel.
> 
> Signed-off-by: Niranjana Vishwanathapura
> <niranjana.vishwanathapura@intel.com>
> ---
>  Documentation/driver-api/dma-buf.rst   |   2 +
>  Documentation/gpu/rfc/i915_vm_bind.rst | 304
> +++++++++++++++++++++++++
>  Documentation/gpu/rfc/index.rst        |   4 +
>  3 files changed, 310 insertions(+)
>  create mode 100644 Documentation/gpu/rfc/i915_vm_bind.rst
> 
> diff --git a/Documentation/driver-api/dma-buf.rst b/Documentation/driver-
> api/dma-buf.rst
> index 36a76cbe9095..64cb924ec5bb 100644
> --- a/Documentation/driver-api/dma-buf.rst
> +++ b/Documentation/driver-api/dma-buf.rst
> @@ -200,6 +200,8 @@ DMA Fence uABI/Sync File
>  .. kernel-doc:: include/linux/sync_file.h
>     :internal:
> 
> +.. _indefinite_dma_fences:
> +
>  Indefinite DMA Fences
>  ~~~~~~~~~~~~~~~~~~~~~
> 
> diff --git a/Documentation/gpu/rfc/i915_vm_bind.rst
> b/Documentation/gpu/rfc/i915_vm_bind.rst
> new file mode 100644
> index 000000000000..f1be560d313c
> --- /dev/null
> +++ b/Documentation/gpu/rfc/i915_vm_bind.rst
> @@ -0,0 +1,304 @@
> +==========================================
> +I915 VM_BIND feature design and use cases
> +==========================================
> +
> +VM_BIND feature
> +================
> +DRM_I915_GEM_VM_BIND/UNBIND ioctls allows UMD to bind/unbind GEM
> buffer
> +objects (BOs) or sections of a BOs at specified GPU virtual addresses on a
> +specified address space (VM). These mappings (also referred to as persistent
> +mappings) will be persistent across multiple GPU submissions (execbuff calls)
> +issued by the UMD, without user having to provide a list of all required
> +mappings during each submission (as required by older execbuff mode).
> +
> +VM_BIND/UNBIND ioctls will support 'in' and 'out' fences to allow userpace
> +to specify how the binding/unbinding should sync with other operations
> +like the GPU job submission. These fences will be timeline 'drm_syncobj's
> +for non-Compute contexts (See struct
> drm_i915_vm_bind_ext_timeline_fences).
> +For Compute contexts, they will be user/memory fences (See struct
> +drm_i915_vm_bind_ext_user_fence).
> +
> +VM_BIND feature is advertised to user via I915_PARAM_HAS_VM_BIND.
> +User has to opt-in for VM_BIND mode of binding for an address space (VM)
> +during VM creation time via I915_VM_CREATE_FLAGS_USE_VM_BIND
> extension.
> +
> +VM_BIND/UNBIND ioctl will immediately start binding/unbinding the mapping in
> an
> +async worker. The binding and unbinding will work like a special GPU engine.
> +The binding and unbinding operations are serialized and will wait on specified
> +input fences before the operation and will signal the output fences upon the
> +completion of the operation. Due to serialization, completion of an operation
> +will also indicate that all previous operations are also complete.

Hi,

Is user required to wait for the out fence be signaled before submit a gpu job using the vm_bind address?
Or is user required to order the gpu job to make gpu job run after vm_bind out fence signaled?

I think there could be different behavior on a non-faultable platform and a faultable platform, such as on a non-faultable
Platform, gpu job is required to be order after vm_bind out fence signaling; and on a faultable platform, there is no such
Restriction since vm bind can be finished in the fault handler?

Should we document such thing?

Regards,
Oak 


> +
> +VM_BIND features include:
> +
> +* Multiple Virtual Address (VA) mappings can map to the same physical pages
> +  of an object (aliasing).
> +* VA mapping can map to a partial section of the BO (partial binding).
> +* Support capture of persistent mappings in the dump upon GPU error.
> +* TLB is flushed upon unbind completion. Batching of TLB flushes in some
> +  use cases will be helpful.
> +* Asynchronous vm_bind and vm_unbind support with 'in' and 'out' fences.
> +* Support for userptr gem objects (no special uapi is required for this).
> +
> +Execbuff ioctl in VM_BIND mode
> +-------------------------------
> +The execbuff ioctl handling in VM_BIND mode differs significantly from the
> +older method. A VM in VM_BIND mode will not support older execbuff mode of
> +binding. In VM_BIND mode, execbuff ioctl will not accept any execlist. Hence,
> +no support for implicit sync. It is expected that the below work will be able
> +to support requirements of object dependency setting in all use cases:
> +
> +"dma-buf: Add an API for exporting sync files"
> +(https://lwn.net/Articles/859290/)
> +
> +This also means, we need an execbuff extension to pass in the batch
> +buffer addresses (See struct
> drm_i915_gem_execbuffer_ext_batch_addresses).
> +
> +If at all execlist support in execbuff ioctl is deemed necessary for
> +implicit sync in certain use cases, then support can be added later.
> +
> +In VM_BIND mode, VA allocation is completely managed by the user instead of
> +the i915 driver. Hence all VA assignment, eviction are not applicable in
> +VM_BIND mode. Also, for determining object activeness, VM_BIND mode will
> not
> +be using the i915_vma active reference tracking. It will instead use dma-resv
> +object for that (See `VM_BIND dma_resv usage`_).
> +
> +So, a lot of existing code in the execbuff path like relocations, VA evictions,
> +vma lookup table, implicit sync, vma active reference tracking etc., are not
> +applicable in VM_BIND mode. Hence, the execbuff path needs to be cleaned up
> +by clearly separating out the functionalities where the VM_BIND mode differs
> +from older method and they should be moved to separate files.
> +
> +VM_PRIVATE objects
> +-------------------
> +By default, BOs can be mapped on multiple VMs and can also be dma-buf
> +exported. Hence these BOs are referred to as Shared BOs.
> +During each execbuff submission, the request fence must be added to the
> +dma-resv fence list of all shared BOs mapped on the VM.
> +
> +VM_BIND feature introduces an optimization where user can create BO which
> +is private to a specified VM via I915_GEM_CREATE_EXT_VM_PRIVATE flag
> during
> +BO creation. Unlike Shared BOs, these VM private BOs can only be mapped on
> +the VM they are private to and can't be dma-buf exported.
> +All private BOs of a VM share the dma-resv object. Hence during each execbuff
> +submission, they need only one dma-resv fence list updated. Thus, the fast
> +path (where required mappings are already bound) submission latency is O(1)
> +w.r.t the number of VM private BOs.
> +
> +VM_BIND locking hirarchy
> +-------------------------
> +The locking design here supports the older (execlist based) execbuff mode, the
> +newer VM_BIND mode, the VM_BIND mode with GPU page faults and possible
> future
> +system allocator support (See `Shared Virtual Memory (SVM) support`_).
> +The older execbuff mode and the newer VM_BIND mode without page faults
> manages
> +residency of backing storage using dma_fence. The VM_BIND mode with page
> faults
> +and the system allocator support do not use any dma_fence at all.
> +
> +VM_BIND locking order is as below.
> +
> +1) Lock-A: A vm_bind mutex will protect vm_bind lists. This lock is taken in
> +   vm_bind/vm_unbind ioctl calls, in the execbuff path and while releasing the
> +   mapping.
> +
> +   In future, when GPU page faults are supported, we can potentially use a
> +   rwsem instead, so that multiple page fault handlers can take the read side
> +   lock to lookup the mapping and hence can run in parallel.
> +   The older execbuff mode of binding do not need this lock.
> +
> +2) Lock-B: The object's dma-resv lock will protect i915_vma state and needs to
> +   be held while binding/unbinding a vma in the async worker and while updating
> +   dma-resv fence list of an object. Note that private BOs of a VM will all
> +   share a dma-resv object.
> +
> +   The future system allocator support will use the HMM prescribed locking
> +   instead.
> +
> +3) Lock-C: Spinlock/s to protect some of the VM's lists like the list of
> +   invalidated vmas (due to eviction and userptr invalidation) etc.
> +
> +When GPU page faults are supported, the execbuff path do not take any of
> these
> +locks. There we will simply smash the new batch buffer address into the ring
> and
> +then tell the scheduler run that. The lock taking only happens from the page
> +fault handler, where we take lock-A in read mode, whichever lock-B we need to
> +find the backing storage (dma_resv lock for gem objects, and hmm/core mm for
> +system allocator) and some additional locks (lock-D) for taking care of page
> +table races. Page fault mode should not need to ever manipulate the vm lists,
> +so won't ever need lock-C.
> +
> +VM_BIND LRU handling
> +---------------------
> +We need to ensure VM_BIND mapped objects are properly LRU tagged to avoid
> +performance degradation. We will also need support for bulk LRU movement of
> +VM_BIND objects to avoid additional latencies in execbuff path.
> +
> +The page table pages are similar to VM_BIND mapped objects (See
> +`Evictable page table allocations`_) and are maintained per VM and needs to
> +be pinned in memory when VM is made active (ie., upon an execbuff call with
> +that VM). So, bulk LRU movement of page table pages is also needed.
> +
> +The i915 shrinker LRU has stopped being an LRU. So, it should also be moved
> +over to the ttm LRU in some fashion to make sure we once again have a
> reasonable
> +and consistent memory aging and reclaim architecture.
> +
> +VM_BIND dma_resv usage
> +-----------------------
> +Fences needs to be added to all VM_BIND mapped objects. During each
> execbuff
> +submission, they are added with DMA_RESV_USAGE_BOOKKEEP usage to
> prevent
> +over sync (See enum dma_resv_usage). One can override it with either
> +DMA_RESV_USAGE_READ or DMA_RESV_USAGE_WRITE usage during object
> dependency
> +setting (either through explicit or implicit mechanism).
> +
> +When vm_bind is called for a non-private object while the VM is already
> +active, the fences need to be copied from VM's shared dma-resv object
> +(common to all private objects of the VM) to this non-private object.
> +If this results in performance degradation, then some optimization will
> +be needed here. This is not a problem for VM's private objects as they use
> +shared dma-resv object which is always updated on each execbuff submission.
> +
> +Also, in VM_BIND mode, use dma-resv apis for determining object activeness
> +(See dma_resv_test_signaled() and dma_resv_wait_timeout()) and do not use
> the
> +older i915_vma active reference tracking which is deprecated. This should be
> +easier to get it working with the current TTM backend. We can remove the
> +i915_vma active reference tracking fully while supporting TTM backend for igfx.
> +
> +Evictable page table allocations
> +---------------------------------
> +Make pagetable allocations evictable and manage them similar to VM_BIND
> +mapped objects. Page table pages are similar to persistent mappings of a
> +VM (difference here are that the page table pages will not have an i915_vma
> +structure and after swapping pages back in, parent page link needs to be
> +updated).
> +
> +Mesa use case
> +--------------
> +VM_BIND can potentially reduce the CPU overhead in Mesa (both Vulkan and
> Iris),
> +hence improving performance of CPU-bound applications. It also allows us to
> +implement Vulkan's Sparse Resources. With increasing GPU hardware
> performance,
> +reducing CPU overhead becomes more impactful.
> +
> +
> +VM_BIND Compute support
> +========================
> +
> +User/Memory Fence
> +------------------
> +The idea is to take a user specified virtual address and install an interrupt
> +handler to wake up the current task when the memory location passes the user
> +supplied filter. User/Memory fence is a <address, value> pair. To signal the
> +user fence, specified value will be written at the specified virtual address
> +and wakeup the waiting process. User can wait on a user fence with the
> +gem_wait_user_fence ioctl.
> +
> +It also allows the user to emit their own MI_FLUSH/PIPE_CONTROL notify
> +interrupt within their batches after updating the value to have sub-batch
> +precision on the wakeup. Each batch can signal a user fence to indicate
> +the completion of next level batch. The completion of very first level batch
> +needs to be signaled by the command streamer. The user must provide the
> +user/memory fence for this via the
> DRM_I915_GEM_EXECBUFFER_EXT_USER_FENCE
> +extension of execbuff ioctl, so that KMD can setup the command streamer to
> +signal it.
> +
> +User/Memory fence can also be supplied to the kernel driver to signal/wake up
> +the user process after completion of an asynchronous operation.
> +
> +When VM_BIND ioctl was provided with a user/memory fence via the
> +I915_VM_BIND_EXT_USER_FENCE extension, it will be signaled upon the
> completion
> +of binding of that mapping. All async binds/unbinds are serialized, hence
> +signaling of user/memory fence also indicate the completion of all previous
> +binds/unbinds.
> +
> +This feature will be derived from the below original work:
> +https://patchwork.freedesktop.org/patch/349417/
> +
> +Long running Compute contexts
> +------------------------------
> +Usage of dma-fences expects that they complete in a reasonable amount of
> +time. Compute, on the other hand, can be long running. Hence it is
> +appropriate for compute to use user/memory fences, and dma-fence usage will
> +be limited to in-kernel consumption only. This requires an execbuff uapi
> +extension to pass in the user fence (see struct
> +drm_i915_vm_bind_ext_user_fence). Compute must opt in to this mechanism with
> +the I915_CONTEXT_CREATE_FLAGS_LONG_RUNNING flag during context creation. The
> +dma-fence based user interfaces like the gem_wait ioctl and the execbuff out
> +fence are not allowed on long running contexts. Implicit sync is not valid
> +either, and is anyway not supported in VM_BIND mode.
> +
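As a sketch, the opt-in could look like this at context creation time
(I915_CONTEXT_CREATE_FLAGS_LONG_RUNNING is the flag proposed above and not
in current i915_drm.h; the surrounding structure is the existing
context-create uapi, and the final shape may differ):

	struct drm_i915_gem_context_create_ext create = {
		/* Proposed flag from this RFC. */
		.flags = I915_CONTEXT_CREATE_FLAGS_LONG_RUNNING,
	};

	/* On success, create.ctx_id holds the new long running context. */
	ioctl(drm_fd, DRM_IOCTL_I915_GEM_CONTEXT_CREATE_EXT, &create);
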
> +Where GPU page faults are not available, the kernel driver, upon buffer
> +invalidation, will initiate a suspend (preemption) of the long running
> +context with a dma-fence attached to it. Upon completion of that suspend
> +fence, it will finish the invalidation, revalidate the BO and then resume
> +the compute context. This is done by having a per-context preempt fence
> +(also called a suspend fence) proxying as the i915_request fence. This
> +suspend fence is enabled when someone tries to wait on it, which then
> +triggers the context preemption.
> +
> +As this support for context suspension using a preempt fence, and the resume
> +work for compute mode contexts, can be tricky to get right, it is better to
> +add this support in the drm scheduler so that multiple drivers can make use
> +of it. That means it will have a dependency on the i915 drm scheduler
> +conversion with the GuC scheduler backend. This should be fine, as the plan
> +is to support compute mode contexts only with the GuC scheduler backend (at
> +least initially). This is much easier to support with VM_BIND mode compared
> +to the current heavier execbuff path resource attachment.
> +
> +Low Latency Submission
> +-----------------------
> +Allows the compute UMD to directly submit GPU jobs instead of going through
> +the execbuff ioctl. This is made possible because VM_BIND is not
> +synchronized against execbuff. VM_BIND allows bind/unbind of the mappings
> +required for the directly submitted jobs.
> +
> +Other VM_BIND use cases
> +========================
> +
> +Debugger
> +---------
> +With the debug event interface, a user space process (the debugger) is able
> +to keep track of and act upon resources created by another process (the
> +debuggee) and attached to the GPU via the vm_bind interface.
> +
> +GPU page faults
> +----------------
> +GPU page faults, when supported (in the future), will only be supported in
> +VM_BIND mode. While both the older execbuff mode and the newer VM_BIND mode
> +of binding will require using dma-fences to ensure residency, the GPU page
> +faults mode, when supported, will not use any dma-fence, as residency is
> +purely managed by installing and removing/invalidating page table entries.
> +
> +Page level hints settings
> +--------------------------
> +VM_BIND allows setting hints per mapping instead of per BO.
> +Possible hints include read-only mapping, placement and atomicity.
> +Sub-BO level placement hints will become even more relevant with
> +the upcoming GPU on-demand page fault support.
> +
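As a sketch, such a hint could be passed as a flag on the bind call itself.
The struct below follows this RFC's proposed vm_bind uapi, but treat the
field and flag names as illustrative rather than final:

	struct drm_i915_gem_vm_bind bind = {
		.vm_id = vm_id,			/* VM to bind into */
		.handle = bo_handle,		/* GEM object handle */
		.start = gpu_va,		/* GPU virtual address */
		.offset = 0,			/* offset into the object */
		.length = size,			/* length of the mapping */
		.flags = I915_GEM_VM_BIND_READONLY, /* per-mapping hint */
	};

	ioctl(drm_fd, DRM_IOCTL_I915_GEM_VM_BIND, &bind);
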
> +Page level Cache/CLOS settings
> +-------------------------------
> +VM_BIND allows cache/CLOS settings per mapping instead of per BO.
> +
> +Shared Virtual Memory (SVM) support
> +------------------------------------
> +The VM_BIND interface can be used to map system memory directly (without the
> +gem BO abstraction) using the HMM interface. SVM is only supported with GPU
> +page faults enabled.
> +
> +
> +Broader i915 cleanups
> +=====================
> +Supporting this whole new vm_bind mode of binding, which comes with its own
> +use cases to support and its own locking requirements, requires proper
> +integration with the existing i915 driver. This calls for some broader i915
> +driver cleanups/simplifications for maintainability of the driver going
> +forward. Here are a few things that have been identified and are being
> +looked into.
> +
> +- Remove the vma lookup cache (eb->gem_context->handles_vma). The VM_BIND
> +  feature does not use it, and the complexity it brings in is probably more
> +  than the performance advantage we get in the legacy execbuff case.
> +- Remove vma->open_count counting.
> +- Remove i915_vma active reference tracking. The VM_BIND feature will not
> +  be using it. Instead, use the underlying BO's dma-resv fence list to
> +  determine if an i915_vma is active or not.
> +
> +
> +VM_BIND UAPI
> +=============
> +
> +.. kernel-doc:: Documentation/gpu/rfc/i915_vm_bind.h
> diff --git a/Documentation/gpu/rfc/index.rst b/Documentation/gpu/rfc/index.rst
> index 91e93a705230..7d10c36b268d 100644
> --- a/Documentation/gpu/rfc/index.rst
> +++ b/Documentation/gpu/rfc/index.rst
> @@ -23,3 +23,7 @@ host such documentation:
>  .. toctree::
> 
>      i915_scheduler.rst
> +
> +.. toctree::
> +
> +    i915_vm_bind.rst
> --
> 2.21.0.rc0.32.g243a4c7e27


^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [Intel-gfx] [RFC v3 3/3] drm/doc/rfc: VM_BIND uapi definition
  2022-06-01  9:27           ` Daniel Vetter
@ 2022-06-02  5:08             ` Niranjana Vishwanathapura
  -1 siblings, 0 replies; 121+ messages in thread
From: Niranjana Vishwanathapura @ 2022-06-02  5:08 UTC (permalink / raw)
  To: Daniel Vetter
  Cc: Zanoni, Paulo R, intel-gfx, dri-devel, Hellstrom, Thomas, Wilson,
	Chris P, Vetter, Daniel, christian.koenig

On Wed, Jun 01, 2022 at 11:27:17AM +0200, Daniel Vetter wrote:
>On Wed, 1 Jun 2022 at 11:03, Dave Airlie <airlied@gmail.com> wrote:
>>
>> On Tue, 24 May 2022 at 05:20, Niranjana Vishwanathapura
>> <niranjana.vishwanathapura@intel.com> wrote:
>> >
>> > On Thu, May 19, 2022 at 04:07:30PM -0700, Zanoni, Paulo R wrote:
>> > >On Tue, 2022-05-17 at 11:32 -0700, Niranjana Vishwanathapura wrote:
>> > >> VM_BIND and related uapi definitions
>> > >>
>> > >> v2: Ensure proper kernel-doc formatting with cross references.
>> > >>     Also add new uapi and documentation as per review comments
>> > >>     from Daniel.
>> > >>
>> > >> Signed-off-by: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>
>> > >> ---
>> > >>  Documentation/gpu/rfc/i915_vm_bind.h | 399 +++++++++++++++++++++++++++
>> > >>  1 file changed, 399 insertions(+)
>> > >>  create mode 100644 Documentation/gpu/rfc/i915_vm_bind.h
>> > >>
>> > >> diff --git a/Documentation/gpu/rfc/i915_vm_bind.h b/Documentation/gpu/rfc/i915_vm_bind.h
>> > >> new file mode 100644
>> > >> index 000000000000..589c0a009107
>> > >> --- /dev/null
>> > >> +++ b/Documentation/gpu/rfc/i915_vm_bind.h
>> > >> @@ -0,0 +1,399 @@
>> > >> +/* SPDX-License-Identifier: MIT */
>> > >> +/*
>> > >> + * Copyright © 2022 Intel Corporation
>> > >> + */
>> > >> +
>> > >> +/**
>> > >> + * DOC: I915_PARAM_HAS_VM_BIND
>> > >> + *
>> > >> + * VM_BIND feature availability.
>> > >> + * See typedef drm_i915_getparam_t param.
>> > >> + */
>> > >> +#define I915_PARAM_HAS_VM_BIND               57
>> > >> +
>> > >> +/**
>> > >> + * DOC: I915_VM_CREATE_FLAGS_USE_VM_BIND
>> > >> + *
>> > >> + * Flag to opt-in for VM_BIND mode of binding during VM creation.
>> > >> + * See struct drm_i915_gem_vm_control flags.
>> > >> + *
>> > >> + * A VM in VM_BIND mode will not support the older execbuff mode of binding.
>> > >> + * In VM_BIND mode, execbuff ioctl will not accept any execlist (ie., the
>> > >> + * &drm_i915_gem_execbuffer2.buffer_count must be 0).
>> > >> + * Also, &drm_i915_gem_execbuffer2.batch_start_offset and
>> > >> + * &drm_i915_gem_execbuffer2.batch_len must be 0.
>> > >> + * DRM_I915_GEM_EXECBUFFER_EXT_BATCH_ADDRESSES extension must be provided
>> > >> + * to pass in the batch buffer addresses.
>> > >> + *
>> > >> + * Additionally, I915_EXEC_NO_RELOC, I915_EXEC_HANDLE_LUT and
>> > >> + * I915_EXEC_BATCH_FIRST of &drm_i915_gem_execbuffer2.flags must be 0
>> > >> + * (not used) in VM_BIND mode. I915_EXEC_USE_EXTENSIONS flag must always be
>> > >> + * set (See struct drm_i915_gem_execbuffer_ext_batch_addresses).
>> > >> + * The buffers_ptr, buffer_count, batch_start_offset and batch_len fields
>> > >> + * of struct drm_i915_gem_execbuffer2 are also not used and must be 0.
>> > >> + */
>> > >
>> > >From that description, it seems we have:
>> > >
>> > >struct drm_i915_gem_execbuffer2 {
>> > >        __u64 buffers_ptr;              -> must be 0 (new)
>> > >        __u32 buffer_count;             -> must be 0 (new)
>> > >        __u32 batch_start_offset;       -> must be 0 (new)
>> > >        __u32 batch_len;                -> must be 0 (new)
>> > >        __u32 DR1;                      -> must be 0 (old)
>> > >        __u32 DR4;                      -> must be 0 (old)
>> > >        __u32 num_cliprects; (fences)   -> must be 0 since using extensions
>> > >        __u64 cliprects_ptr; (fences, extensions) -> contains an actual pointer!
>> > >        __u64 flags;                    -> some flags must be 0 (new)
>> > >        __u64 rsvd1; (context info)     -> repurposed field (old)
>> > >        __u64 rsvd2;                    -> unused
>> > >};
>> > >
>> > >Based on that, why can't we just get drm_i915_gem_execbuffer3 instead
>> > >of adding even more complexity to an already abused interface? While
>> > >the Vulkan-like extension thing is really nice, I don't think what
>> > >we're doing here is extending the ioctl usage, we're completely
>> > >changing how the base struct should be interpreted based on how the VM
>> > >was created (which is an entirely different ioctl).
>> > >
>> > >From Rusty Russel's API Design grading, drm_i915_gem_execbuffer2 is
>> > >already at -6 without these changes. I think after vm_bind we'll need
>> > >to create a -11 entry just to deal with this ioctl.
>> > >
>> >
>> > The only change here is removing the execlist support for VM_BIND
>> > mode (other than natural extensions).
>> > Adding a new execbuffer3 was considered, but I think we need to be careful
>> > with that as that goes beyond the VM_BIND support, including any future
>> > requirements (as we don't want an execbuffer4 after VM_BIND).
>>
>> Why not? it's not like adding extensions here is really that different
>> than adding new ioctls.
>>
>> I definitely think this deserves an execbuffer3 without even
>> considering future requirements. Just  to burn down the old
>> requirements and pointless fields.
>>
>> Make execbuffer3 be vm bind only, no relocs, no legacy bits, leave the
>> older sw on execbuf2 for ever.
>
>I guess another point in favour of execbuf3 would be that it's less
>midlayer. If we share the entry point then there's quite a few vfuncs
>needed to cleanly split out the vm_bind paths from the legacy
>reloc/softping paths.
>
>If we invert this and do execbuf3, then there's the existing ioctl
>vfunc, and then we share code (where it even makes sense, probably
>request setup/submit need to be shared, anything else is probably
>cleaner to just copypaste) with the usual helper approach.
>
>Also that would guarantee that really none of the old concepts like
>i915_active on the vma or vma open counts and all that stuff leaks
>into the new vm_bind execbuf.
>
>Finally I also think that copypasting would make backporting easier,
>or at least more flexible, since it should make it easier to have the
>upstream vm_bind co-exist with all the other things we have. Without
>huge amounts of conflicts (or at least much less) that pushing a pile
>of vfuncs into the existing code would cause.
>
>So maybe we should do this?

Thanks Dave, Daniel.
There are a few things that will be common between execbuf2 and
execbuf3, like request setup/submit (as you said), fence handling
(timeline fences, fence array, composite fences), engine selection,
etc. Also, many of the 'flags' will be there in execbuf3 as well (but
their bit positions will differ).
But I guess these should be fine, as the suggestion here is to
copy-paste the execbuff code, sharing code where possible.
Besides, we can stop supporting some older features in execbuf3
(like the fence array, in favor of the newer timeline fences), which
will further reduce the common code.

Ok, I will update this series by adding execbuf3 and send out soon.
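
For reference, a rough sketch of what a stripped-down execbuf3 could look
like under this proposal (field names below are illustrative assumptions,
not an agreed uapi):

	/* Illustrative only: a VM_BIND-only execbuf3 with the legacy
	 * execlist/reloc fields burned down, as suggested above.
	 */
	struct drm_i915_gem_execbuffer3 {
		__u32 ctx_id;		/* previously rsvd1: context handle */
		__u32 engine_idx;	/* engine selection */
		__u64 batch_address;	/* VA of the batch in VM_BIND mode */
		__u64 flags;		/* fresh flag space, no legacy bits */
		__u32 fence_count;	/* timeline fences only, no fence array */
		__u32 pad;
		__u64 timeline_fences;	/* pointer to the timeline fence array */
		__u64 extensions;	/* zero-terminated extension chain */
	};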

Niranjana

>-Daniel
>-- 
>Daniel Vetter
>Software Engineer, Intel Corporation
>http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [Intel-gfx] [RFC v3 1/3] drm/doc/rfc: VM_BIND feature design document
  2022-06-01 21:18       ` Matthew Brost
@ 2022-06-02  5:42         ` Lionel Landwerlin
  -1 siblings, 0 replies; 121+ messages in thread
From: Lionel Landwerlin @ 2022-06-02  5:42 UTC (permalink / raw)
  To: Matthew Brost
  Cc: intel-gfx, dri-devel, thomas.hellstrom, chris.p.wilson,
	daniel.vetter, Niranjana Vishwanathapura, christian.koenig

On 02/06/2022 00:18, Matthew Brost wrote:
> On Wed, Jun 01, 2022 at 05:25:49PM +0300, Lionel Landwerlin wrote:
>> On 17/05/2022 21:32, Niranjana Vishwanathapura wrote:
>>> +VM_BIND/UNBIND ioctl will immediately start binding/unbinding the mapping in an
>>> +async worker. The binding and unbinding will work like a special GPU engine.
>>> +The binding and unbinding operations are serialized and will wait on specified
>>> +input fences before the operation and will signal the output fences upon the
>>> +completion of the operation. Due to serialization, completion of an operation
>>> +will also indicate that all previous operations are also complete.
>> I guess we should avoid saying "will immediately start binding/unbinding" if
>> there are fences involved.
>>
>> And the fact that it's happening in an async worker seem to imply it's not
>> immediate.
>>
>>
>> I have a question on the behavior of the bind operation when no input fence
>> is provided. Let say I do :
>>
>> VM_BIND (out_fence=fence1)
>>
>> VM_BIND (out_fence=fence2)
>>
>> VM_BIND (out_fence=fence3)
>>
>>
>> In what order are the fences going to be signaled?
>>
>> In the order of VM_BIND ioctls? Or out of order?
>>
>> Because you wrote "serialized" I assume it's: in order
>>
>>
>> One thing I didn't realize is that because we only get one "VM_BIND" engine,
>> there is a disconnect from the Vulkan specification.
>>
>> In Vulkan VM_BIND operations are serialized but per engine.
>>
>> So you could have something like this :
>>
>> VM_BIND (engine=rcs0, in_fence=fence1, out_fence=fence2)
>>
>> VM_BIND (engine=ccs0, in_fence=fence3, out_fence=fence4)
>>
> Question - let's say this done after the above operations:
>
> EXEC (engine=ccs0, in_fence=NULL, out_fence=NULL)
>
> Is the exec ordered with respected to bind (i.e. would fence3 & 4 be
> signaled before the exec starts)?
>
> Matt


Hi Matt,

From the Vulkan point of view, everything is serialized within an
engine (we map that to a VkQueue).

So with :

EXEC (engine=ccs0, in_fence=NULL, out_fence=NULL)
VM_BIND (engine=ccs0, in_fence=fence3, out_fence=fence4)

EXEC completes first then VM_BIND executes.


To be even clearer :

EXEC (engine=ccs0, in_fence=fence2, out_fence=NULL)
VM_BIND (engine=ccs0, in_fence=fence3, out_fence=fence4)


EXEC will wait until fence2 is signaled.
Once fence2 is signaled, EXEC proceeds, finishes and only after it is done, VM_BIND executes.

It would be kind of like having the VM_BIND operation be another batch executed from the ring buffer.

-Lionel


>
>> fence1 is not signaled
>>
>> fence3 is signaled
>>
>> So the second VM_BIND will proceed before the first VM_BIND.
>>
>>
>> I guess we can deal with that scenario in userspace by doing the wait
>> ourselves in one thread per engines.
>>
>> But then it makes the VM_BIND input fences useless.
>>
>>
>> Daniel : what do you think? Should be rework this or just deal with wait
>> fences in userspace?
>>
>>
>> Sorry I noticed this late.
>>
>>
>> -Lionel
>>
>>


^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [Intel-gfx] [RFC v3 1/3] drm/doc/rfc: VM_BIND feature design document
  2022-06-02  5:42         ` Lionel Landwerlin
@ 2022-06-02 16:22           ` Matthew Brost
  -1 siblings, 0 replies; 121+ messages in thread
From: Matthew Brost @ 2022-06-02 16:22 UTC (permalink / raw)
  To: Lionel Landwerlin
  Cc: intel-gfx, dri-devel, thomas.hellstrom, chris.p.wilson,
	daniel.vetter, Niranjana Vishwanathapura, christian.koenig

On Thu, Jun 02, 2022 at 08:42:13AM +0300, Lionel Landwerlin wrote:
> On 02/06/2022 00:18, Matthew Brost wrote:
> > On Wed, Jun 01, 2022 at 05:25:49PM +0300, Lionel Landwerlin wrote:
> > > On 17/05/2022 21:32, Niranjana Vishwanathapura wrote:
> > > > +VM_BIND/UNBIND ioctl will immediately start binding/unbinding the mapping in an
> > > > +async worker. The binding and unbinding will work like a special GPU engine.
> > > > +The binding and unbinding operations are serialized and will wait on specified
> > > > +input fences before the operation and will signal the output fences upon the
> > > > +completion of the operation. Due to serialization, completion of an operation
> > > > +will also indicate that all previous operations are also complete.
> > > I guess we should avoid saying "will immediately start binding/unbinding" if
> > > there are fences involved.
> > > 
> > > And the fact that it's happening in an async worker seem to imply it's not
> > > immediate.
> > > 
> > > 
> > > I have a question on the behavior of the bind operation when no input fence
> > > is provided. Let say I do :
> > > 
> > > VM_BIND (out_fence=fence1)
> > > 
> > > VM_BIND (out_fence=fence2)
> > > 
> > > VM_BIND (out_fence=fence3)
> > > 
> > > 
> > > In what order are the fences going to be signaled?
> > > 
> > > In the order of VM_BIND ioctls? Or out of order?
> > > 
> > > Because you wrote "serialized" I assume it's: in order
> > > 
> > > 
> > > One thing I didn't realize is that because we only get one "VM_BIND" engine,
> > > there is a disconnect from the Vulkan specification.
> > > 
> > > In Vulkan VM_BIND operations are serialized but per engine.
> > > 
> > > So you could have something like this :
> > > 
> > > VM_BIND (engine=rcs0, in_fence=fence1, out_fence=fence2)
> > > 
> > > VM_BIND (engine=ccs0, in_fence=fence3, out_fence=fence4)
> > > 
> > Question - let's say this done after the above operations:
> > 
> > EXEC (engine=ccs0, in_fence=NULL, out_fence=NULL)
> > 
> > Is the exec ordered with respected to bind (i.e. would fence3 & 4 be
> > signaled before the exec starts)?
> > 
> > Matt
> 
> 
> Hi Matt,
> 
> From the vulkan point of view, everything is serialized within an engine (we
> map that to a VkQueue).
> 
> So with :
> 
> EXEC (engine=ccs0, in_fence=NULL, out_fence=NULL)
> VM_BIND (engine=ccs0, in_fence=fence3, out_fence=fence4)
> 
> EXEC completes first then VM_BIND executes.
> 
> 
> To be even clearer :
> 
> EXEC (engine=ccs0, in_fence=fence2, out_fence=NULL)
> VM_BIND (engine=ccs0, in_fence=fence3, out_fence=fence4)
> 
> 
> EXEC will wait until fence2 is signaled.
> Once fence2 is signaled, EXEC proceeds, finishes and only after it is done, VM_BIND executes.
> 
> It would kind of like having the VM_BIND operation be another batch executed from the ringbuffer buffer.
> 

Yea this makes sense. I think of VM_BINDs as more or less just another
version of an EXEC and this fits with that.

In practice I don't think we can share a ring, but we should be able to
present an engine (again, likely a gem context in i915) to the user that
orders VM_BINDs / EXECs, if that is what Vulkan expects, at least I think.

Hopefully Niranjana + Daniel agree.

Matt

> -Lionel
> 
> 
> > 
> > > fence1 is not signaled
> > > 
> > > fence3 is signaled
> > > 
> > > So the second VM_BIND will proceed before the first VM_BIND.
> > > 
> > > 
> > > I guess we can deal with that scenario in userspace by doing the wait
> > > ourselves in one thread per engines.
> > > 
> > > But then it makes the VM_BIND input fences useless.
> > > 
> > > 
> > > Daniel : what do you think? Should be rework this or just deal with wait
> > > fences in userspace?
> > > 
> > > 
> > > Sorry I noticed this late.
> > > 
> > > 
> > > -Lionel
> > > 
> > > 
> 

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [Intel-gfx] [RFC v3 1/3] drm/doc/rfc: VM_BIND feature design document
  2022-06-01 20:28       ` Matthew Brost
@ 2022-06-02 20:11         ` Niranjana Vishwanathapura
  -1 siblings, 0 replies; 121+ messages in thread
From: Niranjana Vishwanathapura @ 2022-06-02 20:11 UTC (permalink / raw)
  To: Matthew Brost
  Cc: chris.p.wilson, intel-gfx, dri-devel, thomas.hellstrom,
	Lionel Landwerlin, daniel.vetter, christian.koenig

On Wed, Jun 01, 2022 at 01:28:36PM -0700, Matthew Brost wrote:
>On Wed, Jun 01, 2022 at 05:25:49PM +0300, Lionel Landwerlin wrote:
>> On 17/05/2022 21:32, Niranjana Vishwanathapura wrote:
>> > +VM_BIND/UNBIND ioctl will immediately start binding/unbinding the mapping in an
>> > +async worker. The binding and unbinding will work like a special GPU engine.
>> > +The binding and unbinding operations are serialized and will wait on specified
>> > +input fences before the operation and will signal the output fences upon the
>> > +completion of the operation. Due to serialization, completion of an operation
>> > +will also indicate that all previous operations are also complete.
>>
>> I guess we should avoid saying "will immediately start binding/unbinding" if
>> there are fences involved.
>>
>> And the fact that it's happening in an async worker seem to imply it's not
>> immediate.
>>

Ok, will fix.
This was added because in the earlier design binding was deferred until the
next execbuff. But now it is non-deferred (immediate in that sense). But yeah,
this is confusing and I will fix it.

>>
>> I have a question on the behavior of the bind operation when no input fence
>> is provided. Let say I do :
>>
>> VM_BIND (out_fence=fence1)
>>
>> VM_BIND (out_fence=fence2)
>>
>> VM_BIND (out_fence=fence3)
>>
>>
>> In what order are the fences going to be signaled?
>>
>> In the order of VM_BIND ioctls? Or out of order?
>>
>> Because you wrote "serialized" I assume it's: in order
>>

Yes, in the order of VM_BIND/UNBIND ioctls. Note that bind and unbind will use
the same queue and hence are ordered.

>>
>> One thing I didn't realize is that because we only get one "VM_BIND" engine,
>> there is a disconnect from the Vulkan specification.
>>
>> In Vulkan VM_BIND operations are serialized but per engine.
>>
>> So you could have something like this :
>>
>> VM_BIND (engine=rcs0, in_fence=fence1, out_fence=fence2)
>>
>> VM_BIND (engine=ccs0, in_fence=fence3, out_fence=fence4)
>>
>>
>> fence1 is not signaled
>>
>> fence3 is signaled
>>
>> So the second VM_BIND will proceed before the first VM_BIND.
>>
>>
>> I guess we can deal with that scenario in userspace by doing the wait
>> ourselves in one thread per engines.
>>
>> But then it makes the VM_BIND input fences useless.
>>
>>
>> Daniel : what do you think? Should be rework this or just deal with wait
>> fences in userspace?
>>
>
>My opinion is rework this but make the ordering via an engine param optional.
>
>e.g. A VM can be configured so all binds are ordered within the VM
>
>e.g. A VM can be configured so all binds accept an engine argument (in
>the case of the i915 likely this is a gem context handle) and binds
>ordered with respect to that engine.
>
>This gives UMDs options as the later likely consumes more KMD resources
>so if a different UMD can live with binds being ordered within the VM
>they can use a mode consuming less resources.
>

I think we need to be careful here if we are looking for some out of
(submission) order completion of vm_bind/unbind.
In-order completion means that, in a batch of binds and unbinds to be
completed in-order, the user only needs to specify an in-fence for the
first bind/unbind call and an out-fence for the last bind/unbind
call. Also, the VA released by an unbind call can be re-used by
any subsequent bind call in that in-order batch.

These things will break if binding/unbinding were allowed to
go out of (submission) order, and the user would need to be extra careful
not to run into premature triggering of the out-fence, binds failing
because the VA is still in use, etc.

Also, VM_BIND binds the provided mapping on the specified address space
(VM). So, the uapi is not engine/context specific.

We can however add a 'queue' to the uapi which can be one from the
pre-defined queues,
I915_VM_BIND_QUEUE_0
I915_VM_BIND_QUEUE_1
...
I915_VM_BIND_QUEUE_(N-1)

KMD will spawn an async work queue for each queue which will only
bind the mappings on that queue in the order of submission.
User can assign the queue to per engine or anything like that.
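
A hypothetical sketch of how such a queue could be passed in, say as a
chained extension on the vm_bind ioctl (struct i915_user_extension is
existing uapi; the queue extension itself is purely illustrative):

	#define I915_VM_BIND_QUEUE_0	0
	#define I915_VM_BIND_QUEUE_1	1
	/* ... up to I915_VM_BIND_QUEUE_(N-1) */

	struct drm_i915_vm_bind_ext_queue {
		struct i915_user_extension base;
		__u32 queue;	/* binds on one queue complete in order */
		__u32 pad;
	};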

But again here, the user needs to be careful not to deadlock these
queues with circular fence dependencies.

I prefer adding this later as an extension, based on whether it
really helps with the implementation.

Daniel, any thoughts?

Niranjana

>Matt
>
>>
>> Sorry I noticed this late.
>>
>>
>> -Lionel
>>
>>

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [Intel-gfx] [RFC v3 1/3] drm/doc/rfc: VM_BIND feature design document
  2022-06-02  5:42         ` Lionel Landwerlin
@ 2022-06-02 20:16           ` Bas Nieuwenhuizen
  -1 siblings, 0 replies; 121+ messages in thread
From: Bas Nieuwenhuizen @ 2022-06-02 20:16 UTC (permalink / raw)
  To: Lionel Landwerlin
  Cc: Matthew Brost, Intel Graphics Development, chris.p.wilson,
	Thomas Hellström, ML dri-devel, Daniel Vetter,
	Niranjana Vishwanathapura, Koenig, Christian

On Thu, Jun 2, 2022 at 7:42 AM Lionel Landwerlin
<lionel.g.landwerlin@intel.com> wrote:
>
> On 02/06/2022 00:18, Matthew Brost wrote:
> > On Wed, Jun 01, 2022 at 05:25:49PM +0300, Lionel Landwerlin wrote:
> >> On 17/05/2022 21:32, Niranjana Vishwanathapura wrote:
> >>> +VM_BIND/UNBIND ioctl will immediately start binding/unbinding the mapping in an
> >>> +async worker. The binding and unbinding will work like a special GPU engine.
> >>> +The binding and unbinding operations are serialized and will wait on specified
> >>> +input fences before the operation and will signal the output fences upon the
> >>> +completion of the operation. Due to serialization, completion of an operation
> >>> +will also indicate that all previous operations are also complete.
> >> I guess we should avoid saying "will immediately start binding/unbinding" if
> >> there are fences involved.
> >>
> >> And the fact that it's happening in an async worker seem to imply it's not
> >> immediate.
> >>
> >>
> >> I have a question on the behavior of the bind operation when no input fence
> >> is provided. Let say I do :
> >>
> >> VM_BIND (out_fence=fence1)
> >>
> >> VM_BIND (out_fence=fence2)
> >>
> >> VM_BIND (out_fence=fence3)
> >>
> >>
> >> In what order are the fences going to be signaled?
> >>
> >> In the order of VM_BIND ioctls? Or out of order?
> >>
> >> Because you wrote "serialized" I assume it's: in order
> >>
> >>
> >> One thing I didn't realize is that because we only get one "VM_BIND" engine,
> >> there is a disconnect from the Vulkan specification.

Note that in Vulkan not every queue has to support sparse binding, so
one could consider a dedicated sparse binding only queue family.

> >>
> >> In Vulkan VM_BIND operations are serialized but per engine.
> >>
> >> So you could have something like this :
> >>
> >> VM_BIND (engine=rcs0, in_fence=fence1, out_fence=fence2)
> >>
> >> VM_BIND (engine=ccs0, in_fence=fence3, out_fence=fence4)
> >>
> > Question - let's say this done after the above operations:
> >
> > EXEC (engine=ccs0, in_fence=NULL, out_fence=NULL)
> >
> > Is the exec ordered with respected to bind (i.e. would fence3 & 4 be
> > signaled before the exec starts)?
> >
> > Matt
>
>
> Hi Matt,
>
>  From the vulkan point of view, everything is serialized within an
> engine (we map that to a VkQueue).
>
> So with :
>
> EXEC (engine=ccs0, in_fence=NULL, out_fence=NULL)
> VM_BIND (engine=ccs0, in_fence=fence3, out_fence=fence4)
>
> EXEC completes first then VM_BIND executes.
>
>
> To be even clearer :
>
> EXEC (engine=ccs0, in_fence=fence2, out_fence=NULL)
> VM_BIND (engine=ccs0, in_fence=fence3, out_fence=fence4)
>
>
> EXEC will wait until fence2 is signaled.
> Once fence2 is signaled, EXEC proceeds, finishes and only after it is done, VM_BIND executes.
>
> It would kind of like having the VM_BIND operation be another batch executed from the ringbuffer buffer.
>
> -Lionel
>
>
> >
> >> fence1 is not signaled
> >>
> >> fence3 is signaled
> >>
> >> So the second VM_BIND will proceed before the first VM_BIND.
> >>
> >>
> >> I guess we can deal with that scenario in userspace by doing the wait
> >> ourselves in one thread per engines.
> >>
> >> But then it makes the VM_BIND input fences useless.

I posed the same question on my series for AMD
(https://patchwork.freedesktop.org/series/104578/), albeit for
slightly different reasons: if one creates a new VkMemory object, you
generally want that mapped ASAP, as you can't track (in a
VK_KHR_descriptor_indexing world) whether the next submit is going to
use this VkMemory object, and hence you have to assume the worst (i.e. wait
till the map/bind is complete before executing the next submission).
If all binds/unbinds (or maps/unmaps) happen in-order, that means an
operation with input fences could delay stuff we want ASAP.

Of course waiting in userspace does have disadvantages:

1) more overhead between fence signalling and the operation,
potentially causing slightly bigger GPU bubbles.
2) You can't get an out fence early. Within the driver we can mostly
work around this but sync_fd exports, WSI and such will be messy.
3) moving the queue to a thread might make things slightly less ideal
due to scheduling delays.

Removing the in-order working in the kernel generally seems like
madness to me, as it is very hard to keep track of the state of the
virtual address space (to e.g. track unmapping stuff before freeing
memory, or moving memory around).

the one game I tried (FH5 over vkd3d-proton) does sparse mapping as follows:

separate queue:
1) 0 cmdbuffer submit with 0 input semaphores and 1 output semaphore
2) sparse bind with input semaphore from 1 and 1 output semaphore
3) 0 cmdbuffer submit with input semaphore from 2 and 1 output fence
4) wait on that fence on the CPU

which works very well if we just wait for the sparse bind input
semaphore in userspace, but I'm still working on seeing if this is the
common usecase or an outlier.
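
In Vulkan terms, that sequence looks roughly like the sketch below (sem1,
sem2, fence, sparse_queue and buffer_bind are assumed to be created/filled
elsewhere; this is an illustration, not code from vkd3d-proton):

	/* 1) empty submit that only signals sem1 */
	VkSubmitInfo submit1 = {
		.sType = VK_STRUCTURE_TYPE_SUBMIT_INFO,
		.signalSemaphoreCount = 1,
		.pSignalSemaphores = &sem1,
	};
	vkQueueSubmit(sparse_queue, 1, &submit1, VK_NULL_HANDLE);

	/* 2) sparse bind: waits on sem1, signals sem2 */
	VkBindSparseInfo bind = {
		.sType = VK_STRUCTURE_TYPE_BIND_SPARSE_INFO,
		.waitSemaphoreCount = 1,
		.pWaitSemaphores = &sem1,
		.bufferBindCount = 1,
		.pBufferBinds = &buffer_bind,
		.signalSemaphoreCount = 1,
		.pSignalSemaphores = &sem2,
	};
	vkQueueBindSparse(sparse_queue, 1, &bind, VK_NULL_HANDLE);

	/* 3) empty submit: waits on sem2, signals the out fence */
	VkPipelineStageFlags top_of_pipe = VK_PIPELINE_STAGE_TOP_OF_PIPE_BIT;
	VkSubmitInfo submit2 = {
		.sType = VK_STRUCTURE_TYPE_SUBMIT_INFO,
		.waitSemaphoreCount = 1,
		.pWaitSemaphores = &sem2,
		.pWaitDstStageMask = &top_of_pipe,
	};
	vkQueueSubmit(sparse_queue, 1, &submit2, fence);

	/* 4) CPU wait on that fence */
	vkWaitForFences(device, 1, &fence, VK_TRUE, UINT64_MAX);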



> >>
> >>
> >> Daniel : what do you think? Should be rework this or just deal with wait
> >> fences in userspace?
> >>
> >>
> >> Sorry I noticed this late.
> >>
> >>
> >> -Lionel
> >>
> >>
>

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [Intel-gfx] [RFC v3 1/3] drm/doc/rfc: VM_BIND feature design document
  2022-06-02 16:22           ` Matthew Brost
@ 2022-06-02 20:24             ` Niranjana Vishwanathapura
  -1 siblings, 0 replies; 121+ messages in thread
From: Niranjana Vishwanathapura @ 2022-06-02 20:24 UTC (permalink / raw)
  To: Matthew Brost
  Cc: chris.p.wilson, intel-gfx, dri-devel, thomas.hellstrom,
	Lionel Landwerlin, daniel.vetter, christian.koenig

On Thu, Jun 02, 2022 at 09:22:46AM -0700, Matthew Brost wrote:
>On Thu, Jun 02, 2022 at 08:42:13AM +0300, Lionel Landwerlin wrote:
>> On 02/06/2022 00:18, Matthew Brost wrote:
>> > On Wed, Jun 01, 2022 at 05:25:49PM +0300, Lionel Landwerlin wrote:
>> > > On 17/05/2022 21:32, Niranjana Vishwanathapura wrote:
>> > > > +VM_BIND/UNBIND ioctl will immediately start binding/unbinding the mapping in an
>> > > > +async worker. The binding and unbinding will work like a special GPU engine.
>> > > > +The binding and unbinding operations are serialized and will wait on specified
>> > > > +input fences before the operation and will signal the output fences upon the
>> > > > +completion of the operation. Due to serialization, completion of an operation
>> > > > +will also indicate that all previous operations are also complete.
>> > > I guess we should avoid saying "will immediately start binding/unbinding" if
>> > > there are fences involved.
>> > >
>> > > And the fact that it's happening in an async worker seems to imply it's not
>> > > immediate.
>> > >
>> > >
>> > > I have a question on the behavior of the bind operation when no input fence
>> > > is provided. Let's say I do:
>> > >
>> > > VM_BIND (out_fence=fence1)
>> > >
>> > > VM_BIND (out_fence=fence2)
>> > >
>> > > VM_BIND (out_fence=fence3)
>> > >
>> > >
>> > > In what order are the fences going to be signaled?
>> > >
>> > > In the order of VM_BIND ioctls? Or out of order?
>> > >
>> > > Because you wrote "serialized" I assume it's: in order
>> > >
>> > >
>> > > One thing I didn't realize is that because we only get one "VM_BIND" engine,
>> > > there is a disconnect from the Vulkan specification.
>> > >
>> > > In Vulkan VM_BIND operations are serialized but per engine.
>> > >
>> > > So you could have something like this :
>> > >
>> > > VM_BIND (engine=rcs0, in_fence=fence1, out_fence=fence2)
>> > >
>> > > VM_BIND (engine=ccs0, in_fence=fence3, out_fence=fence4)
>> > >
>> > Question - let's say this is done after the above operations:
>> >
>> > EXEC (engine=ccs0, in_fence=NULL, out_fence=NULL)
>> >
>> > Is the exec ordered with respect to the binds (i.e. would fence3 & 4 be
>> > signaled before the exec starts)?
>> >
>> > Matt
>>
>>
>> Hi Matt,
>>
>> From the Vulkan point of view, everything is serialized within an engine (we
>> map that to a VkQueue).
>>
>> So with:
>>
>> EXEC (engine=ccs0, in_fence=NULL, out_fence=NULL)
>> VM_BIND (engine=ccs0, in_fence=fence3, out_fence=fence4)
>>
>> EXEC completes first then VM_BIND executes.
>>
>>
>> To be even clearer:
>>
>> EXEC (engine=ccs0, in_fence=fence2, out_fence=NULL)
>> VM_BIND (engine=ccs0, in_fence=fence3, out_fence=fence4)
>>
>>
>> EXEC will wait until fence2 is signaled.
>> Once fence2 is signaled, EXEC proceeds and finishes, and only after it is done does VM_BIND execute.
>>
>> It would be kind of like having the VM_BIND operation be another batch executed from the ring buffer.
>>
>
>Yea this makes sense. I think of VM_BINDs as more or less just another
>version of an EXEC and this fits with that.
>

Note that VM_BIND itself can bind while an EXEC (GPU job) is running
(say, getting binds ready for the next submission). It is up to the
user, though, how to use it.
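
As a sketch of that pipelined usage (submit_execbuf() and vm_bind() are
hypothetical wrappers over the proposed ioctls, not real uapi; the syncobj
call is the real libdrm one):

#include <xf86drm.h>

uint32_t bind_done;
drmSyncobjCreate(fd, 0, &bind_done);

/* Kick off GPU job N. */
submit_execbuf(fd, vm_id, batch_n, /* in_syncobj */ 0, /* out_syncobj */ 0);

/* While job N runs, bind what job N+1 needs. No in-fence, so the async
 * bind worker can start immediately, in parallel with job N. */
vm_bind(fd, vm_id, bo, obj_offset, va, size,
        /* in_syncobj */ 0, /* out_syncobj */ bind_done);

/* Job N+1 waits only on the bind, not on job N's completion. */
submit_execbuf(fd, vm_id, batch_n1, /* in_syncobj */ bind_done,
               /* out_syncobj */ 0);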

>In practice I don't think we can share a ring but we should be able to
>present an engine (again likely a gem context in i915) to the user that
>orders VM_BINDs / EXECs if that is what Vulkan expects, at least I think.
>

I have responded in the other thread on this.

Niranjana

>Hopefully Niranjana + Daniel agree.
>
>Matt
>
>> -Lionel
>>
>>
>> >
>> > > fence1 is not signaled
>> > >
>> > > fence3 is signaled
>> > >
>> > > So the second VM_BIND will proceed before the first VM_BIND.
>> > >
>> > >
>> > > I guess we can deal with that scenario in userspace by doing the wait
>> > > ourselves in one thread per engine.
>> > >
>> > > But then it makes the VM_BIND input fences useless.
>> > >
>> > >
>> > > Daniel: what do you think? Should we rework this or just deal with wait
>> > > fences in userspace?
>> > >
>> > >
>> > > Sorry I noticed this late.
>> > >
>> > >
>> > > -Lionel
>> > >
>> > >
>>

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [Intel-gfx] [RFC v3 1/3] drm/doc/rfc: VM_BIND feature design document
  2022-06-02 20:11         ` Niranjana Vishwanathapura
@ 2022-06-02 20:35           ` Jason Ekstrand
  -1 siblings, 0 replies; 121+ messages in thread
From: Jason Ekstrand @ 2022-06-02 20:35 UTC (permalink / raw)
  To: Niranjana Vishwanathapura
  Cc: Matthew Brost, Intel GFX, Maling list - DRI developers,
	Thomas Hellstrom, Chris Wilson, Daniel Vetter,
	Christian König

On Thu, Jun 2, 2022 at 3:11 PM Niranjana Vishwanathapura <
niranjana.vishwanathapura@intel.com> wrote:

> On Wed, Jun 01, 2022 at 01:28:36PM -0700, Matthew Brost wrote:
> >On Wed, Jun 01, 2022 at 05:25:49PM +0300, Lionel Landwerlin wrote:
> >> On 17/05/2022 21:32, Niranjana Vishwanathapura wrote:
> >> > +VM_BIND/UNBIND ioctl will immediately start binding/unbinding the mapping in an
> >> > +async worker. The binding and unbinding will work like a special GPU engine.
> >> > +The binding and unbinding operations are serialized and will wait on specified
> >> > +input fences before the operation and will signal the output fences upon the
> >> > +completion of the operation. Due to serialization, completion of an operation
> >> > +will also indicate that all previous operations are also complete.
> >>
> >> I guess we should avoid saying "will immediately start binding/unbinding" if
> >> there are fences involved.
> >>
> >> And the fact that it's happening in an async worker seems to imply it's not
> >> immediate.
> >>
>
Ok, will fix.
This was added because in the earlier design binding was deferred until the
next execbuff. But now it is non-deferred (immediate in that sense). But yah,
this is confusing and will fix it.
>
> >>
> >> I have a question on the behavior of the bind operation when no input fence
> >> is provided. Let's say I do:
> >>
> >> VM_BIND (out_fence=fence1)
> >>
> >> VM_BIND (out_fence=fence2)
> >>
> >> VM_BIND (out_fence=fence3)
> >>
> >>
> >> In what order are the fences going to be signaled?
> >>
> >> In the order of VM_BIND ioctls? Or out of order?
> >>
> >> Because you wrote "serialized" I assume it's: in order
> >>
>
> Yes, in the order of VM_BIND/UNBIND ioctls. Note that bind and unbind will
> use the same queue and hence are ordered.
>
> >>
> >> One thing I didn't realize is that because we only get one "VM_BIND" engine,
> >> there is a disconnect from the Vulkan specification.
> >>
> >> In Vulkan VM_BIND operations are serialized but per engine.
> >>
> >> So you could have something like this :
> >>
> >> VM_BIND (engine=rcs0, in_fence=fence1, out_fence=fence2)
> >>
> >> VM_BIND (engine=ccs0, in_fence=fence3, out_fence=fence4)
> >>
> >>
> >> fence1 is not signaled
> >>
> >> fence3 is signaled
> >>
> >> So the second VM_BIND will proceed before the first VM_BIND.
> >>
> >>
> >> I guess we can deal with that scenario in userspace by doing the wait
> >> ourselves in one thread per engine.
> >>
> >> But then it makes the VM_BIND input fences useless.
> >>
> >>
> >> Daniel: what do you think? Should we rework this or just deal with wait
> >> fences in userspace?
> >>
> >
> >My opinion is to rework this, but make the ordering via an engine param
> >optional.
> >
> >e.g. A VM can be configured so all binds are ordered within the VM
> >
> >e.g. A VM can be configured so all binds accept an engine argument (in
> >the case of the i915 likely this is a gem context handle) and binds are
> >ordered with respect to that engine.
> >
> >This gives UMDs options, as the latter likely consumes more KMD resources,
> >so if a different UMD can live with binds being ordered within the VM
> >they can use a mode consuming less resources.
> >
>
> I think we need to be careful here if we are looking for some out of
> (submission) order completion of vm_bind/unbind.
> In-order completion means, in a batch of binds and unbinds to be
> completed in-order, user only needs to specify in-fence for the
> first bind/unbind call and the out-fence for the last bind/unbind
> call. Also, the VA released by an unbind call can be re-used by
> any subsequent bind call in that in-order batch.
>
> These things will break if binding/unbinding were to be allowed to
> go out of order (of submission), and the user would need to be extra
> careful not to run into premature triggering of the out-fence, binds
> failing as the VA is still in use, etc.
>
> Also, VM_BIND binds the provided mapping on the specified address space
> (VM). So, the uapi is not engine/context specific.
>
> We can however add a 'queue' to the uapi which can be one from the
> pre-defined queues,
> I915_VM_BIND_QUEUE_0
> I915_VM_BIND_QUEUE_1
> ...
> I915_VM_BIND_QUEUE_(N-1)
>
> KMD will spawn an async work queue for each queue which will only
> bind the mappings on that queue in the order of submission.
> User can assign the queue to per engine or anything like that.
>
> But again here, the user needs to be careful not to deadlock these
> queues with circular dependencies of fences.
>
> I prefer adding this later as an extension based on whether it
> is really helping with the implementation.
>

I can tell you right now that having everything on a single in-order queue
will not get us the perf we want.  What vulkan really wants is one of two
things:

 1. No implicit ordering of VM_BIND ops.  They just happen in whatever
order their dependencies are resolved, and we ensure ordering ourselves by
having a syncobj in the VkQueue.

 2. The ability to create multiple VM_BIND queues.  We need at least 2 but
I don't see why there needs to be a limit besides the limits the i915 API
already has on the number of engines.  Vulkan could expose multiple sparse
binding queues to the client if it's not arbitrarily limited.

Why?  Because Vulkan has two basic kinds of bind operations and we don't
want any dependencies between them:

 1. Immediate.  These happen right after BO creation or maybe as part of
vkBindImageMemory() or vkBindBufferMemory().  These don't happen on a queue
and we don't want them serialized with anything.  To synchronize with
submit, we'll have a syncobj in the VkDevice which is signaled by all
immediate bind operations and make submits wait on it.

 2. Queued (sparse): These happen on a VkQueue which may be the same as a
render/compute queue or may be its own queue.  It's up to us what we want
to advertise.  From the Vulkan API PoV, this is like any other queue.
Operations on it wait on and signal semaphores.  If we have a VM_BIND
engine, we'd provide syncobjs to wait and signal just like we do in
execbuf().

The important thing is that we don't want one type of operation to block on
the other.  If immediate binds are blocking on sparse binds, it's going to
cause over-synchronization issues.
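
If the uapi grew multiple bind queues per VM, as floated earlier in this
thread, the split could look something like the sketch below (the queue ids
and the vm_bind() wrapper are hypothetical, not actual i915 uapi):

#define BIND_QUEUE_IMMEDIATE  0  /* vkBindBufferMemory()/vkBindImageMemory() */
#define BIND_QUEUE_SPARSE     1  /* vkQueueBindSparse() */

/* Immediate path: never pass an in-fence, so this queue can never get
 * stuck behind a sparse bind that is waiting on a long compute job. */
vm_bind(fd, vm_id, BIND_QUEUE_IMMEDIATE, &args,
        /* in_syncobj */ 0, /* out_syncobj */ device_bind_done);

/* Sparse path: in/out syncobjs map 1:1 to the VkQueue's wait and signal
 * semaphores, exactly as an execbuf's would. */
vm_bind(fd, vm_id, BIND_QUEUE_SPARSE, &args,
        queue_wait_syncobj, queue_signal_syncobj);

Submits would then wait on device_bind_done (the VkDevice-level syncobj
mentioned above) before touching newly bound memory.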

In terms of the internal implementation, I know that there's going to be a
lock on the VM and that we can't actually do these things in parallel.
That's fine.  Once the dma_fences have signaled and we're unblocked to do
the bind operation, I don't care if there's a bit of synchronization due to
locking.  That's expected.  What we can't afford to have is an immediate
bind operation suddenly blocking on a sparse operation which is blocked on
a compute job that's going to run for another 5ms.

For reference, Windows solves this by allowing arbitrarily many paging
queues (what they call a VM_BIND engine/queue).  That design works pretty
well and solves the problems in question.  Again, we could just make
everything out-of-order and require using syncobjs to order things as
userspace wants. That'd be fine too.
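
The fully out-of-order variant can be recovered in userspace with one
timeline per queue, along these lines (a sketch; vm_bind() is again a
hypothetical wrapper taking timeline wait/signal points, and we adopt the
convention that point 0 means "no wait"):

struct bind_queue {
        uint32_t timeline;      /* drm_syncobj (timeline) handle */
        uint64_t next_point;    /* starts at 1 */
};

static void queue_bind(int fd, struct bind_queue *q,
                       const struct bind_args *args)
{
        uint64_t wait = q->next_point - 1;      /* previous bind on this queue */
        uint64_t signal = q->next_point++;

        /* The kernel is free to execute binds in any order; chaining
         * timeline points recreates per-queue ordering without serializing
         * independent queues against each other. */
        vm_bind(fd, args, q->timeline, wait, q->timeline, signal);
}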

One more note while I'm here: danvet said something on IRC about VM_BIND
queues waiting for syncobjs to materialize.  We don't really want/need
this.  We already have all the machinery in userspace to handle
wait-before-signal and waiting for syncobj fences to materialize and that
machinery is on by default.  It would actually take MORE work in Mesa to
turn it off and take advantage of the kernel being able to wait for
syncobjs to materialize.  Also, getting that right is ridiculously hard and
I really don't want to get it wrong in kernel space.  When we do memory
fences, wait-before-signal will be a thing.  We don't need to try and make
it a thing for syncobj.

--Jason


> Daniel, any thoughts?
>
> Niranjana
>
> >Matt
> >
> >>
> >> Sorry I noticed this late.
> >>
> >>
> >> -Lionel
> >>
> >>
>

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [RFC v3 1/3] drm/doc/rfc: VM_BIND feature design document
  2022-06-02  2:13     ` [Intel-gfx] " Zeng, Oak
@ 2022-06-02 20:48       ` Niranjana Vishwanathapura
  -1 siblings, 0 replies; 121+ messages in thread
From: Niranjana Vishwanathapura @ 2022-06-02 20:48 UTC (permalink / raw)
  To: Zeng, Oak
  Cc: Brost, Matthew, intel-gfx, dri-devel, Hellstrom, Thomas, Wilson,
	Chris P, jason, Vetter, Daniel, christian.koenig

On Wed, Jun 01, 2022 at 07:13:16PM -0700, Zeng, Oak wrote:
>
>
>Regards,
>Oak
>
>> -----Original Message-----
>> From: dri-devel <dri-devel-bounces@lists.freedesktop.org> On Behalf Of
>> Niranjana Vishwanathapura
>> Sent: May 17, 2022 2:32 PM
>> To: intel-gfx@lists.freedesktop.org; dri-devel@lists.freedesktop.org; Vetter,
>> Daniel <daniel.vetter@intel.com>
>> Cc: Brost, Matthew <matthew.brost@intel.com>; Hellstrom, Thomas
>> <thomas.hellstrom@intel.com>; jason@jlekstrand.net; Wilson, Chris P
>> <chris.p.wilson@intel.com>; christian.koenig@amd.com
>> Subject: [RFC v3 1/3] drm/doc/rfc: VM_BIND feature design document
>>
>> VM_BIND design document with description of intended use cases.
>>
>> v2: Add more documentation and format as per review comments
>>     from Daniel.
>>
>> Signed-off-by: Niranjana Vishwanathapura
>> <niranjana.vishwanathapura@intel.com>
>> ---
>>  Documentation/driver-api/dma-buf.rst   |   2 +
>>  Documentation/gpu/rfc/i915_vm_bind.rst | 304 +++++++++++++++++++++++++
>>  Documentation/gpu/rfc/index.rst        |   4 +
>>  3 files changed, 310 insertions(+)
>>  create mode 100644 Documentation/gpu/rfc/i915_vm_bind.rst
>>
>> diff --git a/Documentation/driver-api/dma-buf.rst b/Documentation/driver-api/dma-buf.rst
>> index 36a76cbe9095..64cb924ec5bb 100644
>> --- a/Documentation/driver-api/dma-buf.rst
>> +++ b/Documentation/driver-api/dma-buf.rst
>> @@ -200,6 +200,8 @@ DMA Fence uABI/Sync File
>>  .. kernel-doc:: include/linux/sync_file.h
>>     :internal:
>>
>> +.. _indefinite_dma_fences:
>> +
>>  Indefinite DMA Fences
>>  ~~~~~~~~~~~~~~~~~~~~~
>>
>> diff --git a/Documentation/gpu/rfc/i915_vm_bind.rst b/Documentation/gpu/rfc/i915_vm_bind.rst
>> new file mode 100644
>> index 000000000000..f1be560d313c
>> --- /dev/null
>> +++ b/Documentation/gpu/rfc/i915_vm_bind.rst
>> @@ -0,0 +1,304 @@
>> +==========================================
>> +I915 VM_BIND feature design and use cases
>> +==========================================
>> +
>> +VM_BIND feature
>> +================
>> +DRM_I915_GEM_VM_BIND/UNBIND ioctls allow UMD to bind/unbind GEM buffer
>> +objects (BOs) or sections of a BO at specified GPU virtual addresses on a
>> +specified address space (VM). These mappings (also referred to as persistent
>> +mappings) will be persistent across multiple GPU submissions (execbuff calls)
>> +issued by the UMD, without user having to provide a list of all required
>> +mappings during each submission (as required by older execbuff mode).
>> +
>> +VM_BIND/UNBIND ioctls will support 'in' and 'out' fences to allow userspace
>> +to specify how the binding/unbinding should sync with other operations
>> +like the GPU job submission. These fences will be timeline 'drm_syncobj's
>> +for non-Compute contexts (See struct drm_i915_vm_bind_ext_timeline_fences).
>> +For Compute contexts, they will be user/memory fences (See struct
>> +drm_i915_vm_bind_ext_user_fence).
>> +
>> +VM_BIND feature is advertised to user via I915_PARAM_HAS_VM_BIND.
>> +User has to opt-in for VM_BIND mode of binding for an address space (VM)
>> +during VM creation time via I915_VM_CREATE_FLAGS_USE_VM_BIND extension.
>> +
>> +VM_BIND/UNBIND ioctl will immediately start binding/unbinding the mapping in an
>> +async worker. The binding and unbinding will work like a special GPU engine.
>> +The binding and unbinding operations are serialized and will wait on specified
>> +input fences before the operation and will signal the output fences upon the
>> +completion of the operation. Due to serialization, completion of an operation
>> +will also indicate that all previous operations are also complete.
>
>Hi,
>
>Is the user required to wait for the out fence to be signaled before submitting a gpu job using the vm_bind address?
>Or is the user required to order the gpu job so that it runs after the vm_bind out fence is signaled?
>

Thanks Oak,
Either should be fine; it is up to the user how to use the vm_bind/unbind out-fence.
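
In other words, both of the following would be valid sketches
(vm_bind()/submit_execbuf() are hypothetical wrappers over the proposed
uapi; drmSyncobjWait() is the real libdrm call):

vm_bind(fd, vm_id, bo, va, size, /* out_syncobj */ bind_done);

/* Option A: wait on the CPU until the bind completes, then submit. */
drmSyncobjWait(fd, &bind_done, 1, INT64_MAX,
               DRM_SYNCOBJ_WAIT_FLAGS_WAIT_ALL, NULL);
submit_execbuf(fd, vm_id, batch, /* in_syncobj */ 0);

/* Option B: stay asynchronous and pass the out-fence as the job's
 * in-fence, letting the kernel order the job after the bind. */
submit_execbuf(fd, vm_id, batch, /* in_syncobj */ bind_done);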

>I think there could be different behavior on a non-faultable platform and a faultable platform, such as: on a non-faultable
>platform, the gpu job is required to be ordered after vm_bind out fence signaling; and on a faultable platform, there is no such
>restriction since vm bind can be finished in the fault handler?
>

With a GPU page fault handler, the out fence won't be needed, as residency is
purely managed by the page fault handler populating page tables (there is a
mention of it in the GPU Page Faults section below).

>Should we document such thing?
>

We don't talk much about the GPU page faults case in this document, as that
may warrant a separate RFC when we add page fault support. We did mention it
in a couple of places to ensure our locking design here is extensible to the
GPU page faults case.

Niranjana

>Regards,
>Oak
>
>
>> +
>> +VM_BIND features include:
>> +
>> +* Multiple Virtual Address (VA) mappings can map to the same physical pages
>> +  of an object (aliasing).
>> +* VA mapping can map to a partial section of the BO (partial binding).
>> +* Support capture of persistent mappings in the dump upon GPU error.
>> +* TLB is flushed upon unbind completion. Batching of TLB flushes in some
>> +  use cases will be helpful.
>> +* Asynchronous vm_bind and vm_unbind support with 'in' and 'out' fences.
>> +* Support for userptr gem objects (no special uapi is required for this).
>> +
>> +Execbuff ioctl in VM_BIND mode
>> +-------------------------------
>> +The execbuff ioctl handling in VM_BIND mode differs significantly from the
>> +older method. A VM in VM_BIND mode will not support older execbuff mode of
>> +binding. In VM_BIND mode, execbuff ioctl will not accept any execlist. Hence,
>> +no support for implicit sync. It is expected that the below work will be able
>> +to support requirements of object dependency setting in all use cases:
>> +
>> +"dma-buf: Add an API for exporting sync files"
>> +(https://lwn.net/Articles/859290/)
>> +
>> +This also means we need an execbuff extension to pass in the batch
>> +buffer addresses (See struct drm_i915_gem_execbuffer_ext_batch_addresses).
>> +
>> +If at all execlist support in execbuff ioctl is deemed necessary for
>> +implicit sync in certain use cases, then support can be added later.
>> +
>> +In VM_BIND mode, VA allocation is completely managed by the user instead of
>> +the i915 driver. Hence, VA assignment and eviction are not applicable in
>> +VM_BIND mode. Also, for determining object activeness, VM_BIND mode will not
>> +be using the i915_vma active reference tracking. It will instead use dma-resv
>> +object for that (See `VM_BIND dma_resv usage`_).
>> +
>> +So, a lot of existing code in the execbuff path like relocations, VA evictions,
>> +vma lookup table, implicit sync, vma active reference tracking etc., are not
>> +applicable in VM_BIND mode. Hence, the execbuff path needs to be cleaned up
>> +by clearly separating out the functionalities where the VM_BIND mode differs
>> +from older method and they should be moved to separate files.
>> +
>> +VM_PRIVATE objects
>> +-------------------
>> +By default, BOs can be mapped on multiple VMs and can also be dma-buf
>> +exported. Hence these BOs are referred to as Shared BOs.
>> +During each execbuff submission, the request fence must be added to the
>> +dma-resv fence list of all shared BOs mapped on the VM.
>> +
>> +VM_BIND feature introduces an optimization where user can create BO which
>> +is private to a specified VM via I915_GEM_CREATE_EXT_VM_PRIVATE flag during
>> +BO creation. Unlike Shared BOs, these VM private BOs can only be mapped on
>> +the VM they are private to and can't be dma-buf exported.
>> +All private BOs of a VM share the dma-resv object. Hence during each execbuff
>> +submission, they need only one dma-resv fence list updated. Thus, the fast
>> +path (where required mappings are already bound) submission latency is O(1)
>> +w.r.t the number of VM private BOs.
>> +
>> +VM_BIND locking hierarchy
>> +-------------------------
>> +The locking design here supports the older (execlist based) execbuff mode, the
>> +newer VM_BIND mode, the VM_BIND mode with GPU page faults and possible future
>> +system allocator support (See `Shared Virtual Memory (SVM) support`_).
>> +The older execbuff mode and the newer VM_BIND mode without page faults manage
>> +residency of backing storage using dma_fence. The VM_BIND mode with page faults
>> +and the system allocator support do not use any dma_fence at all.
>> +
>> +VM_BIND locking order is as below.
>> +
>> +1) Lock-A: A vm_bind mutex will protect vm_bind lists. This lock is taken in
>> +   vm_bind/vm_unbind ioctl calls, in the execbuff path and while releasing the
>> +   mapping.
>> +
>> +   In future, when GPU page faults are supported, we can potentially use a
>> +   rwsem instead, so that multiple page fault handlers can take the read side
>> +   lock to lookup the mapping and hence can run in parallel.
>> +   The older execbuff mode of binding does not need this lock.
>> +
>> +2) Lock-B: The object's dma-resv lock will protect i915_vma state and needs to
>> +   be held while binding/unbinding a vma in the async worker and while updating
>> +   dma-resv fence list of an object. Note that private BOs of a VM will all
>> +   share a dma-resv object.
>> +
>> +   The future system allocator support will use the HMM prescribed locking
>> +   instead.
>> +
>> +3) Lock-C: Spinlock/s to protect some of the VM's lists like the list of
>> +   invalidated vmas (due to eviction and userptr invalidation) etc.
>> +
>> +When GPU page faults are supported, the execbuff path does not take any of these
>> +locks. There we will simply smash the new batch buffer address into the ring and
>> +then tell the scheduler to run that. The lock taking only happens from the page
>> +fault handler, where we take lock-A in read mode, whichever lock-B we need to
>> +find the backing storage (dma_resv lock for gem objects, and hmm/core mm for
>> +system allocator) and some additional locks (lock-D) for taking care of page
>> +table races. Page fault mode should not need to ever manipulate the vm lists,
>> +so won't ever need lock-C.
>> +
>> +VM_BIND LRU handling
>> +---------------------
>> +We need to ensure VM_BIND mapped objects are properly LRU tagged to avoid
>> +performance degradation. We will also need support for bulk LRU movement of
>> +VM_BIND objects to avoid additional latencies in execbuff path.
>> +
>> +The page table pages are similar to VM_BIND mapped objects (See
>> +`Evictable page table allocations`_) and are maintained per VM and need to
>> +be pinned in memory when the VM is made active (i.e., upon an execbuff call with
>> +that VM). So, bulk LRU movement of page table pages is also needed.
>> +
>> +The i915 shrinker LRU has stopped being an LRU. So, it should also be moved
>> +over to the ttm LRU in some fashion to make sure we once again have a reasonable
>> +and consistent memory aging and reclaim architecture.
>> +
>> +VM_BIND dma_resv usage
>> +-----------------------
>> +Fences need to be added to all VM_BIND mapped objects. During each execbuff
>> +submission, they are added with DMA_RESV_USAGE_BOOKKEEP usage to prevent
>> +over sync (See enum dma_resv_usage). One can override it with either
>> +DMA_RESV_USAGE_READ or DMA_RESV_USAGE_WRITE usage during object dependency
>> +setting (either through explicit or implicit mechanism).
>> +
>> +When vm_bind is called for a non-private object while the VM is already
>> +active, the fences need to be copied from VM's shared dma-resv object
>> +(common to all private objects of the VM) to this non-private object.
>> +If this results in performance degradation, then some optimization will
>> +be needed here. This is not a problem for VM's private objects as they use
>> +shared dma-resv object which is always updated on each execbuff submission.
>> +
>> +Also, in VM_BIND mode, use dma-resv apis for determining object activeness
>> +(See dma_resv_test_signaled() and dma_resv_wait_timeout()) and do not use the
>> +older i915_vma active reference tracking which is deprecated. This should be
>> +easier to get it working with the current TTM backend. We can remove the
>> +i915_vma active reference tracking fully while supporting TTM backend for igfx.
>> +
>> +Evictable page table allocations
>> +---------------------------------
>> +Make pagetable allocations evictable and manage them similar to VM_BIND
>> +mapped objects. Page table pages are similar to persistent mappings of a
>> +VM (difference here are that the page table pages will not have an i915_vma
>> +structure and after swapping pages back in, parent page link needs to be
>> +updated).
>> +
>> +Mesa use case
>> +--------------
>> +VM_BIND can potentially reduce the CPU overhead in Mesa (both Vulkan and
>> Iris),
>> +hence improving performance of CPU-bound applications. It also allows us to
>> +implement Vulkan's Sparse Resources. With increasing GPU hardware
>> performance,
>> +reducing CPU overhead becomes more impactful.
>> +
>> +
>> +VM_BIND Compute support
>> +========================
>> +
>> +User/Memory Fence
>> +------------------
>> +The idea is to take a user specified virtual address and install an interrupt
>> +handler to wake up the current task when the memory location passes the user
>> +supplied filter. User/Memory fence is a <address, value> pair. To signal the
>> +user fence, the specified value will be written at the specified virtual address
>> +to wake up the waiting process. The user can wait on a user fence with the
>> +gem_wait_user_fence ioctl.
>> +
>> +It also allows the user to emit their own MI_FLUSH/PIPE_CONTROL notify
>> +interrupt within their batches after updating the value to have sub-batch
>> +precision on the wakeup. Each batch can signal a user fence to indicate
>> +the completion of next level batch. The completion of very first level batch
>> +needs to be signaled by the command streamer. The user must provide the
>> +user/memory fence for this via the DRM_I915_GEM_EXECBUFFER_EXT_USER_FENCE
>> +extension of execbuff ioctl, so that KMD can set up the command streamer to
>> +signal it.
>> +
>> +User/Memory fence can also be supplied to the kernel driver to signal/wake up
>> +the user process after completion of an asynchronous operation.
>> +
>> +When the VM_BIND ioctl is provided with a user/memory fence via the
>> +I915_VM_BIND_EXT_USER_FENCE extension, it will be signaled upon the completion
>> +of binding of that mapping. All async binds/unbinds are serialized, hence
>> +signaling of the user/memory fence also indicates the completion of all previous
>> +binds/unbinds.
>> +
>> +This feature will be derived from the below original work:
>> +https://patchwork.freedesktop.org/patch/349417/
>> +
>> +Long running Compute contexts
>> +------------------------------
>> +Usage of dma-fences expects that they complete in a reasonable amount of time.
>> +Compute on the other hand can be long running. Hence it is appropriate for
>> +compute to use user/memory fence and dma-fence usage will be limited to
>> +in-kernel consumption only. This requires an execbuff uapi extension to pass
>> +in a user fence (See struct drm_i915_vm_bind_ext_user_fence). Compute must opt-in
>> +for this mechanism with the I915_CONTEXT_CREATE_FLAGS_LONG_RUNNING flag during
>> +context creation. The dma-fence based user interfaces like gem_wait ioctl and
>> +execbuff out fence are not allowed on long running contexts. Implicit sync is
>> +not valid as well and is anyway not supported in VM_BIND mode.
>> +
>> +Where GPU page faults are not available, kernel driver upon buffer invalidation
>> +will initiate a suspend (preemption) of long running context with a dma-fence
>> +attached to it. And upon completion of that suspend fence, finish the
>> +invalidation, revalidate the BO and then resume the compute context. This is
>> +done by having a per-context preempt fence (also called suspend fence) proxying
>> +as i915_request fence. This suspend fence is enabled when someone tries to wait
>> +on it, which then triggers the context preemption.
>> +
>> +As this support for context suspension using a preempt fence and the resume work
>> +for the compute mode contexts can get tricky to get right, it is better to
>> +add this support in drm scheduler so that multiple drivers can make use of it.
>> +That means it will have a dependency on the i915 drm scheduler conversion with GuC
>> +scheduler backend. This should be fine, as the plan is to support compute mode
>> +contexts only with GuC scheduler backend (at least initially). This is much
>> +easier to support with VM_BIND mode compared to the current heavier execbuff
>> +path resource attachment.
>> +
>> +Low Latency Submission
>> +-----------------------
>> +Allows compute UMD to directly submit GPU jobs instead of through execbuff
>> +ioctl. This is made possible by VM_BIND not being synchronized against
>> +execbuff. VM_BIND allows bind/unbind of mappings required for the directly
>> +submitted jobs.
>> +
>> +Other VM_BIND use cases
>> +========================
>> +
>> +Debugger
>> +---------
>> +With the debug event interface, a user space process (the debugger) is able to
>> +keep track of and act upon resources created by another process (the debugged
>> +process) and attached to the GPU via the vm_bind interface.
>> +
>> +GPU page faults
>> +----------------
>> +GPU page faults, when supported (in future), will only be supported in the
>> +VM_BIND mode. While both the older execbuff mode and the newer VM_BIND mode of
>> +binding will require using dma-fence to ensure residency, the GPU page faults
>> +mode, when supported, will not use any dma-fence as residency is purely managed
>> +by installing and removing/invalidating page table entries.
>> +
>> +Page level hints settings
>> +--------------------------
>> +VM_BIND allows any hints setting per mapping instead of per BO.
>> +Possible hints include read-only mapping, placement and atomicity.
>> +Sub-BO level placement hint will be even more relevant with
>> +upcoming GPU on-demand page fault support.
>> +
>> +Page level Cache/CLOS settings
>> +-------------------------------
>> +VM_BIND allows cache/CLOS settings per mapping instead of per BO.
>> +
>> +Shared Virtual Memory (SVM) support
>> +------------------------------------
>> +VM_BIND interface can be used to map system memory directly (without gem BO
>> +abstraction) using the HMM interface. SVM is only supported with GPU page
>> +faults enabled.
>> +
>> +
>> +Broader i915 cleanups
>> +=====================
>> +Supporting this whole new vm_bind mode of binding which comes with its own
>> +use cases to support and the locking requirements requires proper integration
>> +with the existing i915 driver. This calls for some broader i915 driver
>> +cleanups/simplifications for maintainability of the driver going forward.
>> +Here are few things identified and are being looked into.
>> +
>> +- Remove vma lookup cache (eb->gem_context->handles_vma). VM_BIND feature
>> +  does not use it and the complexity it brings in is probably more than the
>> +  performance advantage we get in the legacy execbuff case.
>> +- Remove vma->open_count counting
>> +- Remove i915_vma active reference tracking. VM_BIND feature will not be using
>> +  it. Instead use the underlying BO's dma-resv fence list to determine if an
>> +  i915_vma is active or not.
>> +
>> +
>> +VM_BIND UAPI
>> +=============
>> +
>> +.. kernel-doc:: Documentation/gpu/rfc/i915_vm_bind.h
>> diff --git a/Documentation/gpu/rfc/index.rst b/Documentation/gpu/rfc/index.rst
>> index 91e93a705230..7d10c36b268d 100644
>> --- a/Documentation/gpu/rfc/index.rst
>> +++ b/Documentation/gpu/rfc/index.rst
>> @@ -23,3 +23,7 @@ host such documentation:
>>  .. toctree::
>>
>>      i915_scheduler.rst
>> +
>> +.. toctree::
>> +
>> +    i915_vm_bind.rst
>> --
>> 2.21.0.rc0.32.g243a4c7e27
>

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [Intel-gfx] [RFC v3 1/3] drm/doc/rfc: VM_BIND feature design document
@ 2022-06-02 20:48       ` Niranjana Vishwanathapura
  0 siblings, 0 replies; 121+ messages in thread
From: Niranjana Vishwanathapura @ 2022-06-02 20:48 UTC (permalink / raw)
  To: Zeng, Oak
  Cc: intel-gfx, dri-devel, Hellstrom, Thomas, Wilson, Chris P, Vetter,
	Daniel, christian.koenig

On Wed, Jun 01, 2022 at 07:13:16PM -0700, Zeng, Oak wrote:
>
>
>Regards,
>Oak
>
>> -----Original Message-----
>> From: dri-devel <dri-devel-bounces@lists.freedesktop.org> On Behalf Of
>> Niranjana Vishwanathapura
>> Sent: May 17, 2022 2:32 PM
>> To: intel-gfx@lists.freedesktop.org; dri-devel@lists.freedesktop.org; Vetter,
>> Daniel <daniel.vetter@intel.com>
>> Cc: Brost, Matthew <matthew.brost@intel.com>; Hellstrom, Thomas
>> <thomas.hellstrom@intel.com>; jason@jlekstrand.net; Wilson, Chris P
>> <chris.p.wilson@intel.com>; christian.koenig@amd.com
>> Subject: [RFC v3 1/3] drm/doc/rfc: VM_BIND feature design document
>>
>> VM_BIND design document with description of intended use cases.
>>
>> v2: Add more documentation and format as per review comments
>>     from Daniel.
>>
>> Signed-off-by: Niranjana Vishwanathapura
>> <niranjana.vishwanathapura@intel.com>
>> ---
>>  Documentation/driver-api/dma-buf.rst   |   2 +
>>  Documentation/gpu/rfc/i915_vm_bind.rst | 304
>> +++++++++++++++++++++++++
>>  Documentation/gpu/rfc/index.rst        |   4 +
>>  3 files changed, 310 insertions(+)
>>  create mode 100644 Documentation/gpu/rfc/i915_vm_bind.rst
>>
>> diff --git a/Documentation/driver-api/dma-buf.rst b/Documentation/driver-
>> api/dma-buf.rst
>> index 36a76cbe9095..64cb924ec5bb 100644
>> --- a/Documentation/driver-api/dma-buf.rst
>> +++ b/Documentation/driver-api/dma-buf.rst
>> @@ -200,6 +200,8 @@ DMA Fence uABI/Sync File
>>  .. kernel-doc:: include/linux/sync_file.h
>>     :internal:
>>
>> +.. _indefinite_dma_fences:
>> +
>>  Indefinite DMA Fences
>>  ~~~~~~~~~~~~~~~~~~~~~
>>
>> diff --git a/Documentation/gpu/rfc/i915_vm_bind.rst
>> b/Documentation/gpu/rfc/i915_vm_bind.rst
>> new file mode 100644
>> index 000000000000..f1be560d313c
>> --- /dev/null
>> +++ b/Documentation/gpu/rfc/i915_vm_bind.rst
>> @@ -0,0 +1,304 @@
>> +==========================================
>> +I915 VM_BIND feature design and use cases
>> +==========================================
>> +
>> +VM_BIND feature
>> +================
>> +DRM_I915_GEM_VM_BIND/UNBIND ioctls allows UMD to bind/unbind GEM
>> buffer
>> +objects (BOs) or sections of a BOs at specified GPU virtual addresses on a
>> +specified address space (VM). These mappings (also referred to as persistent
>> +mappings) will be persistent across multiple GPU submissions (execbuff calls)
>> +issued by the UMD, without user having to provide a list of all required
>> +mappings during each submission (as required by older execbuff mode).
>> +
>> +VM_BIND/UNBIND ioctls will support 'in' and 'out' fences to allow userpace
>> +to specify how the binding/unbinding should sync with other operations
>> +like the GPU job submission. These fences will be timeline 'drm_syncobj's
>> +for non-Compute contexts (See struct
>> drm_i915_vm_bind_ext_timeline_fences).
>> +For Compute contexts, they will be user/memory fences (See struct
>> +drm_i915_vm_bind_ext_user_fence).
>> +
>> +VM_BIND feature is advertised to user via I915_PARAM_HAS_VM_BIND.
>> +User has to opt-in for VM_BIND mode of binding for an address space (VM)
>> +during VM creation time via I915_VM_CREATE_FLAGS_USE_VM_BIND
>> extension.
>> +
>> +VM_BIND/UNBIND ioctl will immediately start binding/unbinding the mapping in
>> an
>> +async worker. The binding and unbinding will work like a special GPU engine.
>> +The binding and unbinding operations are serialized and will wait on specified
>> +input fences before the operation and will signal the output fences upon the
>> +completion of the operation. Due to serialization, completion of an operation
>> +will also indicate that all previous operations are also complete.
>
>Hi,
>
>Is user required to wait for the out fence be signaled before submit a gpu job using the vm_bind address?
>Or is user required to order the gpu job to make gpu job run after vm_bind out fence signaled?
>

Thanks Oak,
Either should be fine and up to user how to use vm_bind/unbind out-fence.

>I think there could be different behavior on a non-faultable platform and a faultable platform, such as on a non-faultable
>Platform, gpu job is required to be order after vm_bind out fence signaling; and on a faultable platform, there is no such
>Restriction since vm bind can be finished in the fault handler?
>

With GPU page faults handler, out fence won't be needed as residency is
purely managed by page fault handler populating page tables (there is a
mention of it in GPU Page Faults section below).

>Should we document such thing?
>

We don't talk much about GPU page faults case in this document as that may
warrent a separate rfc when we add page faults support. We did mention it
in couple places to ensure our locking design here is extensible to gpu
page faults case.

Niranjana

>Regards,
>Oak
>
>
>> +
>> +VM_BIND features include:
>> +
>> +* Multiple Virtual Address (VA) mappings can map to the same physical pages
>> +  of an object (aliasing).
>> +* VA mapping can map to a partial section of the BO (partial binding).
>> +* Support capture of persistent mappings in the dump upon GPU error.
>> +* TLB is flushed upon unbind completion. Batching of TLB flushes in some
>> +  use cases will be helpful.
>> +* Asynchronous vm_bind and vm_unbind support with 'in' and 'out' fences.
>> +* Support for userptr gem objects (no special uapi is required for this).
>> +
>> +Execbuff ioctl in VM_BIND mode
>> +-------------------------------
>> +The execbuff ioctl handling in VM_BIND mode differs significantly from the
>> +older method. A VM in VM_BIND mode will not support older execbuff mode of
>> +binding. In VM_BIND mode, execbuff ioctl will not accept any execlist. Hence,
>> +no support for implicit sync. It is expected that the below work will be able
>> +to support requirements of object dependency setting in all use cases:
>> +
>> +"dma-buf: Add an API for exporting sync files"
>> +(https://lwn.net/Articles/859290/)
>> +
>> +This also means, we need an execbuff extension to pass in the batch
>> +buffer addresses (See struct
>> drm_i915_gem_execbuffer_ext_batch_addresses).
>> +
>> +If at all execlist support in execbuff ioctl is deemed necessary for
>> +implicit sync in certain use cases, then support can be added later.
>> +
>> +In VM_BIND mode, VA allocation is completely managed by the user instead of
>> +the i915 driver. Hence all VA assignment, eviction are not applicable in
>> +VM_BIND mode. Also, for determining object activeness, VM_BIND mode will
>> not
>> +be using the i915_vma active reference tracking. It will instead use dma-resv
>> +object for that (See `VM_BIND dma_resv usage`_).
>> +
>> +So, a lot of existing code in the execbuff path like relocations, VA evictions,
>> +vma lookup table, implicit sync, vma active reference tracking etc., are not
>> +applicable in VM_BIND mode. Hence, the execbuff path needs to be cleaned up
>> +by clearly separating out the functionalities where the VM_BIND mode differs
>> +from older method and they should be moved to separate files.
>> +
>> +VM_PRIVATE objects
>> +-------------------
>> +By default, BOs can be mapped on multiple VMs and can also be dma-buf
>> +exported. Hence these BOs are referred to as Shared BOs.
>> +During each execbuff submission, the request fence must be added to the
>> +dma-resv fence list of all shared BOs mapped on the VM.
>> +
>> +VM_BIND feature introduces an optimization where user can create BO which
>> +is private to a specified VM via I915_GEM_CREATE_EXT_VM_PRIVATE flag
>> during
>> +BO creation. Unlike Shared BOs, these VM private BOs can only be mapped on
>> +the VM they are private to and can't be dma-buf exported.
>> +All private BOs of a VM share the dma-resv object. Hence during each execbuff
>> +submission, they need only one dma-resv fence list updated. Thus, the fast
>> +path (where required mappings are already bound) submission latency is O(1)
>> +w.r.t the number of VM private BOs.
>> +
>> +VM_BIND locking hirarchy
>> +-------------------------
>> +The locking design here supports the older (execlist based) execbuff mode, the
>> +newer VM_BIND mode, the VM_BIND mode with GPU page faults and possible
>> future
>> +system allocator support (See `Shared Virtual Memory (SVM) support`_).
>> +The older execbuff mode and the newer VM_BIND mode without page faults
>> manages
>> +residency of backing storage using dma_fence. The VM_BIND mode with page
>> faults
>> +and the system allocator support do not use any dma_fence at all.
>> +
>> +VM_BIND locking order is as below.
>> +
>> +1) Lock-A: A vm_bind mutex will protect vm_bind lists. This lock is taken in
>> +   vm_bind/vm_unbind ioctl calls, in the execbuff path and while releasing the
>> +   mapping.
>> +
>> +   In future, when GPU page faults are supported, we can potentially use a
>> +   rwsem instead, so that multiple page fault handlers can take the read side
>> +   lock to lookup the mapping and hence can run in parallel.
>> +   The older execbuff mode of binding do not need this lock.
>> +
>> +2) Lock-B: The object's dma-resv lock will protect i915_vma state and needs to
>> +   be held while binding/unbinding a vma in the async worker and while updating
>> +   dma-resv fence list of an object. Note that private BOs of a VM will all
>> +   share a dma-resv object.
>> +
>> +   The future system allocator support will use the HMM prescribed locking
>> +   instead.
>> +
>> +3) Lock-C: Spinlock/s to protect some of the VM's lists like the list of
>> +   invalidated vmas (due to eviction and userptr invalidation) etc.
>> +
>> +When GPU page faults are supported, the execbuff path do not take any of
>> these
>> +locks. There we will simply smash the new batch buffer address into the ring
>> and
>> +then tell the scheduler run that. The lock taking only happens from the page
>> +fault handler, where we take lock-A in read mode, whichever lock-B we need to
>> +find the backing storage (dma_resv lock for gem objects, and hmm/core mm for
>> +system allocator) and some additional locks (lock-D) for taking care of page
>> +table races. Page fault mode should not need to ever manipulate the vm lists,
>> +so won't ever need lock-C.
>> +
>> +VM_BIND LRU handling
>> +---------------------
>> +We need to ensure VM_BIND mapped objects are properly LRU tagged to avoid
>> +performance degradation. We will also need support for bulk LRU movement of
>> +VM_BIND objects to avoid additional latencies in execbuff path.
>> +
>> +The page table pages are similar to VM_BIND mapped objects (See
>> +`Evictable page table allocations`_) and are maintained per VM and needs to
>> +be pinned in memory when VM is made active (ie., upon an execbuff call with
>> +that VM). So, bulk LRU movement of page table pages is also needed.
>> +
>> +The i915 shrinker LRU has stopped being an LRU. So, it should also be moved
>> +over to the ttm LRU in some fashion to make sure we once again have a
>> reasonable
>> +and consistent memory aging and reclaim architecture.
>> +
>> +VM_BIND dma_resv usage
>> +-----------------------
>> +Fences needs to be added to all VM_BIND mapped objects. During each
>> execbuff
>> +submission, they are added with DMA_RESV_USAGE_BOOKKEEP usage to
>> prevent
>> +over sync (See enum dma_resv_usage). One can override it with either
>> +DMA_RESV_USAGE_READ or DMA_RESV_USAGE_WRITE usage during object
>> dependency
>> +setting (either through explicit or implicit mechanism).
>> +
>> +When vm_bind is called for a non-private object while the VM is already
>> +active, the fences need to be copied from the VM's shared dma-resv object
>> +(common to all private objects of the VM) to this non-private object.
>> +If this results in performance degradation, then some optimization will
>> +be needed here. This is not a problem for the VM's private objects as they
>> +use the shared dma-resv object, which is always updated on each execbuff
>> +submission.
>> +
>> +Also, in VM_BIND mode, use the dma-resv apis for determining object
>> +activeness (See dma_resv_test_signaled() and dma_resv_wait_timeout()) and do
>> +not use the older i915_vma active reference tracking, which is deprecated.
>> +This should make it easier to get working with the current TTM backend. We
>> +can remove the i915_vma active reference tracking fully while supporting the
>> +TTM backend for igfx.
>> +
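For illustration, the fence bookkeeping described above maps onto the
dma-resv API roughly as below (the dma-resv calls are real; the surrounding
object/request names are assumptions, and error handling is elided):

        int ret;

        dma_resv_lock(obj->base.resv, NULL);

        /* Make room for one more fence slot, then add the request fence
         * with BOOKKEEP usage so it creates no implicit sync dependency. */
        ret = dma_resv_reserve_fences(obj->base.resv, 1);
        if (!ret)
                dma_resv_add_fence(obj->base.resv, &rq->fence,
                                   DMA_RESV_USAGE_BOOKKEEP);

        dma_resv_unlock(obj->base.resv);
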
>> +Evictable page table allocations
>> +---------------------------------
>> +Make pagetable allocations evictable and manage them similar to VM_BIND
>> +mapped objects. Page table pages are similar to persistent mappings of a
>> +VM (the differences here are that the page table pages will not have an
>> +i915_vma structure and, after swapping pages back in, the parent page link
>> +needs to be updated).
>> +
>> +Mesa use case
>> +--------------
>> +VM_BIND can potentially reduce the CPU overhead in Mesa (both Vulkan and
>> +Iris), hence improving performance of CPU-bound applications. It also allows
>> +us to implement Vulkan's Sparse Resources. With increasing GPU hardware
>> +performance, reducing CPU overhead becomes more impactful.
>> +
>> +
>> +VM_BIND Compute support
>> +========================
>> +
>> +User/Memory Fence
>> +------------------
>> +The idea is to take a user specified virtual address and install an interrupt
>> +handler to wake up the current task when the memory location passes the user
>> +supplied filter. A user/memory fence is an <address, value> pair. To signal
>> +the user fence, the specified value will be written at the specified virtual
>> +address and the waiting process will be woken up. A user can wait on a user
>> +fence with the gem_wait_user_fence ioctl.
>> +
>> +It also allows the user to emit their own MI_FLUSH/PIPE_CONTROL notify
>> +interrupt within their batches after updating the value to have sub-batch
>> +precision on the wakeup. Each batch can signal a user fence to indicate
>> +the completion of the next level batch. The completion of the very first
>> +level batch needs to be signaled by the command streamer. The user must
>> +provide the user/memory fence for this via the
>> +DRM_I915_GEM_EXECBUFFER_EXT_USER_FENCE extension of the execbuff ioctl, so
>> +that the KMD can set up the command streamer to signal it.
>> +
>> +User/Memory fence can also be supplied to the kernel driver to signal/wake up
>> +the user process after completion of an asynchronous operation.
>> +
>> +When the VM_BIND ioctl is provided with a user/memory fence via the
>> +I915_VM_BIND_EXT_USER_FENCE extension, it will be signaled upon the
>> +completion of the binding of that mapping. All async binds/unbinds are
>> +serialized, hence the signaling of a user/memory fence also indicates the
>> +completion of all previous binds/unbinds.
>> +
>> +This feature will be derived from the below original work:
>> +https://patchwork.freedesktop.org/patch/349417/
>> +
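As a rough illustration of the <address, value> semantics (the struct and
function below are hypothetical, not the uapi from the referenced patch):

        /* Hypothetical user/memory fence pair. */
        struct example_user_fence {
                uint64_t *addr; /* user virtual address to be written */
                uint64_t value; /* value that marks the fence signaled */
        };

        /* Conceptually, signaling writes the value and wakes any waiter
         * (in the kernel this is an interrupt handler plus a filter). */
        static void example_signal(struct example_user_fence *f)
        {
                __atomic_store_n(f->addr, f->value, __ATOMIC_RELEASE);
                /* ... wake up the waiting task ... */
        }
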
>> +Long running Compute contexts
>> +------------------------------
>> +Usage of dma-fence expects that they complete in a reasonable amount of time.
>> +Compute on the other hand can be long running. Hence it is appropriate for
>> +compute to use user/memory fences, and dma-fence usage will be limited to
>> +in-kernel consumption only. This requires an execbuff uapi extension to pass
>> +in a user fence (See struct drm_i915_vm_bind_ext_user_fence). Compute must
>> +opt-in for this mechanism with the I915_CONTEXT_CREATE_FLAGS_LONG_RUNNING
>> +flag during context creation. The dma-fence based user interfaces like the
>> +gem_wait ioctl and execbuff out fence are not allowed on long running
>> +contexts. Implicit sync is not valid as well and is anyway not supported in
>> +VM_BIND mode.
>> +
>> +Where GPU page faults are not available, the kernel driver, upon buffer
>> +invalidation, will initiate a suspend (preemption) of the long running
>> +context with a dma-fence attached to it. Upon completion of that suspend
>> +fence, it will finish the invalidation, revalidate the BO and then resume the
>> +compute context. This is done by having a per-context preempt fence (also
>> +called a suspend fence) proxying as the i915_request fence. This suspend
>> +fence is enabled when someone tries to wait on it, which then triggers the
>> +context preemption.
>> +
>> +As this support for context suspension using a preempt fence and the resume
>> +work for the compute mode contexts can be tricky to get right, it is better
>> +to add this support in the drm scheduler so that multiple drivers can make
>> +use of it. That means it will have a dependency on the i915 drm scheduler
>> +conversion with the GuC scheduler backend. This should be fine, as the plan
>> +is to support compute mode contexts only with the GuC scheduler backend (at
>> +least initially). This is much easier to support with VM_BIND mode compared
>> +to the current heavier execbuff path resource attachment.
>> +
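A sketch of the userspace opt-in described above (the flag is the one
proposed in this document; where exactly it lands in the context-create uapi
is an assumption):

        /* Illustrative only; needs <xf86drm.h> and <drm/i915_drm.h>. */
        struct drm_i915_gem_context_create_ext create = {
                /* Hypothetical placement of the proposed flag: */
                .flags = I915_CONTEXT_CREATE_FLAGS_LONG_RUNNING,
        };

        if (drmIoctl(fd, DRM_IOCTL_I915_GEM_CONTEXT_CREATE_EXT, &create) == 0)
                ctx_id = create.ctx_id; /* long running capable context */
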
>> +Low Latency Submission
>> +-----------------------
>> +Allows the compute UMD to directly submit GPU jobs instead of going through
>> +the execbuff ioctl. This is made possible because VM_BIND is not synchronized
>> +against execbuff. VM_BIND allows bind/unbind of the mappings required for the
>> +directly submitted jobs.
>> +
>> +Other VM_BIND use cases
>> +========================
>> +
>> +Debugger
>> +---------
>> +With the debug event interface, a user space process (the debugger) is able
>> +to keep track of and act upon resources created by another process (the
>> +debuggee) and attached to the GPU via the vm_bind interface.
>> +
>> +GPU page faults
>> +----------------
>> +GPU page faults, when supported (in the future), will only be supported in
>> +the VM_BIND mode. While both the older execbuff mode and the newer VM_BIND
>> +mode of binding will require using dma-fence to ensure residency, the GPU
>> +page fault mode, when supported, will not use any dma-fence as residency is
>> +purely managed by installing and removing/invalidating page table entries.
>> +
>> +Page level hints settings
>> +--------------------------
>> +VM_BIND allows setting hints per mapping instead of per BO.
>> +Possible hints include read-only mapping, placement and atomicity.
>> +Sub-BO level placement hint will be even more relevant with
>> +upcoming GPU on-demand page fault support.
>> +
>> +Page level Cache/CLOS settings
>> +-------------------------------
>> +VM_BIND allows cache/CLOS settings per mapping instead of per BO.
>> +
>> +Shared Virtual Memory (SVM) support
>> +------------------------------------
>> +The VM_BIND interface can be used to map system memory directly (without
>> +the gem BO abstraction) using the HMM interface. SVM is only supported with
>> +GPU page faults enabled.
>> +
>> +
>> +Broader i915 cleanups
>> +=====================
>> +Supporting this whole new vm_bind mode of binding, which comes with its own
>> +use cases and locking requirements, requires proper integration with the
>> +existing i915 driver. This calls for some broader i915 driver
>> +cleanups/simplifications for maintainability of the driver going forward.
>> +Here are a few things that have been identified and are being looked into.
>> +
>> +- Remove the vma lookup cache (eb->gem_context->handles_vma). The VM_BIND
>> +  feature does not use it, and the complexity it brings in is probably more
>> +  than the performance advantage we get in the legacy execbuff case.
>> +- Remove vma->open_count counting.
>> +- Remove i915_vma active reference tracking. The VM_BIND feature will not be
>> +  using it. Instead use the underlying BO's dma-resv fence list to determine
>> +  if an i915_vma is active or not.
>> +
>> +
>> +VM_BIND UAPI
>> +=============
>> +
>> +.. kernel-doc:: Documentation/gpu/rfc/i915_vm_bind.h
>> diff --git a/Documentation/gpu/rfc/index.rst b/Documentation/gpu/rfc/index.rst
>> index 91e93a705230..7d10c36b268d 100644
>> --- a/Documentation/gpu/rfc/index.rst
>> +++ b/Documentation/gpu/rfc/index.rst
>> @@ -23,3 +23,7 @@ host such documentation:
>>  .. toctree::
>>
>>      i915_scheduler.rst
>> +
>> +.. toctree::
>> +
>> +    i915_vm_bind.rst
>> --
>> 2.21.0.rc0.32.g243a4c7e27
>

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [Intel-gfx] [RFC v3 3/3] drm/doc/rfc: VM_BIND uapi definition
  2022-06-02  5:08             ` Niranjana Vishwanathapura
  (?)
@ 2022-06-03  6:53             ` Niranjana Vishwanathapura
  2022-06-07 10:42               ` Tvrtko Ursulin
                                 ` (2 more replies)
  -1 siblings, 3 replies; 121+ messages in thread
From: Niranjana Vishwanathapura @ 2022-06-03  6:53 UTC (permalink / raw)
  To: Daniel Vetter
  Cc: Zanoni, Paulo R, intel-gfx, dri-devel, Hellstrom, Thomas, Wilson,
	Chris P, Vetter, Daniel, christian.koenig

On Wed, Jun 01, 2022 at 10:08:35PM -0700, Niranjana Vishwanathapura wrote:
>On Wed, Jun 01, 2022 at 11:27:17AM +0200, Daniel Vetter wrote:
>>On Wed, 1 Jun 2022 at 11:03, Dave Airlie <airlied@gmail.com> wrote:
>>>
>>>On Tue, 24 May 2022 at 05:20, Niranjana Vishwanathapura
>>><niranjana.vishwanathapura@intel.com> wrote:
>>>>
>>>> On Thu, May 19, 2022 at 04:07:30PM -0700, Zanoni, Paulo R wrote:
>>>> >On Tue, 2022-05-17 at 11:32 -0700, Niranjana Vishwanathapura wrote:
>>>> >> VM_BIND and related uapi definitions
>>>> >>
>>>> >> v2: Ensure proper kernel-doc formatting with cross references.
>>>> >>     Also add new uapi and documentation as per review comments
>>>> >>     from Daniel.
>>>> >>
>>>> >> Signed-off-by: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>
>>>> >> ---
>>>> >>  Documentation/gpu/rfc/i915_vm_bind.h | 399 +++++++++++++++++++++++++++
>>>> >>  1 file changed, 399 insertions(+)
>>>> >>  create mode 100644 Documentation/gpu/rfc/i915_vm_bind.h
>>>> >>
>>>> >> diff --git a/Documentation/gpu/rfc/i915_vm_bind.h b/Documentation/gpu/rfc/i915_vm_bind.h
>>>> >> new file mode 100644
>>>> >> index 000000000000..589c0a009107
>>>> >> --- /dev/null
>>>> >> +++ b/Documentation/gpu/rfc/i915_vm_bind.h
>>>> >> @@ -0,0 +1,399 @@
>>>> >> +/* SPDX-License-Identifier: MIT */
>>>> >> +/*
>>>> >> + * Copyright © 2022 Intel Corporation
>>>> >> + */
>>>> >> +
>>>> >> +/**
>>>> >> + * DOC: I915_PARAM_HAS_VM_BIND
>>>> >> + *
>>>> >> + * VM_BIND feature availability.
>>>> >> + * See typedef drm_i915_getparam_t param.
>>>> >> + */
>>>> >> +#define I915_PARAM_HAS_VM_BIND               57
>>>> >> +
>>>> >> +/**
>>>> >> + * DOC: I915_VM_CREATE_FLAGS_USE_VM_BIND
>>>> >> + *
>>>> >> + * Flag to opt-in for VM_BIND mode of binding during VM creation.
>>>> >> + * See struct drm_i915_gem_vm_control flags.
>>>> >> + *
>>>> >> + * A VM in VM_BIND mode will not support the older execbuff mode of binding.
>>>> >> + * In VM_BIND mode, execbuff ioctl will not accept any execlist (ie., the
>>>> >> + * &drm_i915_gem_execbuffer2.buffer_count must be 0).
>>>> >> + * Also, &drm_i915_gem_execbuffer2.batch_start_offset and
>>>> >> + * &drm_i915_gem_execbuffer2.batch_len must be 0.
>>>> >> + * DRM_I915_GEM_EXECBUFFER_EXT_BATCH_ADDRESSES extension must be provided
>>>> >> + * to pass in the batch buffer addresses.
>>>> >> + *
>>>> >> + * Additionally, I915_EXEC_NO_RELOC, I915_EXEC_HANDLE_LUT and
>>>> >> + * I915_EXEC_BATCH_FIRST of &drm_i915_gem_execbuffer2.flags must be 0
>>>> >> + * (not used) in VM_BIND mode. I915_EXEC_USE_EXTENSIONS flag must always be
>>>> >> + * set (See struct drm_i915_gem_execbuffer_ext_batch_addresses).
>>>> >> + * The buffers_ptr, buffer_count, batch_start_offset and batch_len fields
>>>> >> + * of struct drm_i915_gem_execbuffer2 are also not used and must be 0.
>>>> >> + */
>>>> >
>>>> >From that description, it seems we have:
>>>> >
>>>> >struct drm_i915_gem_execbuffer2 {
>>>> >        __u64 buffers_ptr;              -> must be 0 (new)
>>>> >        __u32 buffer_count;             -> must be 0 (new)
>>>> >        __u32 batch_start_offset;       -> must be 0 (new)
>>>> >        __u32 batch_len;                -> must be 0 (new)
>>>> >        __u32 DR1;                      -> must be 0 (old)
>>>> >        __u32 DR4;                      -> must be 0 (old)
>>>> >        __u32 num_cliprects; (fences)   -> must be 0 since using extensions
>>>> >        __u64 cliprects_ptr; (fences, extensions) -> contains an actual pointer!
>>>> >        __u64 flags;                    -> some flags must be 0 (new)
>>>> >        __u64 rsvd1; (context info)     -> repurposed field (old)
>>>> >        __u64 rsvd2;                    -> unused
>>>> >};
>>>> >
>>>> >Based on that, why can't we just get drm_i915_gem_execbuffer3 instead
>>>> >of adding even more complexity to an already abused interface? While
>>>> >the Vulkan-like extension thing is really nice, I don't think what
>>>> >we're doing here is extending the ioctl usage, we're completely
>>>> >changing how the base struct should be interpreted based on how the VM
>>>> >was created (which is an entirely different ioctl).
>>>> >
>>>> >From Rusty Russel's API Design grading, drm_i915_gem_execbuffer2 is
>>>> >already at -6 without these changes. I think after vm_bind we'll need
>>>> >to create a -11 entry just to deal with this ioctl.
>>>> >
>>>>
>>>> The only change here is removing the execlist support for VM_BIND
>>>> mode (other than natural extensions).
>>>> Adding a new execbuffer3 was considered, but I think we need to be careful
>>>> with that as that goes beyond the VM_BIND support, including any future
>>>> requirements (as we don't want an execbuffer4 after VM_BIND).
>>>
>>>Why not? it's not like adding extensions here is really that different
>>>than adding new ioctls.
>>>
>>>I definitely think this deserves an execbuffer3 without even
>>>considering future requirements. Just  to burn down the old
>>>requirements and pointless fields.
>>>
>>>Make execbuffer3 be vm bind only, no relocs, no legacy bits, leave the
>>>older sw on execbuf2 for ever.
>>
>>I guess another point in favour of execbuf3 would be that it's less
>>midlayer. If we share the entry point then there's quite a few vfuncs
>>needed to cleanly split out the vm_bind paths from the legacy
>>reloc/softpin paths.
>>
>>If we invert this and do execbuf3, then there's the existing ioctl
>>vfunc, and then we share code (where it even makes sense, probably
>>request setup/submit need to be shared, anything else is probably
>>cleaner to just copypaste) with the usual helper approach.
>>
>>Also that would guarantee that really none of the old concepts like
>>i915_active on the vma or vma open counts and all that stuff leaks
>>into the new vm_bind execbuf.
>>
>>Finally I also think that copypasting would make backporting easier,
>>or at least more flexible, since it should make it easier to have the
>>upstream vm_bind co-exist with all the other things we have. Without
>>huge amounts of conflicts (or at least much less) that pushing a pile
>>of vfuncs into the existing code would cause.
>>
>>So maybe we should do this?
>
>Thanks Dave, Daniel.
>There are a few things that will be common between execbuf2 and
>execbuf3, like request setup/submit (as you said), fence handling 
>(timeline fences, fence array, composite fences), engine selection,
>etc. Also, many of the 'flags' will be there in execbuf3 also (but
>bit position will differ).
>But I guess these should be fine as the suggestion here is to
>copy-paste the execbuff code and having a shared code where possible.
>Besides, we can stop supporting some older features in execbuff3
>(like fence array in favor of newer timeline fences), which will
>further reduce common code.
>
>Ok, I will update this series by adding execbuf3 and send out soon.
>

Does this sound reasonable?

struct drm_i915_gem_execbuffer3 {
        __u32 ctx_id;		/* previously execbuffer2.rsvd1 */

        __u32 batch_count;
        __u64 batch_addr_ptr;	/* Pointer to an array of batch gpu virtual addresses */

        __u64 flags;
#define I915_EXEC3_RING_MASK              (0x3f)
#define I915_EXEC3_DEFAULT                (0<<0)
#define I915_EXEC3_RENDER                 (1<<0)
#define I915_EXEC3_BSD                    (2<<0)
#define I915_EXEC3_BLT                    (3<<0)
#define I915_EXEC3_VEBOX                  (4<<0)

#define I915_EXEC3_SECURE               (1<<6)
#define I915_EXEC3_IS_PINNED            (1<<7)

#define I915_EXEC3_BSD_SHIFT     (8)
#define I915_EXEC3_BSD_MASK      (3 << I915_EXEC3_BSD_SHIFT)
#define I915_EXEC3_BSD_DEFAULT   (0 << I915_EXEC3_BSD_SHIFT)
#define I915_EXEC3_BSD_RING1     (1 << I915_EXEC3_BSD_SHIFT)
#define I915_EXEC3_BSD_RING2     (2 << I915_EXEC3_BSD_SHIFT)

#define I915_EXEC3_FENCE_IN             (1<<10)
#define I915_EXEC3_FENCE_OUT            (1<<11)
#define I915_EXEC3_FENCE_SUBMIT         (1<<12)

        __u64 in_out_fence;		/* previously execbuffer2.rsvd2 */

        __u64 extensions;		/* currently only for DRM_I915_GEM_EXECBUFFER_EXT_TIMELINE_FENCES */
};

With this, the user can pass in the batch addresses and count directly,
instead of as an extension (as this rfc series was proposing).
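
For illustration, a minimal userspace invocation could look like this
(assuming a hypothetical DRM_IOCTL_I915_GEM_EXECBUFFER3 wired to the struct
above, and a batch GPU VA already bound via VM_BIND):

        uint64_t batch_va = 0x100000;   /* GPU VA previously VM_BIND'ed */

        struct drm_i915_gem_execbuffer3 eb3 = {
                .ctx_id = ctx_id,               /* VM_BIND mode context */
                .batch_count = 1,
                .batch_addr_ptr = (uintptr_t)&batch_va,
                .flags = I915_EXEC3_RENDER,
                .extensions = 0,                /* e.g. timeline fences ext */
        };

        if (drmIoctl(fd, DRM_IOCTL_I915_GEM_EXECBUFFER3, &eb3))
                perror("execbuf3");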

I have removed many of the flags which were either legacy or not
applicable to VM_BIND mode.
I have also removed fence array support (execbuffer2.cliprects_ptr)
as we have timeline fence array support. Is that fine?
Do we still need FENCE_IN/FENCE_OUT/FENCE_SUBMIT support?

Anything else that needs to be added or removed?

Niranjana

>Niranjana
>
>>-Daniel
>>-- 
>>Daniel Vetter
>>Software Engineer, Intel Corporation
>>http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [Intel-gfx] [RFC v3 1/3] drm/doc/rfc: VM_BIND feature design document
  2022-06-02 20:35           ` Jason Ekstrand
  (?)
@ 2022-06-03  7:20           ` Lionel Landwerlin
  2022-06-03 23:51               ` Niranjana Vishwanathapura
  -1 siblings, 1 reply; 121+ messages in thread
From: Lionel Landwerlin @ 2022-06-03  7:20 UTC (permalink / raw)
  To: Jason Ekstrand, Niranjana Vishwanathapura
  Cc: Intel GFX, Chris Wilson, Thomas Hellstrom,
	Maling list - DRI developers, Daniel Vetter,
	Christian König

[-- Attachment #1: Type: text/plain, Size: 9238 bytes --]

On 02/06/2022 23:35, Jason Ekstrand wrote:
> On Thu, Jun 2, 2022 at 3:11 PM Niranjana Vishwanathapura 
> <niranjana.vishwanathapura@intel.com> wrote:
>
>     On Wed, Jun 01, 2022 at 01:28:36PM -0700, Matthew Brost wrote:
>     >On Wed, Jun 01, 2022 at 05:25:49PM +0300, Lionel Landwerlin wrote:
>     >> On 17/05/2022 21:32, Niranjana Vishwanathapura wrote:
>     >> > +VM_BIND/UNBIND ioctl will immediately start
>     binding/unbinding the mapping in an
>     >> > +async worker. The binding and unbinding will work like a
>     special GPU engine.
>     >> > +The binding and unbinding operations are serialized and will
>     wait on specified
>     >> > +input fences before the operation and will signal the output
>     fences upon the
>     >> > +completion of the operation. Due to serialization,
>     completion of an operation
>     >> > +will also indicate that all previous operations are also
>     complete.
>     >>
>     >> I guess we should avoid saying "will immediately start
>     binding/unbinding" if
>     >> there are fences involved.
>     >>
>     >> And the fact that it's happening in an async worker seem to
>     imply it's not
>     >> immediate.
>     >>
>
>     Ok, will fix.
>     This was added because in earlier design binding was deferred
>     until next execbuff.
>     But now it is non-deferred (immediate in that sense). But yah,
>     this is confusing
>     and will fix it.
>
>     >>
>     >> I have a question on the behavior of the bind operation when no
>     input fence
>     >> is provided. Let's say I do :
>     >>
>     >> VM_BIND (out_fence=fence1)
>     >>
>     >> VM_BIND (out_fence=fence2)
>     >>
>     >> VM_BIND (out_fence=fence3)
>     >>
>     >>
>     >> In what order are the fences going to be signaled?
>     >>
>     >> In the order of VM_BIND ioctls? Or out of order?
>     >>
>     >> Because you wrote "serialized" I assume it's : in order
>     >>
>
>     Yes, in the order of VM_BIND/UNBIND ioctls. Note that bind and
>     unbind will use
>     the same queue and hence are ordered.
>
>     >>
>     >> One thing I didn't realize is that because we only get one
>     "VM_BIND" engine,
>     >> there is a disconnect from the Vulkan specification.
>     >>
>     >> In Vulkan VM_BIND operations are serialized but per engine.
>     >>
>     >> So you could have something like this :
>     >>
>     >> VM_BIND (engine=rcs0, in_fence=fence1, out_fence=fence2)
>     >>
>     >> VM_BIND (engine=ccs0, in_fence=fence3, out_fence=fence4)
>     >>
>     >>
>     >> fence1 is not signaled
>     >>
>     >> fence3 is signaled
>     >>
>     >> So the second VM_BIND will proceed before the first VM_BIND.
>     >>
>     >>
>     >> I guess we can deal with that scenario in userspace by doing
>     the wait
>     >> ourselves in one thread per engines.
>     >>
>     >> But then it makes the VM_BIND input fences useless.
>     >>
>     >>
>     >> Daniel : what do you think? Should we rework this or just deal
>     with wait
>     >> fences in userspace?
>     >>
>     >
>     >My opinion is rework this but make the ordering via an engine
>     param optional.
>     >
>     >e.g. A VM can be configured so all binds are ordered within the VM
>     >
>     >e.g. A VM can be configured so all binds accept an engine
>     argument (in
>     >the case of the i915 likely this is a gem context handle) and binds
>     >ordered with respect to that engine.
>     >
>     >This gives UMDs options as the later likely consumes more KMD
>     resources
>     >so if a different UMD can live with binds being ordered within the VM
>     >they can use a mode consuming less resources.
>     >
>
>     I think we need to be careful here if we are looking for some out of
>     (submission) order completion of vm_bind/unbind.
>     In-order completion means, in a batch of binds and unbinds to be
>     completed in-order, the user only needs to specify an in-fence for the
>     first bind/unbind call and the out-fence for the last bind/unbind
>     call. Also, the VA released by an unbind call can be re-used by
>     any subsequent bind call in that in-order batch.
>
>     These things will break if binding/unbinding were to be allowed to
>     go out of order (of submission), and the user needs to be extra careful
>     not to run into premature triggering of the out-fence and binds failing
>     as VA is still in use etc.
>
>     Also, VM_BIND binds the provided mapping on the specified address
>     space
>     (VM). So, the uapi is not engine/context specific.
>
>     We can however add a 'queue' to the uapi which can be one from the
>     pre-defined queues,
>     I915_VM_BIND_QUEUE_0
>     I915_VM_BIND_QUEUE_1
>     ...
>     I915_VM_BIND_QUEUE_(N-1)
>
>     KMD will spawn an async work queue for each queue which will only
>     bind the mappings on that queue in the order of submission.
>     User can assign the queue to per engine or anything like that.
>
>     But again here, the user needs to be careful not to deadlock these
>     queues with circular dependency of fences.
>
>     I prefer adding this later as an extension based on whether it
>     is really helping with the implementation.
>
>
> I can tell you right now that having everything on a single in-order 
> queue will not get us the perf we want. What vulkan really wants is 
> one of two things:
>
>  1. No implicit ordering of VM_BIND ops.  They just happen in whatever
> order their dependencies are resolved and we ensure ordering ourselves by
> having a syncobj in the VkQueue.
>
>  2. The ability to create multiple VM_BIND queues.  We need at least 2 
> but I don't see why there needs to be a limit besides the limits the 
> i915 API already has on the number of engines.  Vulkan could expose 
> multiple sparse binding queues to the client if it's not arbitrarily 
> limited.
>
> Why?  Because Vulkan has two basic kind of bind operations and we 
> don't want any dependencies between them:
>
>  1. Immediate.  These happen right after BO creation or maybe as part 
> of vkBindImageMemory() or VkBindBufferMemory().  These don't happen on 
> a queue and we don't want them serialized with anything.  To 
> synchronize with submit, we'll have a syncobj in the VkDevice which is 
> signaled by all immediate bind operations and make submits wait on it.
>
>  2. Queued (sparse): These happen on a VkQueue which may be the same 
> as a render/compute queue or may be its own queue.  It's up to us what 
> we want to advertise.  From the Vulkan API PoV, this is like any other 
> queue.  Operations on it wait on and signal semaphores.  If we have a 
> VM_BIND engine, we'd provide syncobjs to wait and signal just like we 
> do in execbuf().
>
> The important thing is that we don't want one type of operation to 
> block on the other.  If immediate binds are blocking on sparse binds, 
> it's going to cause over-synchronization issues.
>
> In terms of the internal implementation, I know that there's going to 
> be a lock on the VM and that we can't actually do these things in 
> parallel.  That's fine.  Once the dma_fences have signaled and we're 
> unblocked to do the bind operation, I don't care if there's a bit of 
> synchronization due to locking.  That's expected.  What we can't 
> afford to have is an immediate bind operation suddenly blocking on a 
> sparse operation which is blocked on a compute job that's going to run 
> for another 5ms.
>
> For reference, Windows solves this by allowing arbitrarily many paging 
> queues (what they call a VM_BIND engine/queue).  That design works 
> pretty well and solves the problems in question.  Again, we could just 
> make everything out-of-order and require using syncobjs to order 
> things as userspace wants. That'd be fine too.
>
> One more note while I'm here: danvet said something on IRC about 
> VM_BIND queues waiting for syncobjs to materialize.  We don't really 
> want/need this.  We already have all the machinery in userspace to 
> handle wait-before-signal and waiting for syncobj fences to 
> materialize and that machinery is on by default.  It would actually 
> take MORE work in Mesa to turn it off and take advantage of the kernel 
> being able to wait for syncobjs to materialize.  Also, getting that 
> right is ridiculously hard and I really don't want to get it wrong in 
> kernel space. When we do memory fences, wait-before-signal will be a 
> thing.  We don't need to try and make it a thing for syncobj.
>
> --Jason


Thanks Jason,


I missed the bit in the Vulkan spec that we're allowed to have a sparse 
queue that does not implement either graphics or compute operations :

    "While some implementations may include VK_QUEUE_SPARSE_BINDING_BIT
    support in queue families that also include

      graphics and compute support, other implementations may only
    expose a VK_QUEUE_SPARSE_BINDING_BIT-only queue

      family."


So it can all be a vm_bind engine that just does bind/unbind
operations.
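
A minimal sketch of how a UMD would find such a sparse-binding-only queue
family, using only standard Vulkan queries (needs <vulkan/vulkan.h>,
<stdio.h> and <stdlib.h>):

        uint32_t count = 0;
        vkGetPhysicalDeviceQueueFamilyProperties(phys_dev, &count, NULL);
        VkQueueFamilyProperties *props = calloc(count, sizeof(*props));
        vkGetPhysicalDeviceQueueFamilyProperties(phys_dev, &count, props);

        for (uint32_t i = 0; i < count; i++) {
                VkQueueFlags f = props[i].queueFlags;
                /* Sparse binding without graphics or compute support. */
                if ((f & VK_QUEUE_SPARSE_BINDING_BIT) &&
                    !(f & (VK_QUEUE_GRAPHICS_BIT | VK_QUEUE_COMPUTE_BIT)))
                        printf("sparse-binding-only family: %u\n", i);
        }
        free(props);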


But yes we need another engine for the immediate/non-sparse operations.


-Lionel


>     Daniel, any thoughts?
>
>     Niranjana
>
>     >Matt
>     >
>     >>
>     >> Sorry I noticed this late.
>     >>
>     >>
>     >> -Lionel
>     >>
>     >>
>

[-- Attachment #2: Type: text/html, Size: 14209 bytes --]

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [Intel-gfx] [RFC v3 1/3] drm/doc/rfc: VM_BIND feature design document
  2022-06-03  7:20           ` Lionel Landwerlin
@ 2022-06-03 23:51               ` Niranjana Vishwanathapura
  0 siblings, 0 replies; 121+ messages in thread
From: Niranjana Vishwanathapura @ 2022-06-03 23:51 UTC (permalink / raw)
  To: Lionel Landwerlin
  Cc: Intel GFX, Maling list - DRI developers, Thomas Hellstrom,
	Chris Wilson, Jason Ekstrand, Daniel Vetter,
	Christian König

On Fri, Jun 03, 2022 at 10:20:25AM +0300, Lionel Landwerlin wrote:
>   On 02/06/2022 23:35, Jason Ekstrand wrote:
>
>     On Thu, Jun 2, 2022 at 3:11 PM Niranjana Vishwanathapura
>     <niranjana.vishwanathapura@intel.com> wrote:
>
>       On Wed, Jun 01, 2022 at 01:28:36PM -0700, Matthew Brost wrote:
>       >On Wed, Jun 01, 2022 at 05:25:49PM +0300, Lionel Landwerlin wrote:
>       >> On 17/05/2022 21:32, Niranjana Vishwanathapura wrote:
>       >> > +VM_BIND/UNBIND ioctl will immediately start binding/unbinding
>       the mapping in an
>       >> > +async worker. The binding and unbinding will work like a special
>       GPU engine.
>       >> > +The binding and unbinding operations are serialized and will
>       wait on specified
>       >> > +input fences before the operation and will signal the output
>       fences upon the
>       >> > +completion of the operation. Due to serialization, completion of
>       an operation
>       >> > +will also indicate that all previous operations are also
>       complete.
>       >>
>       >> I guess we should avoid saying "will immediately start
>       binding/unbinding" if
>       >> there are fences involved.
>       >>
>       >> And the fact that it's happening in an async worker seem to imply
>       it's not
>       >> immediate.
>       >>
>
>       Ok, will fix.
>       This was added because in earlier design binding was deferred until
>       next execbuff.
>       But now it is non-deferred (immediate in that sense). But yah, this is
>       confusing
>       and will fix it.
>
>       >>
>       >> I have a question on the behavior of the bind operation when no
>       input fence
>       >> is provided. Let's say I do :
>       >>
>       >> VM_BIND (out_fence=fence1)
>       >>
>       >> VM_BIND (out_fence=fence2)
>       >>
>       >> VM_BIND (out_fence=fence3)
>       >>
>       >>
>       >> In what order are the fences going to be signaled?
>       >>
>       >> In the order of VM_BIND ioctls? Or out of order?
>       >>
>       >> Because you wrote "serialized" I assume it's : in order
>       >>
>
>       Yes, in the order of VM_BIND/UNBIND ioctls. Note that bind and unbind
>       will use
>       the same queue and hence are ordered.
>
>       >>
>       >> One thing I didn't realize is that because we only get one
>       "VM_BIND" engine,
>       >> there is a disconnect from the Vulkan specification.
>       >>
>       >> In Vulkan VM_BIND operations are serialized but per engine.
>       >>
>       >> So you could have something like this :
>       >>
>       >> VM_BIND (engine=rcs0, in_fence=fence1, out_fence=fence2)
>       >>
>       >> VM_BIND (engine=ccs0, in_fence=fence3, out_fence=fence4)
>       >>
>       >>
>       >> fence1 is not signaled
>       >>
>       >> fence3 is signaled
>       >>
>       >> So the second VM_BIND will proceed before the first VM_BIND.
>       >>
>       >>
>       >> I guess we can deal with that scenario in userspace by doing the
>       wait
>       >> ourselves in one thread per engines.
>       >>
>       >> But then it makes the VM_BIND input fences useless.
>       >>
>       >>
>       >> Daniel : what do you think? Should we rework this or just deal with
>       wait
>       >> fences in userspace?
>       >>
>       >
>       >My opinion is rework this but make the ordering via an engine param
>       optional.
>       >
>       >e.g. A VM can be configured so all binds are ordered within the VM
>       >
>       >e.g. A VM can be configured so all binds accept an engine argument
>       (in
>       >the case of the i915 likely this is a gem context handle) and binds
>       >ordered with respect to that engine.
>       >
>       >This gives UMDs options as the later likely consumes more KMD
>       resources
>       >so if a different UMD can live with binds being ordered within the VM
>       >they can use a mode consuming less resources.
>       >
>
>       I think we need to be careful here if we are looking for some out of
>       (submission) order completion of vm_bind/unbind.
>       In-order completion means, in a batch of binds and unbinds to be
>       completed in-order, the user only needs to specify an in-fence for the
>       first bind/unbind call and the out-fence for the last bind/unbind
>       call. Also, the VA released by an unbind call can be re-used by
>       any subsequent bind call in that in-order batch.
>
>       These things will break if binding/unbinding were to be allowed to
>       go out of order (of submission), and the user needs to be extra careful
>       not to run into premature triggering of the out-fence and binds failing
>       as VA is still in use etc.
>
>       Also, VM_BIND binds the provided mapping on the specified address
>       space
>       (VM). So, the uapi is not engine/context specific.
>
>       We can however add a 'queue' to the uapi which can be one from the
>       pre-defined queues,
>       I915_VM_BIND_QUEUE_0
>       I915_VM_BIND_QUEUE_1
>       ...
>       I915_VM_BIND_QUEUE_(N-1)
>
>       KMD will spawn an async work queue for each queue which will only
>       bind the mappings on that queue in the order of submission.
>       User can assign the queue to per engine or anything like that.
>
>       But again here, the user needs to be careful not to deadlock these
>       queues with circular dependency of fences.
>
>       I prefer adding this later as an extension based on whether it
>       is really helping with the implementation.
>
>     I can tell you right now that having everything on a single in-order
>     queue will not get us the perf we want.  What vulkan really wants is one
>     of two things:
>      1. No implicit ordering of VM_BIND ops.  They just happen in whatever
>     order their dependencies are resolved and we ensure ordering ourselves by
>     having a syncobj in the VkQueue.
>      2. The ability to create multiple VM_BIND queues.  We need at least 2
>     but I don't see why there needs to be a limit besides the limits the
>     i915 API already has on the number of engines.  Vulkan could expose
>     multiple sparse binding queues to the client if it's not arbitrarily
>     limited.

Thanks Jason, Lionel.

Jason, what are you referring to when you say "limits the i915 API already
has on the number of engines"? I am not sure if there is such an uapi today.

I am trying to see how many queues we need and don't want it to be arbitrarily
large and unduly blow up memory usage and complexity in the i915 driver.

>     Why?  Because Vulkan has two basic kind of bind operations and we don't
>     want any dependencies between them:
>      1. Immediate.  These happen right after BO creation or maybe as part of
>     vkBindImageMemory() or VkBindBufferMemory().  These don't happen on a
>     queue and we don't want them serialized with anything.  To synchronize
>     with submit, we'll have a syncobj in the VkDevice which is signaled by
>     all immediate bind operations and make submits wait on it.
>      2. Queued (sparse): These happen on a VkQueue which may be the same as
>     a render/compute queue or may be its own queue.  It's up to us what we
>     want to advertise.  From the Vulkan API PoV, this is like any other
>     queue.  Operations on it wait on and signal semaphores.  If we have a
>     VM_BIND engine, we'd provide syncobjs to wait and signal just like we do
>     in execbuf().
>     The important thing is that we don't want one type of operation to block
>     on the other.  If immediate binds are blocking on sparse binds, it's
>     going to cause over-synchronization issues.
>     In terms of the internal implementation, I know that there's going to be
>     a lock on the VM and that we can't actually do these things in
>     parallel.  That's fine.  Once the dma_fences have signaled and we're

That's correct. It is like a single VM_BIND engine with multiple queues
feeding to it.
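
A rough kernel-side sketch of that model, with one ordered workqueue per uapi
queue id all feeding the single bind engine (the names here are hypothetical;
alloc_ordered_workqueue() is the real kernel helper):

        struct i915_vm_bind_queue {
                /* Ordered: binds on this queue run in submission order. */
                struct workqueue_struct *wq;
        };

        static int vm_bind_queues_init(struct i915_vm_bind_queue *q, int n)
        {
                for (int i = 0; i < n; i++) {
                        q[i].wq = alloc_ordered_workqueue("vm_bind_q%d", 0, i);
                        if (!q[i].wq)
                                return -ENOMEM;
                }
                return 0;
        }

        /* The vm_bind ioctl would then do something like:
         * queue_work(q[args->queue_id].wq, &bind_work->work); */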

>     unblocked to do the bind operation, I don't care if there's a bit of
>     synchronization due to locking.  That's expected.  What we can't afford
>     to have is an immediate bind operation suddenly blocking on a sparse
>     operation which is blocked on a compute job that's going to run for
>     another 5ms.

As the VM_BIND queue is per VM, VM_BIND on one VM doesn't block the VM_BIND
on other VMs. I am not sure about the use cases here, but just wanted to clarify.

Niranjana

>     For reference, Windows solves this by allowing arbitrarily many paging
>     queues (what they call a VM_BIND engine/queue).  That design works
>     pretty well and solves the problems in question.  Again, we could just
>     make everything out-of-order and require using syncobjs to order things
>     as userspace wants. That'd be fine too.
>     One more note while I'm here: danvet said something on IRC about VM_BIND
>     queues waiting for syncobjs to materialize.  We don't really want/need
>     this.  We already have all the machinery in userspace to handle
>     wait-before-signal and waiting for syncobj fences to materialize and
>     that machinery is on by default.  It would actually take MORE work in
>     Mesa to turn it off and take advantage of the kernel being able to wait
>     for syncobjs to materialize.  Also, getting that right is ridiculously
>     hard and I really don't want to get it wrong in kernel space.  When we
>     do memory fences, wait-before-signal will be a thing.  We don't need to
>     try and make it a thing for syncobj.
>     --Jason
>
>   Thanks Jason,
>
>   I missed the bit in the Vulkan spec that we're allowed to have a sparse
>   queue that does not implement either graphics or compute operations :
>
>     "While some implementations may include VK_QUEUE_SPARSE_BINDING_BIT
>     support in queue families that also include
>
>      graphics and compute support, other implementations may only expose a
>     VK_QUEUE_SPARSE_BINDING_BIT-only queue
>
>      family."
>
>   So it can all be all a vm_bind engine that just does bind/unbind
>   operations.
>
>   But yes we need another engine for the immediate/non-sparse operations.
>
>   -Lionel
>
>      
>
>       Daniel, any thoughts?
>
>       Niranjana
>
>       >Matt
>       >
>       >>
>       >> Sorry I noticed this late.
>       >>
>       >>
>       >> -Lionel
>       >>
>       >>

^ permalink raw reply	[flat|nested] 121+ messages in thread


* RE: [RFC v3 1/3] drm/doc/rfc: VM_BIND feature design document
  2022-06-02 20:48       ` [Intel-gfx] " Niranjana Vishwanathapura
@ 2022-06-06 20:45         ` Zeng, Oak
  -1 siblings, 0 replies; 121+ messages in thread
From: Zeng, Oak @ 2022-06-06 20:45 UTC (permalink / raw)
  To: Vishwanathapura, Niranjana
  Cc: Brost, Matthew, intel-gfx, dri-devel, Hellstrom, Thomas, Wilson,
	Chris P, jason, Vetter,  Daniel, christian.koenig



Regards,
Oak

> -----Original Message-----
> From: Vishwanathapura, Niranjana <niranjana.vishwanathapura@intel.com>
> Sent: June 2, 2022 4:49 PM
> To: Zeng, Oak <oak.zeng@intel.com>
> Cc: intel-gfx@lists.freedesktop.org; dri-devel@lists.freedesktop.org; Vetter,
> Daniel <daniel.vetter@intel.com>; Brost, Matthew <matthew.brost@intel.com>;
> Hellstrom, Thomas <thomas.hellstrom@intel.com>; jason@jlekstrand.net;
> Wilson, Chris P <chris.p.wilson@intel.com>; christian.koenig@amd.com
> Subject: Re: [RFC v3 1/3] drm/doc/rfc: VM_BIND feature design document
> 
> On Wed, Jun 01, 2022 at 07:13:16PM -0700, Zeng, Oak wrote:
> >
> >
> >Regards,
> >Oak
> >
> >> -----Original Message-----
> >> From: dri-devel <dri-devel-bounces@lists.freedesktop.org> On Behalf Of
> >> Niranjana Vishwanathapura
> >> Sent: May 17, 2022 2:32 PM
> >> To: intel-gfx@lists.freedesktop.org; dri-devel@lists.freedesktop.org; Vetter,
> >> Daniel <daniel.vetter@intel.com>
> >> Cc: Brost, Matthew <matthew.brost@intel.com>; Hellstrom, Thomas
> >> <thomas.hellstrom@intel.com>; jason@jlekstrand.net; Wilson, Chris P
> >> <chris.p.wilson@intel.com>; christian.koenig@amd.com
> >> Subject: [RFC v3 1/3] drm/doc/rfc: VM_BIND feature design document
> >>
> >> VM_BIND design document with description of intended use cases.
> >>
> >> v2: Add more documentation and format as per review comments
> >>     from Daniel.
> >>
> >> Signed-off-by: Niranjana Vishwanathapura
> >> <niranjana.vishwanathapura@intel.com>
> >> ---
> >>  Documentation/driver-api/dma-buf.rst   |   2 +
> >>  Documentation/gpu/rfc/i915_vm_bind.rst | 304
> >> +++++++++++++++++++++++++
> >>  Documentation/gpu/rfc/index.rst        |   4 +
> >>  3 files changed, 310 insertions(+)
> >>  create mode 100644 Documentation/gpu/rfc/i915_vm_bind.rst
> >>
> >> diff --git a/Documentation/driver-api/dma-buf.rst b/Documentation/driver-
> >> api/dma-buf.rst
> >> index 36a76cbe9095..64cb924ec5bb 100644
> >> --- a/Documentation/driver-api/dma-buf.rst
> >> +++ b/Documentation/driver-api/dma-buf.rst
> >> @@ -200,6 +200,8 @@ DMA Fence uABI/Sync File
> >>  .. kernel-doc:: include/linux/sync_file.h
> >>     :internal:
> >>
> >> +.. _indefinite_dma_fences:
> >> +
> >>  Indefinite DMA Fences
> >>  ~~~~~~~~~~~~~~~~~~~~~
> >>
> >> diff --git a/Documentation/gpu/rfc/i915_vm_bind.rst
> >> b/Documentation/gpu/rfc/i915_vm_bind.rst
> >> new file mode 100644
> >> index 000000000000..f1be560d313c
> >> --- /dev/null
> >> +++ b/Documentation/gpu/rfc/i915_vm_bind.rst
> >> @@ -0,0 +1,304 @@
> >> +==========================================
> >> +I915 VM_BIND feature design and use cases
> >> +==========================================
> >> +
> >> +VM_BIND feature
> >> +================
> >> +DRM_I915_GEM_VM_BIND/UNBIND ioctls allow the UMD to bind/unbind GEM
> >> +buffer objects (BOs) or sections of BOs at specified GPU virtual addresses
> >> +on a
> >> +specified address space (VM). These mappings (also referred to as persistent
> >> +mappings) will be persistent across multiple GPU submissions (execbuff calls)
> >> +issued by the UMD, without the user having to provide a list of all
> >> +required mappings during each submission (as required by the older execbuff
> >> +mode).
> >> +
> >> +VM_BIND/UNBIND ioctls will support 'in' and 'out' fences to allow userspace
> >> +to specify how the binding/unbinding should sync with other operations
> >> +like the GPU job submission. These fences will be timeline 'drm_syncobj's
> >> +for non-Compute contexts (See struct drm_i915_vm_bind_ext_timeline_fences).
> >> +For Compute contexts, they will be user/memory fences (See struct
> >> +drm_i915_vm_bind_ext_user_fence).
> >> +
> >> +The VM_BIND feature is advertised to the user via I915_PARAM_HAS_VM_BIND.
> >> +The user has to opt in to the VM_BIND mode of binding for an address space
> >> +(VM) at VM creation time via the I915_VM_CREATE_FLAGS_USE_VM_BIND extension.
> >> +
> >> +VM_BIND/UNBIND ioctl will immediately start binding/unbinding the mapping
> >> +in an
> >> +async worker. The binding and unbinding will work like a special GPU engine.
> >> +The binding and unbinding operations are serialized and will wait on specified
> >> +input fences before the operation and will signal the output fences upon the
> >> +completion of the operation. Due to serialization, completion of an operation
> >> +will also indicate that all previous operations are also complete.
> >
> >Hi,
> >
> >Is the user required to wait for the out fence to be signaled before
> >submitting a gpu job using the vm_bind address?
> >Or is the user required to order the gpu job so that it runs after the
> >vm_bind out fence is signaled?
> >
> 
> Thanks Oak,
> Either should be fine and up to user how to use vm_bind/unbind out-fence.
> 
> >I think there could be different behavior on a non-faultable platform and a
> >faultable platform, such as: on a non-faultable platform, the gpu job is
> >required to be ordered after vm_bind out fence signaling; and on a faultable
> >platform, there is no such restriction since vm bind can be finished in the
> >fault handler?
> >
> 
> With a GPU page fault handler, the out fence won't be needed as residency is
> purely managed by the page fault handler populating page tables (there is a
> mention of it in the GPU Page Faults section below).
> 
> >Should we document such thing?
> >
> 
> We don't talk much about the GPU page faults case in this document as that
> may warrant a separate rfc when we add page fault support. We did mention it
> in a couple of places to ensure our locking design here is extensible to the
> gpu page faults case.

Ok, that makes sense to me. Thanks for explaining.

Regards,
Oak

> 
> Niranjana
> 
> >Regards,
> >Oak
> >
> >
> >> +
> >> +VM_BIND features include:
> >> +
> >> +* Multiple Virtual Address (VA) mappings can map to the same physical
> >> +  pages of an object (aliasing).
> >> +* VA mapping can map to a partial section of the BO (partial binding).
> >> +* Support capture of persistent mappings in the dump upon GPU error.
> >> +* TLB is flushed upon unbind completion. Batching of TLB flushes in some
> >> +  use cases will be helpful.
> >> +* Asynchronous vm_bind and vm_unbind support with 'in' and 'out' fences.
> >> +* Support for userptr gem objects (no special uapi is required for this).
> >> +
> >> +Execbuff ioctl in VM_BIND mode
> >> +-------------------------------
> >> +The execbuff ioctl handling in VM_BIND mode differs significantly from the
> >> +older method. A VM in VM_BIND mode will not support the older execbuff
> >> +mode of binding. In VM_BIND mode, the execbuff ioctl will not accept any
> >> +execlist. Hence, no support for implicit sync. It is expected that the
> >> +below work will be able to support the requirements of object dependency
> >> +setting in all use cases:
> >> +
> >> +"dma-buf: Add an API for exporting sync files"
> >> +(https://lwn.net/Articles/859290/)
> >> +
> >> +This also means we need an execbuff extension to pass in the batch
> >> +buffer addresses (See struct drm_i915_gem_execbuffer_ext_batch_addresses).
> >> +
> >> +If at all execlist support in execbuff ioctl is deemed necessary for
> >> +implicit sync in certain use cases, then support can be added later.
> >> +
> >> +In VM_BIND mode, VA allocation is completely managed by the user instead
> >> +of the i915 driver. Hence all VA assignment and eviction are not applicable
> >> +in VM_BIND mode. Also, for determining object activeness, VM_BIND mode
> >> +will not be using the i915_vma active reference tracking. It will instead
> >> +use the dma-resv object for that (See `VM_BIND dma_resv usage`_).
> >> +
> >> +So, a lot of existing code in the execbuff path, like relocations, VA
> >> +evictions, the vma lookup table, implicit sync, vma active reference
> >> +tracking etc., is not applicable in VM_BIND mode. Hence, the execbuff
> >> +path needs to be cleaned up by clearly separating out the
> >> +functionalities where the VM_BIND mode differs from the older method,
> >> +and those should be moved to separate files.
> >> +
> >> +VM_PRIVATE objects
> >> +-------------------
> >> +By default, BOs can be mapped on multiple VMs and can also be dma-buf
> >> +exported. Hence these BOs are referred to as Shared BOs.
> >> +During each execbuff submission, the request fence must be added to the
> >> +dma-resv fence list of all shared BOs mapped on the VM.
> >> +
> >> +The VM_BIND feature introduces an optimization where the user can
> >> +create a BO which is private to a specified VM via the
> >> +I915_GEM_CREATE_EXT_VM_PRIVATE flag during BO creation. Unlike Shared
> >> +BOs, these VM private BOs can only be mapped on the VM they are private
> >> +to and can't be dma-buf exported. All private BOs of a VM share the
> >> +dma-resv object. Hence, during each execbuff submission, they need only
> >> +one dma-resv fence list updated. Thus, the fast path (where required
> >> +mappings are already bound) submission latency is O(1) w.r.t. the
> >> +number of VM private BOs.
> >> +
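A sketch of creating such a VM private BO with the
I915_GEM_CREATE_EXT_VM_PRIVATE extension proposed in patch 3/3 (illustrative
only; error handling omitted):

    struct drm_i915_gem_create_ext_vm_private priv = {
            .base.name = I915_GEM_CREATE_EXT_VM_PRIVATE,
            .vm_id     = vm_id,     /* the VM this BO is private to */
    };
    struct drm_i915_gem_create_ext create = {
            .size       = size,
            .extensions = (__u64)(uintptr_t)&priv,
    };
    ioctl(fd, DRM_IOCTL_I915_GEM_CREATE_EXT, &create);
    /* create.handle can now only be vm_bound on vm_id and cannot be
     * dma-buf exported. */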
> >> +VM_BIND locking hierarchy
> >> +-------------------------
> >> +The locking design here supports the older (execlist based) execbuff
> >> +mode, the newer VM_BIND mode, the VM_BIND mode with GPU page faults and
> >> +possible future system allocator support (See `Shared Virtual Memory
> >> +(SVM) support`_). The older execbuff mode and the newer VM_BIND mode
> >> +without page faults manage residency of backing storage using
> >> +dma_fence. The VM_BIND mode with page faults and the system allocator
> >> +support do not use any dma_fence at all.
> >> +
> >> +VM_BIND locking order is as below.
> >> +
> >> +1) Lock-A: A vm_bind mutex will protect vm_bind lists. This lock is
> >> +   taken in vm_bind/vm_unbind ioctl calls, in the execbuff path and
> >> +   while releasing the mapping.
> >> +
> >> +   In future, when GPU page faults are supported, we can potentially
> >> +   use a rwsem instead, so that multiple page fault handlers can take
> >> +   the read side lock to look up the mapping and hence can run in
> >> +   parallel. The older execbuff mode of binding does not need this
> >> +   lock.
> >> +
> >> +2) Lock-B: The object's dma-resv lock will protect i915_vma state and
> >> +   needs to be held while binding/unbinding a vma in the async worker
> >> +   and while updating the dma-resv fence list of an object. Note that
> >> +   private BOs of a VM will all share a dma-resv object.
> >> +
> >> +   The future system allocator support will use the HMM prescribed locking
> >> +   instead.
> >> +
> >> +3) Lock-C: Spinlock/s to protect some of the VM's lists like the list of
> >> +   invalidated vmas (due to eviction and userptr invalidation) etc.
> >> +
> >> +When GPU page faults are supported, the execbuff path does not take
> >> +any of these locks. There we will simply smash the new batch buffer
> >> +address into the ring and then tell the scheduler to run that. The lock
> >> +taking only happens from the page fault handler, where we take lock-A
> >> +in read mode, whichever lock-B we need to find the backing storage
> >> +(dma_resv lock for gem objects, and hmm/core mm for the system
> >> +allocator) and some additional locks (lock-D) for taking care of page
> >> +table races. Page fault mode should not need to ever manipulate the vm
> >> +lists, so it won't ever need lock-C.
> >> +
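A compressed sketch of the nesting described above; the structure and field
names here are hypothetical, only the Lock-A/B/C ordering is from the text:

    mutex_lock(&vm->vm_bind_lock);          /* Lock-A: vm_bind lists    */
    dma_resv_lock(obj->base.resv, NULL);    /* Lock-B: i915_vma state   */
    spin_lock(&vm->invalidated_lock);       /* Lock-C: invalidated vmas */
    /* ... update mapping / fence state ... */
    spin_unlock(&vm->invalidated_lock);
    dma_resv_unlock(obj->base.resv);
    mutex_unlock(&vm->vm_bind_lock);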
> >> +VM_BIND LRU handling
> >> +---------------------
> >> +We need to ensure VM_BIND mapped objects are properly LRU tagged to
> >> +avoid performance degradation. We will also need support for bulk LRU
> >> +movement of VM_BIND objects to avoid additional latencies in the
> >> +execbuff path.
> >> +
> >> +The page table pages are similar to VM_BIND mapped objects (See
> >> +`Evictable page table allocations`_), are maintained per VM and need
> >> +to be pinned in memory when the VM is made active (ie., upon an
> >> +execbuff call with that VM). So, bulk LRU movement of page table pages
> >> +is also needed.
> >> +
> >> +The i915 shrinker LRU has stopped being an LRU. So, it should also be
> >> +moved over to the ttm LRU in some fashion to make sure we once again
> >> +have a reasonable and consistent memory aging and reclaim architecture.
> >> +
> >> +VM_BIND dma_resv usage
> >> +-----------------------
> >> +Fences need to be added to all VM_BIND mapped objects. During each
> >> +execbuff submission, they are added with DMA_RESV_USAGE_BOOKKEEP usage
> >> +to prevent over sync (See enum dma_resv_usage). One can override it
> >> +with either DMA_RESV_USAGE_READ or DMA_RESV_USAGE_WRITE usage during
> >> +object dependency setting (either through an explicit or implicit
> >> +mechanism).
> >> +
> >> +When vm_bind is called for a non-private object while the VM is
> >> +already active, the fences need to be copied from the VM's shared
> >> +dma-resv object (common to all private objects of the VM) to this
> >> +non-private object. If this results in performance degradation, then
> >> +some optimization will be needed here. This is not a problem for the
> >> +VM's private objects as they use the shared dma-resv object which is
> >> +always updated on each execbuff submission.
> >> +
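A sketch of the fence bookkeeping this implies on each submission, assuming
the dma_resv_add_fence()/DMA_RESV_USAGE_BOOKKEEP api referenced above (the
vm/vma field names are hypothetical):

    /* Fast path: one shared resv object covers all VM private BOs. */
    dma_resv_add_fence(vm->resv, &rq->fence, DMA_RESV_USAGE_BOOKKEEP);

    /* Shared (non-private) BOs mapped on the VM are updated one by one. */
    list_for_each_entry(vma, &vm->non_priv_bind_list, bind_link)
            dma_resv_add_fence(vma->obj->base.resv, &rq->fence,
                               DMA_RESV_USAGE_BOOKKEEP);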
> >> +Also, in VM_BIND mode, use the dma-resv apis for determining object
> >> +activeness (See dma_resv_test_signaled() and dma_resv_wait_timeout())
> >> +and do not use the older i915_vma active reference tracking, which is
> >> +deprecated. This should be easier to get working with the current TTM
> >> +backend. We can remove the i915_vma active reference tracking fully
> >> +while supporting the TTM backend for igfx.
> >> +
> >> +Evictable page table allocations
> >> +---------------------------------
> >> +Make pagetable allocations evictable and manage them similarly to
> >> +VM_BIND mapped objects. Page table pages are similar to persistent
> >> +mappings of a VM (the differences here are that the page table pages
> >> +will not have an i915_vma structure and that, after swapping pages back
> >> +in, the parent page link needs to be updated).
> >> +
> >> +Mesa use case
> >> +--------------
> >> +VM_BIND can potentially reduce the CPU overhead in Mesa (both Vulkan
> >> +and Iris), hence improving performance of CPU-bound applications. It
> >> +also allows us to implement Vulkan's Sparse Resources. With increasing
> >> +GPU hardware performance, reducing CPU overhead becomes more impactful.
> >> +
> >> +
> >> +VM_BIND Compute support
> >> +========================
> >> +
> >> +User/Memory Fence
> >> +------------------
> >> +The idea is to take a user specified virtual address and install an
> >> +interrupt handler to wake up the current task when the memory location
> >> +passes the user supplied filter. A User/Memory fence is an
> >> +<address, value> pair. To signal the user fence, the specified value
> >> +will be written at the specified virtual address and the waiting
> >> +process woken up. The user can wait on a user fence with the
> >> +gem_wait_user_fence ioctl.
> >> +
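A sketch of such a wait from userspace, using the
DRM_I915_GEM_WAIT_USER_FENCE ioctl proposed in patch 3/3 (all values
illustrative):

    /* Sleep until the u64 at fence_va becomes >= target. */
    struct drm_i915_gem_wait_user_fence wait = {
            .addr    = (__u64)(uintptr_t)fence_va,
            .ctx_id  = ctx_id,              /* context that will signal */
            .op      = I915_UFENCE_WAIT_GTE,
            .value   = target,
            .mask    = I915_UFENCE_WAIT_U64,
            .timeout = 1000000000ll,        /* 1s, relative nanoseconds */
    };
    ioctl(fd, DRM_IOCTL_I915_GEM_WAIT_USER_FENCE, &wait);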
> >> +It also allows the user to emit their own MI_FLUSH/PIPE_CONTROL notify
> >> +interrupt within their batches after updating the value, to have
> >> +sub-batch precision on the wakeup. Each batch can signal a user fence
> >> +to indicate the completion of the next level batch. The completion of
> >> +the very first level batch needs to be signaled by the command
> >> +streamer. The user must provide the user/memory fence for this via the
> >> +DRM_I915_GEM_EXECBUFFER_EXT_USER_FENCE extension of the execbuff
> >> +ioctl, so that the KMD can set up the command streamer to signal it.
> >> +
> >> +A User/Memory fence can also be supplied to the kernel driver to
> >> +signal/wake up the user process after completion of an asynchronous
> >> +operation.
> >> +
> >> +When the VM_BIND ioctl is provided with a user/memory fence via the
> >> +I915_VM_BIND_EXT_USER_FENCE extension, it will be signaled upon the
> >> +completion of binding of that mapping. All async binds/unbinds are
> >> +serialized, hence signaling of the user/memory fence also indicates
> >> +the completion of all previous binds/unbinds.
> >> +
> >> +This feature will be derived from the below original work:
> >> +https://patchwork.freedesktop.org/patch/349417/
> >> +
> >> +Long running Compute contexts
> >> +------------------------------
> >> +Usage of dma-fence expects that fences complete in a reasonable amount
> >> +of time. Compute, on the other hand, can be long running. Hence it is
> >> +appropriate for compute to use user/memory fences, and dma-fence usage
> >> +will be limited to in-kernel consumption only. This requires an
> >> +execbuff uapi extension to pass in the user fence (See struct
> >> +drm_i915_vm_bind_ext_user_fence). Compute must opt in to this
> >> +mechanism with the I915_CONTEXT_CREATE_FLAGS_LONG_RUNNING flag during
> >> +context creation. The dma-fence based user interfaces like the
> >> +gem_wait ioctl and the execbuff out fence are not allowed on long
> >> +running contexts. Implicit sync is not valid either, and is anyway not
> >> +supported in VM_BIND mode.
> >> +
> >> +Where GPU page faults are not available, the kernel driver, upon
> >> +buffer invalidation, will initiate a suspend (preemption) of the long
> >> +running context with a dma-fence attached to it. Upon completion of
> >> +that suspend fence, it will finish the invalidation, revalidate the BO
> >> +and then resume the compute context. This is done by having a
> >> +per-context preempt fence (also called suspend fence) proxying as the
> >> +i915_request fence. This suspend fence is enabled when someone tries
> >> +to wait on it, which then triggers the context preemption.
> >> +
> >> +As this support for context suspension using a preempt fence, and the
> >> +resume work for the compute mode contexts, can be tricky to get right,
> >> +it is better to add this support in the drm scheduler so that multiple
> >> +drivers can make use of it. That means it will have a dependency on
> >> +the i915 drm scheduler conversion with the GuC scheduler backend. This
> >> +should be fine, as the plan is to support compute mode contexts only
> >> +with the GuC scheduler backend (at least initially). This is much
> >> +easier to support with VM_BIND mode compared to the current heavier
> >> +execbuff path resource attachment.
> >> +
> >> +Low Latency Submission
> >> +-----------------------
> >> +Allows the compute UMD to directly submit GPU jobs instead of going
> >> +through the execbuff ioctl. This is made possible by VM_BIND not being
> >> +synchronized against execbuff. VM_BIND allows bind/unbind of the
> >> +mappings required for the directly submitted jobs.
> >> +
> >> +Other VM_BIND use cases
> >> +========================
> >> +
> >> +Debugger
> >> +---------
> >> +With the debug event interface, a user space process (the debugger)
> >> +is able to keep track of and act upon resources created by another
> >> +process (the debuggee) and attached to the GPU via the vm_bind
> >> +interface.
> >> +
> >> +GPU page faults
> >> +----------------
> >> +GPU page faults, when supported (in future), will only be supported
> >> +in the VM_BIND mode. While both the older execbuff mode and the newer
> >> +VM_BIND mode of binding will require using dma-fence to ensure
> >> +residency, the GPU page faults mode, when supported, will not use any
> >> +dma-fence, as residency is purely managed by installing and
> >> +removing/invalidating page table entries.
> >> +
> >> +Page level hints settings
> >> +--------------------------
> >> +VM_BIND allows setting any hints per mapping instead of per BO.
> >> +Possible hints include read-only mapping, placement and atomicity.
> >> +Sub-BO level placement hint will be even more relevant with
> >> +upcoming GPU on-demand page fault support.
> >> +
> >> +Page level Cache/CLOS settings
> >> +-------------------------------
> >> +VM_BIND allows cache/CLOS settings per mapping instead of per BO.
> >> +
> >> +Shared Virtual Memory (SVM) support
> >> +------------------------------------
> >> +The VM_BIND interface can be used to map system memory directly
> >> +(without the gem BO abstraction) using the HMM interface. SVM is only
> >> +supported with GPU page faults enabled.
> >> +
> >> +
> >> +Broader i915 cleanups
> >> +=====================
> >> +Supporting this whole new vm_bind mode of binding, which comes with
> >> +its own use cases to support and its locking requirements, requires
> >> +proper integration with the existing i915 driver. This calls for some
> >> +broader i915 driver cleanups/simplifications for maintainability of
> >> +the driver going forward. Here are a few things identified that are
> >> +being looked into.
> >> +
> >> +- Remove the vma lookup cache (eb->gem_context->handles_vma). The
> >> +  VM_BIND feature does not use it, and the complexity it brings in is
> >> +  probably more than the performance advantage we get in the legacy
> >> +  execbuff case.
> >> +- Remove vma->open_count counting.
> >> +- Remove i915_vma active reference tracking. The VM_BIND feature will
> >> +  not be using it. Instead, use the underlying BO's dma-resv fence
> >> +  list to determine whether an i915_vma is active or not.
> >> +
> >> +
> >> +VM_BIND UAPI
> >> +=============
> >> +
> >> +.. kernel-doc:: Documentation/gpu/rfc/i915_vm_bind.h
> >> diff --git a/Documentation/gpu/rfc/index.rst
> >> b/Documentation/gpu/rfc/index.rst
> >> index 91e93a705230..7d10c36b268d 100644
> >> --- a/Documentation/gpu/rfc/index.rst
> >> +++ b/Documentation/gpu/rfc/index.rst
> >> @@ -23,3 +23,7 @@ host such documentation:
> >>  .. toctree::
> >>
> >>      i915_scheduler.rst
> >> +
> >> +.. toctree::
> >> +
> >> +    i915_vm_bind.rst
> >> --
> >> 2.21.0.rc0.32.g243a4c7e27
> >

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [Intel-gfx] [RFC v3 3/3] drm/doc/rfc: VM_BIND uapi definition
  2022-05-17 18:32   ` Niranjana Vishwanathapura
  (?)
  (?)
@ 2022-06-07 10:27   ` Tvrtko Ursulin
  2022-06-07 19:37     ` Niranjana Vishwanathapura
  -1 siblings, 1 reply; 121+ messages in thread
From: Tvrtko Ursulin @ 2022-06-07 10:27 UTC (permalink / raw)
  To: Niranjana Vishwanathapura, intel-gfx, dri-devel, daniel.vetter
  Cc: thomas.hellstrom, christian.koenig, chris.p.wilson


On 17/05/2022 19:32, Niranjana Vishwanathapura wrote:
> VM_BIND and related uapi definitions
> 
> v2: Ensure proper kernel-doc formatting with cross references.
>      Also add new uapi and documentation as per review comments
>      from Daniel.
> 
> Signed-off-by: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>
> ---
>   Documentation/gpu/rfc/i915_vm_bind.h | 399 +++++++++++++++++++++++++++
>   1 file changed, 399 insertions(+)
>   create mode 100644 Documentation/gpu/rfc/i915_vm_bind.h
> 
> diff --git a/Documentation/gpu/rfc/i915_vm_bind.h b/Documentation/gpu/rfc/i915_vm_bind.h
> new file mode 100644
> index 000000000000..589c0a009107
> --- /dev/null
> +++ b/Documentation/gpu/rfc/i915_vm_bind.h
> @@ -0,0 +1,399 @@
> +/* SPDX-License-Identifier: MIT */
> +/*
> + * Copyright © 2022 Intel Corporation
> + */
> +
> +/**
> + * DOC: I915_PARAM_HAS_VM_BIND
> + *
> + * VM_BIND feature availability.
> + * See typedef drm_i915_getparam_t param.
> + */
> +#define I915_PARAM_HAS_VM_BIND		57
> +
> +/**
> + * DOC: I915_VM_CREATE_FLAGS_USE_VM_BIND
> + *
> + * Flag to opt-in for VM_BIND mode of binding during VM creation.
> + * See struct drm_i915_gem_vm_control flags.
> + *
> + * A VM in VM_BIND mode will not support the older execbuff mode of binding.
> + * In VM_BIND mode, execbuff ioctl will not accept any execlist (ie., the
> + * &drm_i915_gem_execbuffer2.buffer_count must be 0).
> + * Also, &drm_i915_gem_execbuffer2.batch_start_offset and
> + * &drm_i915_gem_execbuffer2.batch_len must be 0.
> + * DRM_I915_GEM_EXECBUFFER_EXT_BATCH_ADDRESSES extension must be provided
> + * to pass in the batch buffer addresses.
> + *
> + * Additionally, I915_EXEC_NO_RELOC, I915_EXEC_HANDLE_LUT and
> + * I915_EXEC_BATCH_FIRST of &drm_i915_gem_execbuffer2.flags must be 0
> + * (not used) in VM_BIND mode. I915_EXEC_USE_EXTENSIONS flag must always be
> + * set (See struct drm_i915_gem_execbuffer_ext_batch_addresses).
> + * The buffers_ptr, buffer_count, batch_start_offset and batch_len fields
> + * of struct drm_i915_gem_execbuffer2 are also not used and must be 0.
> + */
> +#define I915_VM_CREATE_FLAGS_USE_VM_BIND	(1 << 0)
> +
> +/**
> + * DOC: I915_CONTEXT_CREATE_FLAGS_LONG_RUNNING
> + *
> + * Flag to declare context as long running.
> + * See struct drm_i915_gem_context_create_ext flags.
> + *
> + * Usage of dma-fence expects that they complete in reasonable amount of time.
> + * Compute on the other hand can be long running. Hence it is not appropriate
> + * for compute contexts to export request completion dma-fence to user.
> + * The dma-fence usage will be limited to in-kernel consumption only.
> + * Compute contexts need to use user/memory fence.
> + *
> + * So, long running contexts do not support output fences. Hence,
> + * I915_EXEC_FENCE_OUT (See &drm_i915_gem_execbuffer2.flags and
> + * I915_EXEC_FENCE_SIGNAL (See &drm_i915_gem_exec_fence.flags) are expected
> + * to be not used.
> + *
> + * DRM_I915_GEM_WAIT ioctl call is also not supported for objects mapped
> + * to long running contexts.
> + */
> +#define I915_CONTEXT_CREATE_FLAGS_LONG_RUNNING   (1u << 2)
> +
> +/* VM_BIND related ioctls */
> +#define DRM_I915_GEM_VM_BIND		0x3d
> +#define DRM_I915_GEM_VM_UNBIND		0x3e
> +#define DRM_I915_GEM_WAIT_USER_FENCE	0x3f
> +
> +#define DRM_IOCTL_I915_GEM_VM_BIND		DRM_IOWR(DRM_COMMAND_BASE + DRM_I915_GEM_VM_BIND, struct drm_i915_gem_vm_bind)
> +#define DRM_IOCTL_I915_GEM_VM_UNBIND		DRM_IOWR(DRM_COMMAND_BASE + DRM_I915_GEM_VM_UNBIND, struct drm_i915_gem_vm_bind)
> +#define DRM_IOCTL_I915_GEM_WAIT_USER_FENCE	DRM_IOWR(DRM_COMMAND_BASE + DRM_I915_GEM_WAIT_USER_FENCE, struct drm_i915_gem_wait_user_fence)
> +
> +/**
> + * struct drm_i915_gem_vm_bind - VA to object mapping to bind.
> + *
> + * This structure is passed to VM_BIND ioctl and specifies the mapping of GPU
> + * virtual address (VA) range to the section of an object that should be bound
> + * in the device page table of the specified address space (VM).
> + * The VA range specified must be unique (ie., not currently bound) and can
> + * be mapped to whole object or a section of the object (partial binding).
> + * Multiple VA mappings can be created to the same section of the object
> + * (aliasing).
> + */
> +struct drm_i915_gem_vm_bind {
> +	/** @vm_id: VM (address space) id to bind */
> +	__u32 vm_id;
> +
> +	/** @handle: Object handle */
> +	__u32 handle;
> +
> +	/** @start: Virtual Address start to bind */
> +	__u64 start;
> +
> +	/** @offset: Offset in object to bind */
> +	__u64 offset;
> +
> +	/** @length: Length of mapping to bind */
> +	__u64 length;

Does it support, or should it, an equivalent of EXEC_OBJECT_PAD_TO_SIZE? Or,
if not, is userspace expected to map the remainder of the space to a dummy
object? In that case, would there be any alignment/padding issues preventing
the two binds from being placed next to each other?

I ask because someone from the compute side asked me about a problem 
with their strategy of dealing with overfetch and I suggested pad to size.

Regards,

Tvrtko

> +
> +	/**
> +	 * @flags: Supported flags are,
> +	 *
> +	 * I915_GEM_VM_BIND_READONLY:
> +	 * Mapping is read-only.
> +	 *
> +	 * I915_GEM_VM_BIND_CAPTURE:
> +	 * Capture this mapping in the dump upon GPU error.
> +	 */
> +	__u64 flags;
> +#define I915_GEM_VM_BIND_READONLY    (1 << 0)
> +#define I915_GEM_VM_BIND_CAPTURE     (1 << 1)
> +
> +	/** @extensions: 0-terminated chain of extensions for this mapping. */
> +	__u64 extensions;
> +};
> +
> +/**
> + * struct drm_i915_gem_vm_unbind - VA to object mapping to unbind.
> + *
> + * This structure is passed to VM_UNBIND ioctl and specifies the GPU virtual
> + * address (VA) range that should be unbound from the device page table of the
> + * specified address space (VM). The specified VA range must match one of the
> + * mappings created with the VM_BIND ioctl. TLB is flushed upon unbind
> + * completion.
> + */
> +struct drm_i915_gem_vm_unbind {
> +	/** @vm_id: VM (address space) id to bind */
> +	__u32 vm_id;
> +
> +	/** @rsvd: Reserved for future use; must be zero. */
> +	__u32 rsvd;
> +
> +	/** @start: Virtual Address start to unbind */
> +	__u64 start;
> +
> +	/** @length: Length of mapping to unbind */
> +	__u64 length;
> +
> +	/** @flags: reserved for future usage, currently MBZ */
> +	__u64 flags;
> +
> +	/** @extensions: 0-terminated chain of extensions for this mapping. */
> +	__u64 extensions;
> +};
> +
> +/**
> + * struct drm_i915_vm_bind_fence - An input or output fence for the vm_bind
> + * or the vm_unbind work.
> + *
> + * The vm_bind or vm_unbind async worker will wait for the input fence to signal
> + * before starting the binding or unbinding.
> + *
> + * The vm_bind or vm_unbind async worker will signal the returned output fence
> + * after the completion of binding or unbinding.
> + */
> +struct drm_i915_vm_bind_fence {
> +	/** @handle: User's handle for a drm_syncobj to wait on or signal. */
> +	__u32 handle;
> +
> +	/**
> +	 * @flags: Supported flags are,
> +	 *
> +	 * I915_VM_BIND_FENCE_WAIT:
> +	 * Wait for the input fence before binding/unbinding
> +	 *
> +	 * I915_VM_BIND_FENCE_SIGNAL:
> +	 * Return bind/unbind completion fence as output
> +	 */
> +	__u32 flags;
> +#define I915_VM_BIND_FENCE_WAIT            (1<<0)
> +#define I915_VM_BIND_FENCE_SIGNAL          (1<<1)
> +#define __I915_VM_BIND_FENCE_UNKNOWN_FLAGS (-(I915_VM_BIND_FENCE_SIGNAL << 1))
> +};
> +
> +/**
> + * struct drm_i915_vm_bind_ext_timeline_fences - Timeline fences for vm_bind
> + * and vm_unbind.
> + *
> + * This structure describes an array of timeline drm_syncobj and associated
> + * points for timeline variants of drm_syncobj. These timeline 'drm_syncobj's
> + * can be input or output fences (See struct drm_i915_vm_bind_fence).
> + */
> +struct drm_i915_vm_bind_ext_timeline_fences {
> +#define I915_VM_BIND_EXT_timeline_FENCES	0
> +	/** @base: Extension link. See struct i915_user_extension. */
> +	struct i915_user_extension base;
> +
> +	/**
> +	 * @fence_count: Number of elements in the @handles_ptr & @value_ptr
> +	 * arrays.
> +	 */
> +	__u64 fence_count;
> +
> +	/**
> +	 * @handles_ptr: Pointer to an array of struct drm_i915_vm_bind_fence
> +	 * of length @fence_count.
> +	 */
> +	__u64 handles_ptr;
> +
> +	/**
> +	 * @values_ptr: Pointer to an array of u64 values of length
> +	 * @fence_count.
> +	 * Values must be 0 for a binary drm_syncobj. A Value of 0 for a
> +	 * timeline drm_syncobj is invalid as it turns a drm_syncobj into a
> +	 * binary one.
> +	 */
> +	__u64 values_ptr;
> +};
> +
> +/**
> + * struct drm_i915_vm_bind_user_fence - An input or output user fence for the
> + * vm_bind or the vm_unbind work.
> + *
> + * The vm_bind or vm_unbind async worker will wait for the input fence (value at
> + * @addr to become equal to @val) before starting the binding or unbinding.
> + *
> + * The vm_bind or vm_unbind async worker will signal the output fence after
> + * the completion of binding or unbinding by writing @val to memory location at
> + * @addr
> + */
> +struct drm_i915_vm_bind_user_fence {
> +	/** @addr: User/Memory fence qword aligned process virtual address */
> +	__u64 addr;
> +
> +	/** @val: User/Memory fence value to be written after bind completion */
> +	__u64 val;
> +
> +	/**
> +	 * @flags: Supported flags are,
> +	 *
> +	 * I915_VM_BIND_USER_FENCE_WAIT:
> +	 * Wait for the input fence before binding/unbinding
> +	 *
> +	 * I915_VM_BIND_USER_FENCE_SIGNAL:
> +	 * Return bind/unbind completion fence as output
> +	 */
> +	__u32 flags;
> +#define I915_VM_BIND_USER_FENCE_WAIT            (1<<0)
> +#define I915_VM_BIND_USER_FENCE_SIGNAL          (1<<1)
> +#define __I915_VM_BIND_USER_FENCE_UNKNOWN_FLAGS \
> +	(-(I915_VM_BIND_USER_FENCE_SIGNAL << 1))
> +};
> +
> +/**
> + * struct drm_i915_vm_bind_ext_user_fence - User/memory fences for vm_bind
> + * and vm_unbind.
> + *
> + * These user fences can be input or output fences
> + * (See struct drm_i915_vm_bind_user_fence).
> + */
> +struct drm_i915_vm_bind_ext_user_fence {
> +#define I915_VM_BIND_EXT_USER_FENCES	1
> +	/** @base: Extension link. See struct i915_user_extension. */
> +	struct i915_user_extension base;
> +
> +	/** @fence_count: Number of elements in the @user_fence_ptr array. */
> +	__u64 fence_count;
> +
> +	/**
> +	 * @user_fence_ptr: Pointer to an array of
> +	 * struct drm_i915_vm_bind_user_fence of length @fence_count.
> +	 */
> +	__u64 user_fence_ptr;
> +};
> +
> +/**
> + * struct drm_i915_gem_execbuffer_ext_batch_addresses - Array of batch buffer
> + * gpu virtual addresses.
> + *
> + * In the execbuff ioctl (See struct drm_i915_gem_execbuffer2), this extension
> + * must always be appended in the VM_BIND mode and it will be an error to
> + * append this extension in older non-VM_BIND mode.
> + */
> +struct drm_i915_gem_execbuffer_ext_batch_addresses {
> +#define DRM_I915_GEM_EXECBUFFER_EXT_BATCH_ADDRESSES	1
> +	/** @base: Extension link. See struct i915_user_extension. */
> +	struct i915_user_extension base;
> +
> +	/** @count: Number of addresses in the addr array. */
> +	__u32 count;
> +
> +	/** @addr: An array of batch gpu virtual addresses. */
> +	__u64 addr[0];
> +};
> +
> +/**
> + * struct drm_i915_gem_execbuffer_ext_user_fence - First level batch completion
> + * signaling extension.
> + *
> + * This extension allows user to attach a user fence (@addr, @value pair) to an
> + * execbuf to be signaled by the command streamer after the completion of first
> + * level batch, by writing the @value at specified @addr and triggering an
> + * interrupt.
> + * User can either poll for this user fence to signal or can also wait on it
> + * with i915_gem_wait_user_fence ioctl.
> + * This is very useful for long running contexts, where waiting on a
> + * dma-fence by the user (like the i915_gem_wait ioctl) is not supported.
> + */
> +struct drm_i915_gem_execbuffer_ext_user_fence {
> +#define DRM_I915_GEM_EXECBUFFER_EXT_USER_FENCE		2
> +	/** @base: Extension link. See struct i915_user_extension. */
> +	struct i915_user_extension base;
> +
> +	/**
> +	 * @addr: User/Memory fence qword aligned GPU virtual address.
> +	 *
> +	 * Address has to be a valid GPU virtual address at the time of
> +	 * first level batch completion.
> +	 */
> +	__u64 addr;
> +
> +	/**
> +	 * @value: User/Memory fence Value to be written to above address
> +	 * after first level batch completes.
> +	 */
> +	__u64 value;
> +
> +	/** @rsvd: Reserved for future extensions, MBZ */
> +	__u64 rsvd;
> +};
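A sketch of attaching this extension to an execbuf in VM_BIND mode
(illustrative; the extension would be chained like any other
i915_user_extension, with I915_EXEC_USE_EXTENSIONS set):

    struct drm_i915_gem_execbuffer_ext_user_fence ufence = {
            .base.name = DRM_I915_GEM_EXECBUFFER_EXT_USER_FENCE,
            .addr      = fence_gpu_va,  /* must be bound at completion */
            .value     = ++seqno,
    };
    /* Userspace can then poll the fence location, or wait on it with the
     * DRM_I915_GEM_WAIT_USER_FENCE ioctl described below. */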
> +
> +/**
> + * struct drm_i915_gem_create_ext_vm_private - Extension to make the object
> + * private to the specified VM.
> + *
> + * See struct drm_i915_gem_create_ext.
> + */
> +struct drm_i915_gem_create_ext_vm_private {
> +#define I915_GEM_CREATE_EXT_VM_PRIVATE		2
> +	/** @base: Extension link. See struct i915_user_extension. */
> +	struct i915_user_extension base;
> +
> +	/** @vm_id: Id of the VM to which the object is private */
> +	__u32 vm_id;
> +};
> +
> +/**
> + * struct drm_i915_gem_wait_user_fence - Wait on user/memory fence.
> + *
> + * User/Memory fence can be woken up either by:
> + *
> + * 1. GPU context indicated by @ctx_id, or,
> + * 2. Kernel driver async worker upon I915_UFENCE_WAIT_SOFT.
> + *    @ctx_id is ignored when this flag is set.
> + *
> + * Wakeup condition is,
> + * ``((*addr & mask) op (value & mask))``
> + *
> + * See :ref:`Documentation/driver-api/dma-buf.rst <indefinite_dma_fences>`
> + */
> +struct drm_i915_gem_wait_user_fence {
> +	/** @extensions: Zero-terminated chain of extensions. */
> +	__u64 extensions;
> +
> +	/** @addr: User/Memory fence address */
> +	__u64 addr;
> +
> +	/** @ctx_id: Id of the Context which will signal the fence. */
> +	__u32 ctx_id;
> +
> +	/** @op: Wakeup condition operator */
> +	__u16 op;
> +#define I915_UFENCE_WAIT_EQ      0
> +#define I915_UFENCE_WAIT_NEQ     1
> +#define I915_UFENCE_WAIT_GT      2
> +#define I915_UFENCE_WAIT_GTE     3
> +#define I915_UFENCE_WAIT_LT      4
> +#define I915_UFENCE_WAIT_LTE     5
> +#define I915_UFENCE_WAIT_BEFORE  6
> +#define I915_UFENCE_WAIT_AFTER   7
> +
> +	/**
> +	 * @flags: Supported flags are,
> +	 *
> +	 * I915_UFENCE_WAIT_SOFT:
> +	 *
> +	 * To be woken up by i915 driver async worker (not by GPU).
> +	 *
> +	 * I915_UFENCE_WAIT_ABSTIME:
> +	 *
> +	 * Wait timeout specified as absolute time.
> +	 */
> +	__u16 flags;
> +#define I915_UFENCE_WAIT_SOFT    0x1
> +#define I915_UFENCE_WAIT_ABSTIME 0x2
> +
> +	/** @value: Wakeup value */
> +	__u64 value;
> +
> +	/** @mask: Wakeup mask */
> +	__u64 mask;
> +#define I915_UFENCE_WAIT_U8     0xffu
> +#define I915_UFENCE_WAIT_U16    0xffffu
> +#define I915_UFENCE_WAIT_U32    0xfffffffful
> +#define I915_UFENCE_WAIT_U64    0xffffffffffffffffull
> +
> +	/**
> +	 * @timeout: Wait timeout in nanoseconds.
> +	 *
> +	 * If the I915_UFENCE_WAIT_ABSTIME flag is set, then the timeout is the
> +	 * absolute time in nsec.
> +	 */
> +	__s64 timeout;
> +};

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [Intel-gfx] [RFC v3 3/3] drm/doc/rfc: VM_BIND uapi definition
  2022-06-03  6:53             ` Niranjana Vishwanathapura
@ 2022-06-07 10:42               ` Tvrtko Ursulin
  2022-06-07 21:25                 ` Niranjana Vishwanathapura
  2022-06-08  6:40               ` Lionel Landwerlin
  2022-06-08  7:12               ` Lionel Landwerlin
  2 siblings, 1 reply; 121+ messages in thread
From: Tvrtko Ursulin @ 2022-06-07 10:42 UTC (permalink / raw)
  To: Niranjana Vishwanathapura, Daniel Vetter
  Cc: Zanoni, Paulo R, intel-gfx, dri-devel, Hellstrom, Thomas, Wilson,
	Chris P, Vetter, Daniel, christian.koenig


On 03/06/2022 07:53, Niranjana Vishwanathapura wrote:
> On Wed, Jun 01, 2022 at 10:08:35PM -0700, Niranjana Vishwanathapura wrote:
>> On Wed, Jun 01, 2022 at 11:27:17AM +0200, Daniel Vetter wrote:
>>> On Wed, 1 Jun 2022 at 11:03, Dave Airlie <airlied@gmail.com> wrote:
>>>>
>>>> On Tue, 24 May 2022 at 05:20, Niranjana Vishwanathapura
>>>> <niranjana.vishwanathapura@intel.com> wrote:
>>>>>
>>>>> On Thu, May 19, 2022 at 04:07:30PM -0700, Zanoni, Paulo R wrote:
>>>>> >On Tue, 2022-05-17 at 11:32 -0700, Niranjana Vishwanathapura wrote:
>>>>> >> VM_BIND and related uapi definitions
>>>>> >>
>>>>> >> v2: Ensure proper kernel-doc formatting with cross references.
>>>>> >>     Also add new uapi and documentation as per review comments
>>>>> >>     from Daniel.
>>>>> >>
>>>>> >> Signed-off-by: Niranjana Vishwanathapura
>>>>> >> <niranjana.vishwanathapura@intel.com>
>>>>> >> ---
>>>>> >>  Documentation/gpu/rfc/i915_vm_bind.h | 399 +++++++++++++++++++++++++++
>>>>> >>  1 file changed, 399 insertions(+)
>>>>> >>  create mode 100644 Documentation/gpu/rfc/i915_vm_bind.h
>>>>> >>
>>>>> >> diff --git a/Documentation/gpu/rfc/i915_vm_bind.h
>>>>> >> b/Documentation/gpu/rfc/i915_vm_bind.h
>>>>> >> new file mode 100644
>>>>> >> index 000000000000..589c0a009107
>>>>> >> --- /dev/null
>>>>> >> +++ b/Documentation/gpu/rfc/i915_vm_bind.h
>>>>> >> @@ -0,0 +1,399 @@
>>>>> >> +/* SPDX-License-Identifier: MIT */
>>>>> >> +/*
>>>>> >> + * Copyright © 2022 Intel Corporation
>>>>> >> + */
>>>>> >> +
>>>>> >> +/**
>>>>> >> + * DOC: I915_PARAM_HAS_VM_BIND
>>>>> >> + *
>>>>> >> + * VM_BIND feature availability.
>>>>> >> + * See typedef drm_i915_getparam_t param.
>>>>> >> + */
>>>>> >> +#define I915_PARAM_HAS_VM_BIND               57
>>>>> >> +
>>>>> >> +/**
>>>>> >> + * DOC: I915_VM_CREATE_FLAGS_USE_VM_BIND
>>>>> >> + *
>>>>> >> + * Flag to opt-in for VM_BIND mode of binding during VM creation.
>>>>> >> + * See struct drm_i915_gem_vm_control flags.
>>>>> >> + *
>>>>> >> + * A VM in VM_BIND mode will not support the older execbuff mode
>>>>> >> + * of binding. In VM_BIND mode, execbuff ioctl will not accept any
>>>>> >> + * execlist (ie., the
>>>>> >> + * &drm_i915_gem_execbuffer2.buffer_count must be 0).
>>>>> >> + * Also, &drm_i915_gem_execbuffer2.batch_start_offset and
>>>>> >> + * &drm_i915_gem_execbuffer2.batch_len must be 0.
>>>>> >> + * DRM_I915_GEM_EXECBUFFER_EXT_BATCH_ADDRESSES extension must be
>>>>> >> + * provided
>>>>> >> + * to pass in the batch buffer addresses.
>>>>> >> + *
>>>>> >> + * Additionally, I915_EXEC_NO_RELOC, I915_EXEC_HANDLE_LUT and
>>>>> >> + * I915_EXEC_BATCH_FIRST of &drm_i915_gem_execbuffer2.flags must
>>>>> >> + * be 0 (not used) in VM_BIND mode. I915_EXEC_USE_EXTENSIONS flag
>>>>> >> + * must always be set (See struct
>>>>> >> + * drm_i915_gem_execbuffer_ext_batch_addresses).
>>>>> >> + * The buffers_ptr, buffer_count, batch_start_offset and batch_len
>>>>> >> + * fields of struct drm_i915_gem_execbuffer2 are also not used and
>>>>> >> + * must be 0.
>>>>> >> + */
>>>>> >
>>>>> >From that description, it seems we have:
>>>>> >
>>>>> >struct drm_i915_gem_execbuffer2 {
>>>>> >        __u64 buffers_ptr;              -> must be 0 (new)
>>>>> >        __u32 buffer_count;             -> must be 0 (new)
>>>>> >        __u32 batch_start_offset;       -> must be 0 (new)
>>>>> >        __u32 batch_len;                -> must be 0 (new)
>>>>> >        __u32 DR1;                      -> must be 0 (old)
>>>>> >        __u32 DR4;                      -> must be 0 (old)
>>>>> >        __u32 num_cliprects; (fences)   -> must be 0 since using extensions
>>>>> >        __u64 cliprects_ptr; (fences, extensions) -> contains an actual pointer!
>>>>> >        __u64 flags;                    -> some flags must be 0 (new)
>>>>> >        __u64 rsvd1; (context info)     -> repurposed field (old)
>>>>> >        __u64 rsvd2;                    -> unused
>>>>> >};
>>>>> >
>>>>> >Based on that, why can't we just get drm_i915_gem_execbuffer3 instead
>>>>> >of adding even more complexity to an already abused interface? While
>>>>> >the Vulkan-like extension thing is really nice, I don't think what
>>>>> >we're doing here is extending the ioctl usage, we're completely
>>>>> >changing how the base struct should be interpreted based on how 
>>>>> the VM
>>>>> >was created (which is an entirely different ioctl).
>>>>> >
>>>>> >From Rusty Russel's API Design grading, drm_i915_gem_execbuffer2 is
>>>>> >already at -6 without these changes. I think after vm_bind we'll need
>>>>> >to create a -11 entry just to deal with this ioctl.
>>>>> >
>>>>>
>>>>> The only change here is removing the execlist support for VM_BIND
>>>>> mode (other than natural extensions).
>>>>> Adding a new execbuffer3 was considered, but I think we need to be 
>>>>> careful
>>>>> with that as that goes beyond the VM_BIND support, including any 
>>>>> future
>>>>> requirements (as we don't want an execbuffer4 after VM_BIND).
>>>>
>>>> Why not? It's not like adding extensions here is really that different
>>>> from adding new ioctls.
>>>>
>>>> I definitely think this deserves an execbuffer3 without even
>>>> considering future requirements. Just to burn down the old
>>>> requirements and pointless fields.
>>>>
>>>> Make execbuffer3 be vm_bind only: no relocs, no legacy bits; leave the
>>>> older sw on execbuf2 forever.
>>>
>>> I guess another point in favour of execbuf3 would be that it's less
>>> midlayer. If we share the entry point then there's quite a few vfuncs
>>> needed to cleanly split out the vm_bind paths from the legacy
>>> reloc/softpin paths.
>>>
>>> If we invert this and do execbuf3, then there's the existing ioctl
>>> vfunc, and then we share code (where it even makes sense, probably
>>> request setup/submit need to be shared, anything else is probably
>>> cleaner to just copypaste) with the usual helper approach.
>>>
>>> Also that would guarantee that really none of the old concepts like
>>> i915_active on the vma or vma open counts and all that stuff leaks
>>> into the new vm_bind execbuf.
>>>
>>> Finally I also think that copypasting would make backporting easier,
>>> or at least more flexible, since it should make it easier to have the
>>> upstream vm_bind co-exist with all the other things we have. Without
>>> huge amounts of conflicts (or at least much less) that pushing a pile
>>> of vfuncs into the existing code would cause.
>>>
>>> So maybe we should do this?
>>
>> Thanks Dave, Daniel.
>> There are a few things that will be common between execbuf2 and
>> execbuf3, like request setup/submit (as you said), fence handling 
>> (timeline fences, fence array, composite fences), engine selection,
>> etc. Also, many of the 'flags' will be there in execbuf3 too (but
>> bit positions will differ).
>> But I guess these should be fine, as the suggestion here is to
>> copy-paste the execbuff code and have shared code where possible.
>> Besides, we can stop supporting some older features in execbuf3
>> (like the fence array, in favor of the newer timeline fences), which will
>> further reduce the common code.
>>
>> Ok, I will update this series by adding execbuf3 and send out soon.
>>
> 
> Does this sound reasonable?
> 
> struct drm_i915_gem_execbuffer3 {
>         __u32 ctx_id;        /* previously execbuffer2.rsvd1 */
> 
>         __u32 batch_count;
>         __u64 batch_addr_ptr;    /* Pointer to an array of batch gpu 
> virtual addresses */

Casual stumble upon..

Alternatively you could embed N pointers to make life a bit easier for
both the userspace and kernel side. But then you run into the "N batch
buffers should be enough for everyone" problem.. :)
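
Roughly, the two layouts being compared would be (sketch only; the
fixed N below is made up for illustration, not a proposal):

        /* Pointer to a userspace array: no limit on the number of
         * batches, but the kernel needs an extra copy_from_user() to
         * fetch the addresses. */
        __u32 batch_count;
        __u64 batch_addr_ptr;   /* array of batch_count u64 GPU VAs */

        /* Embedded addresses: one copy and simpler on both sides, but
         * it bakes in a limit. */
        __u32 batch_count;      /* must be <= 4 */
        __u64 batch_addr[4];    /* "4 should be enough for everyone" */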

> 
>         __u64 flags;
> #define I915_EXEC3_RING_MASK              (0x3f)
> #define I915_EXEC3_DEFAULT                (0<<0)
> #define I915_EXEC3_RENDER                 (1<<0)
> #define I915_EXEC3_BSD                    (2<<0)
> #define I915_EXEC3_BLT                    (3<<0)
> #define I915_EXEC3_VEBOX                  (4<<0)
> 
> #define I915_EXEC3_SECURE               (1<<6)
> #define I915_EXEC3_IS_PINNED            (1<<7)
> 
> #define I915_EXEC3_BSD_SHIFT     (8)
> #define I915_EXEC3_BSD_MASK      (3 << I915_EXEC3_BSD_SHIFT)
> #define I915_EXEC3_BSD_DEFAULT   (0 << I915_EXEC3_BSD_SHIFT)
> #define I915_EXEC3_BSD_RING1     (1 << I915_EXEC3_BSD_SHIFT)
> #define I915_EXEC3_BSD_RING2     (2 << I915_EXEC3_BSD_SHIFT)

I'd suggest legacy engine selection is unwanted, especially with the
convoluted BSD1/2 flags. Can we just require a context with an engine
map, plus an index into it? Or, if the default context has to be
supported, then I'd suggest ...class_instance for that mode.
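
For illustration, selection without the legacy ring flags could look
something like this (the field names are hypothetical;
i915_engine_class_instance is the existing uapi struct):

        /* Index into the engine map of the context identified by
         * ctx_id; no I915_EXEC3_RING_MASK style bits needed. */
        __u32 engine_idx;

        /* Or, if the default context must be supported, select by
         * class:instance instead, e.g. { I915_ENGINE_CLASS_RENDER, 0 }: */
        struct i915_engine_class_instance engine;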

> #define I915_EXEC3_FENCE_IN             (1<<10)
> #define I915_EXEC3_FENCE_OUT            (1<<11)
> #define I915_EXEC3_FENCE_SUBMIT         (1<<12)

People are likely to object to the submit fence, since a generic
mechanism to align submissions was rejected.

> 
>         __u64 in_out_fence;        /* previously execbuffer2.rsvd2 */

With a new ioctl you can afford dedicated fields.
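
For example (field names hypothetical):

        __u32 in_fence;         /* syncobj handle to wait on; 0 = none */
        __u32 out_fence;        /* syncobj handle to signal; 0 = none */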

In any case I suggest you involve UMD folks in designing it.

Regards,

Tvrtko

> 
>         __u64 extensions;        /* currently only for 
> DRM_I915_GEM_EXECBUFFER_EXT_TIMELINE_FENCES */
> };
> 
> With this, the user can pass in the batch addresses and count directly,
> instead of via an extension (as this RFC series was proposing).
> 
> I have removed many of the flags which were either legacy or not
> applicable to VM_BIND mode.
> I have also removed fence array support (execbuffer2.cliprects_ptr)
> as we have timeline fence array support. Is that fine?
> Do we still need FENCE_IN/FENCE_OUT/FENCE_SUBMIT support?
> 
> Anything else that needs to be added or removed?
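
As a strawman, a userspace submission against that proposal could look
roughly like this (DRM_IOCTL_I915_GEM_EXECBUFFER3 and the struct layout
are hypothetical at this point):

        __u64 batch = batch_gpu_va;     /* VA bound earlier via VM_BIND */
        struct drm_i915_gem_execbuffer3 eb3 = {
                .ctx_id = ctx_id,       /* context created in VM_BIND mode */
                .batch_count = 1,
                .batch_addr_ptr = (__u64)(uintptr_t)&batch,
                .flags = 0,
                .extensions = 0,        /* or a chain of i915_user_extension */
        };
        drmIoctl(fd, DRM_IOCTL_I915_GEM_EXECBUFFER3, &eb3);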
> 
> Niranjana
> 
>> Niranjana
>>
>>> -Daniel
>>> -- 
>>> Daniel Vetter
>>> Software Engineer, Intel Corporation
>>> http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [Intel-gfx] [RFC v3 1/3] drm/doc/rfc: VM_BIND feature design document
  2022-06-03 23:51               ` Niranjana Vishwanathapura
@ 2022-06-07 17:12                 ` Jason Ekstrand
  -1 siblings, 0 replies; 121+ messages in thread
From: Jason Ekstrand @ 2022-06-07 17:12 UTC (permalink / raw)
  To: Niranjana Vishwanathapura
  Cc: Chris Wilson, Intel GFX, Maling list - DRI developers,
	Thomas Hellstrom, Lionel Landwerlin, Daniel Vetter,
	Christian König

On Fri, Jun 3, 2022 at 6:52 PM Niranjana Vishwanathapura <
niranjana.vishwanathapura@intel.com> wrote:

> On Fri, Jun 03, 2022 at 10:20:25AM +0300, Lionel Landwerlin wrote:
> >   On 02/06/2022 23:35, Jason Ekstrand wrote:
> >
> >     On Thu, Jun 2, 2022 at 3:11 PM Niranjana Vishwanathapura
> >     <niranjana.vishwanathapura@intel.com> wrote:
> >
> >       On Wed, Jun 01, 2022 at 01:28:36PM -0700, Matthew Brost wrote:
> >       >On Wed, Jun 01, 2022 at 05:25:49PM +0300, Lionel Landwerlin wrote:
> >       >> On 17/05/2022 21:32, Niranjana Vishwanathapura wrote:
> >       >> > +VM_BIND/UNBIND ioctl will immediately start binding/unbinding
> >       the mapping in an
> >       >> > +async worker. The binding and unbinding will work like a
> special
> >       GPU engine.
> >       >> > +The binding and unbinding operations are serialized and will
> >       wait on specified
> >       >> > +input fences before the operation and will signal the output
> >       fences upon the
> >       >> > +completion of the operation. Due to serialization,
> completion of
> >       an operation
> >       >> > +will also indicate that all previous operations are also
> >       complete.
> >       >>
> >       >> I guess we should avoid saying "will immediately start
> >       binding/unbinding" if
> >       >> there are fences involved.
> >       >>
> >       >> And the fact that it's happening in an async worker seems to
> >       >> imply it's not immediate.
> >       >>
> >
> >       Ok, will fix.
> >       This was added because in earlier design binding was deferred until
> >       next execbuff.
> >       But now it is non-deferred (immediate in that sense). But yah,
> >       this is confusing and will fix it.
> >
> >       >>
> >       >> I have a question on the behavior of the bind operation when no
> >       input fence
> >       >> is provided. Let's say I do:
> >       >>
> >       >> VM_BIND (out_fence=fence1)
> >       >>
> >       >> VM_BIND (out_fence=fence2)
> >       >>
> >       >> VM_BIND (out_fence=fence3)
> >       >>
> >       >>
> >       >> In what order are the fences going to be signaled?
> >       >>
> >       >> In the order of VM_BIND ioctls? Or out of order?
> >       >>
> >       >> Because you wrote "serialized" I assume it's: in order
> >       >>
> >
> >       Yes, in the order of VM_BIND/UNBIND ioctls. Note that bind and
> unbind
> >       will use
> >       the same queue and hence are ordered.
> >
> >       >>
> >       >> One thing I didn't realize is that because we only get one
> >       "VM_BIND" engine,
> >       >> there is a disconnect from the Vulkan specification.
> >       >>
> >       >> In Vulkan VM_BIND operations are serialized but per engine.
> >       >>
> >       >> So you could have something like this :
> >       >>
> >       >> VM_BIND (engine=rcs0, in_fence=fence1, out_fence=fence2)
> >       >>
> >       >> VM_BIND (engine=ccs0, in_fence=fence3, out_fence=fence4)
> >       >>
> >       >>
> >       >> fence1 is not signaled
> >       >>
> >       >> fence3 is signaled
> >       >>
> >       >> So the second VM_BIND will proceed before the first VM_BIND.
> >       >>
> >       >>
> >       >> I guess we can deal with that scenario in userspace by doing the
> >       wait
> >       >> ourselves in one thread per engines.
> >       >>
> >       >> But then it makes the VM_BIND input fences useless.
> >       >>
> >       >>
> >       >> Daniel : what do you think? Should we rework this or just deal
> >       >> with wait fences in userspace?
> >       >>
> >       >
> >       >My opinion is rework this but make the ordering via an engine
> param
> >       optional.
> >       >
> >       >e.g. A VM can be configured so all binds are ordered within the VM
> >       >
> >       >e.g. A VM can be configured so all binds accept an engine argument
> >       (in
> >       >the case of the i915 likely this is a gem context handle) and
> binds
> >       >ordered with respect to that engine.
> >       >
> >       >This gives UMDs options as the later likely consumes more KMD
> >       resources
> >       >so if a different UMD can live with binds being ordered within
> the VM
> >       >they can use a mode consuming less resources.
> >       >
> >
> >       I think we need to be careful here if we are looking for some out
> of
> >       (submission) order completion of vm_bind/unbind.
> >       In-order completion means, in a batch of binds and unbinds to be
> >       completed in-order, user only needs to specify in-fence for the
> >       first bind/unbind call and the out-fence for the last bind/unbind
> >       call. Also, the VA released by an unbind call can be re-used by
> >       any subsequent bind call in that in-order batch.
> >
> >       These things will break if binding/unbinding were to be allowed to
> >       go out of order (of submission) and the user needs to be extra careful
> >       not to run into premature triggering of the out-fence and the bind
> >       failing
> >       as VA is still in use etc.
> >
> >       Also, VM_BIND binds the provided mapping on the specified address
> >       space
> >       (VM). So, the uapi is not engine/context specific.
> >
> >       We can however add a 'queue' to the uapi which can be one from the
> >       pre-defined queues,
> >       I915_VM_BIND_QUEUE_0
> >       I915_VM_BIND_QUEUE_1
> >       ...
> >       I915_VM_BIND_QUEUE_(N-1)
> >
> >       KMD will spawn an async work queue for each queue which will only
> >       bind the mappings on that queue in the order of submission.
> >       User can assign the queue to per engine or anything like that.
> >
> >       But again here, the user needs to be careful not to deadlock these
> >       queues with a circular dependency of fences.
> >
> >       I prefer adding this later as an extension, based on whether it
> >       really helps with the implementation.
> >
> >     I can tell you right now that having everything on a single in-order
> >     queue will not get us the perf we want.  What vulkan really wants is
> one
> >     of two things:
> >      1. No implicit ordering of VM_BIND ops.  They just happen in
> whatever
> >     their dependencies are resolved and we ensure ordering ourselves by
> >     having a syncobj in the VkQueue.
> >      2. The ability to create multiple VM_BIND queues.  We need at least
> 2
> >     but I don't see why there needs to be a limit besides the limits the
> >     i915 API already has on the number of engines.  Vulkan could expose
> >     multiple sparse binding queues to the client if it's not arbitrarily
> >     limited.
>
> Thanks Jason, Lionel.
>
> Jason, what are you referring to when you say "limits the i915 API already
> has on the number of engines"? I am not sure if there is such an uapi
> today.
>

There's a limit of something like 64 total engines today based on the
number of bits we can cram into the exec flags in execbuffer2.  I think
someone had an extended version that allowed more but I ripped it out
because no one was using it.  Of course, execbuffer3 might not have that
problem at all.

> I am trying to see how many queues we need and don't want it to be
> arbitrarily large, unduly blowing up memory usage and complexity in the
> i915 driver.
>

I expect a Vulkan driver to use at most 2 in the vast majority of cases. I
could imagine a client wanting to create more than 1 sparse queue, in which
case it'll be N+1, but that's unlikely.  As far as complexity goes, once
you allow two, I don't think the complexity is going up by allowing N.  As
for memory usage, creating more queues means more memory.  That's a
trade-off that userspace can make.  Again, the expected number here is 1 or
2 in the vast majority of cases so I don't think you need to worry.


> >     Why?  Because Vulkan has two basic kinds of bind operations and we
> don't
> >     want any dependencies between them:
> >      1. Immediate.  These happen right after BO creation or maybe as
> part of
> >     vkBindImageMemory() or VkBindBufferMemory().  These don't happen on a
> >     queue and we don't want them serialized with anything.  To
> synchronize
> >     with submit, we'll have a syncobj in the VkDevice which is signaled
> by
> >     all immediate bind operations and make submits wait on it.
> >      2. Queued (sparse): These happen on a VkQueue which may be the same
> as
> >     a render/compute queue or may be its own queue.  It's up to us what
> we
> >     want to advertise.  From the Vulkan API PoV, this is like any other
> >     queue.  Operations on it wait on and signal semaphores.  If we have a
> >     VM_BIND engine, we'd provide syncobjs to wait and signal just like
> we do
> >     in execbuf().
> >     The important thing is that we don't want one type of operation to
> block
> >     on the other.  If immediate binds are blocking on sparse binds, it's
> >     going to cause over-synchronization issues.
> >     In terms of the internal implementation, I know that there's going
> to be
> >     a lock on the VM and that we can't actually do these things in
> >     parallel.  That's fine.  Once the dma_fences have signaled and we're
>
> That's correct. It is like a single VM_BIND engine with multiple queues
> feeding to it.
>

Right.  As long as the queues themselves are independent and can block on
dma_fences without holding up other queues, I think we're fine.
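
For the record, a minimal sketch of that model -- per-VM bind queues,
each one an ordered workqueue, all funneling into one VM lock (the
names are illustrative, not from the patches):

        struct i915_vm_bind_queue {
                /* from alloc_ordered_workqueue(): binds submitted to
                 * one queue complete in submission order */
                struct workqueue_struct *wq;
        };

        struct i915_address_space {
                struct mutex vm_lock;   /* the actual PTE updates still
                                         * serialize here */
                struct i915_vm_bind_queue queue[N_QUEUES];
        };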


> >     unblocked to do the bind operation, I don't care if there's a bit of
> >     synchronization due to locking.  That's expected.  What we can't
> afford
> >     to have is an immediate bind operation suddenly blocking on a sparse
> >     operation which is blocked on a compute job that's going to run for
> >     another 5ms.
>
> As the VM_BIND queue is per VM, VM_BIND on one VM doesn't block the VM_BIND
> on other VMs. I am not sure about usecases here, but just wanted to
> clarify.
>

Yes, that's what I would expect.

--Jason



> Niranjana
>
> >     For reference, Windows solves this by allowing arbitrarily many
> paging
> >     queues (what they call a VM_BIND engine/queue).  That design works
> >     pretty well and solves the problems in question.  Again, we could
> just
> >     make everything out-of-order and require using syncobjs to order
> things
> >     as userspace wants. That'd be fine too.
> >     One more note while I'm here: danvet said something on IRC about
> VM_BIND
> >     queues waiting for syncobjs to materialize.  We don't really
> want/need
> >     this.  We already have all the machinery in userspace to handle
> >     wait-before-signal and waiting for syncobj fences to materialize and
> >     that machinery is on by default.  It would actually take MORE work in
> >     Mesa to turn it off and take advantage of the kernel being able to
> wait
> >     for syncobjs to materialize.  Also, getting that right is
> ridiculously
> >     hard and I really don't want to get it wrong in kernel space.  When
> we
> >     do memory fences, wait-before-signal will be a thing.  We don't need
> to
> >     try and make it a thing for syncobj.
> >     --Jason
> >
> >   Thanks Jason,
> >
> >   I missed the bit in the Vulkan spec that we're allowed to have a sparse
> >   queue that does not implement either graphics or compute operations :
> >
> >     "While some implementations may include VK_QUEUE_SPARSE_BINDING_BIT
> >     support in queue families that also include
> >
> >      graphics and compute support, other implementations may only expose
> a
> >     VK_QUEUE_SPARSE_BINDING_BIT-only queue
> >
> >      family."
> >
> >   So it can all be a vm_bind engine that just does bind/unbind
> >   operations.
> >
> >   But yes we need another engine for the immediate/non-sparse operations.
> >
> >   -Lionel
> >
> >
> >
> >       Daniel, any thoughts?
> >
> >       Niranjana
> >
> >       >Matt
> >       >
> >       >>
> >       >> Sorry I noticed this late.
> >       >>
> >       >>
> >       >> -Lionel
> >       >>
> >       >>
>

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [Intel-gfx] [RFC v3 1/3] drm/doc/rfc: VM_BIND feature design document
  2022-06-07 17:12                 ` Jason Ekstrand
@ 2022-06-07 18:18                   ` Niranjana Vishwanathapura
  -1 siblings, 0 replies; 121+ messages in thread
From: Niranjana Vishwanathapura @ 2022-06-07 18:18 UTC (permalink / raw)
  To: Jason Ekstrand
  Cc: Chris Wilson, Intel GFX, Maling list - DRI developers,
	Thomas Hellstrom, Lionel Landwerlin, Daniel Vetter,
	Christian König

On Tue, Jun 07, 2022 at 12:12:03PM -0500, Jason Ekstrand wrote:
>   On Fri, Jun 3, 2022 at 6:52 PM Niranjana Vishwanathapura
>   <niranjana.vishwanathapura@intel.com> wrote:
>
>     On Fri, Jun 03, 2022 at 10:20:25AM +0300, Lionel Landwerlin wrote:
>     >   On 02/06/2022 23:35, Jason Ekstrand wrote:
>     >
>     >     On Thu, Jun 2, 2022 at 3:11 PM Niranjana Vishwanathapura
>     >     <niranjana.vishwanathapura@intel.com> wrote:
>     >
>     >       On Wed, Jun 01, 2022 at 01:28:36PM -0700, Matthew Brost wrote:
>     >       >On Wed, Jun 01, 2022 at 05:25:49PM +0300, Lionel Landwerlin
>     wrote:
>     >       >> On 17/05/2022 21:32, Niranjana Vishwanathapura wrote:
>     >       >> > +VM_BIND/UNBIND ioctl will immediately start
>     binding/unbinding
>     >       the mapping in an
>     >       >> > +async worker. The binding and unbinding will work like a
>     special
>     >       GPU engine.
>     >       >> > +The binding and unbinding operations are serialized and
>     will
>     >       wait on specified
>     >       >> > +input fences before the operation and will signal the
>     output
>     >       fences upon the
>     >       >> > +completion of the operation. Due to serialization,
>     completion of
>     >       an operation
>     >       >> > +will also indicate that all previous operations are also
>     >       complete.
>     >       >>
>     >       >> I guess we should avoid saying "will immediately start
>     >       binding/unbinding" if
>     >       >> there are fences involved.
>     >       >>
>     >       >> And the fact that it's happening in an async worker seems to
>     >       >> imply it's not immediate.
>     >       >>
>     >
>     >       Ok, will fix.
>     >       This was added because in earlier design binding was deferred
>     until
>     >       next execbuff.
>     >       But now it is non-deferred (immediate in that sense). But yah,
>     >       this is confusing and will fix it.
>     >
>     >       >>
>     >       >> I have a question on the behavior of the bind operation when
>     no
>     >       input fence
>     >       >> is provided. Let's say I do:
>     >       >>
>     >       >> VM_BIND (out_fence=fence1)
>     >       >>
>     >       >> VM_BIND (out_fence=fence2)
>     >       >>
>     >       >> VM_BIND (out_fence=fence3)
>     >       >>
>     >       >>
>     >       >> In what order are the fences going to be signaled?
>     >       >>
>     >       >> In the order of VM_BIND ioctls? Or out of order?
>     >       >>
>     >       >> Because you wrote "serialized" I assume it's: in order
>     >       >>
>     >
>     >       Yes, in the order of VM_BIND/UNBIND ioctls. Note that bind and
>     unbind
>     >       will use
>     >       the same queue and hence are ordered.
>     >
>     >       >>
>     >       >> One thing I didn't realize is that because we only get one
>     >       "VM_BIND" engine,
>     >       >> there is a disconnect from the Vulkan specification.
>     >       >>
>     >       >> In Vulkan VM_BIND operations are serialized but per engine.
>     >       >>
>     >       >> So you could have something like this :
>     >       >>
>     >       >> VM_BIND (engine=rcs0, in_fence=fence1, out_fence=fence2)
>     >       >>
>     >       >> VM_BIND (engine=ccs0, in_fence=fence3, out_fence=fence4)
>     >       >>
>     >       >>
>     >       >> fence1 is not signaled
>     >       >>
>     >       >> fence3 is signaled
>     >       >>
>     >       >> So the second VM_BIND will proceed before the first VM_BIND.
>     >       >>
>     >       >>
>     >       >> I guess we can deal with that scenario in userspace by doing
>     the
>     >       wait
>     >       >> ourselves in one thread per engines.
>     >       >>
>     >       >> But then it makes the VM_BIND input fences useless.
>     >       >>
>     >       >>
>     >       >> Daniel : what do you think? Should we rework this or just
>     >       >> deal with wait fences in userspace?
>     >       >>
>     >       >
>     >       >My opinion is rework this but make the ordering via an engine
>     param
>     >       optional.
>     >       >
>     >       >e.g. A VM can be configured so all binds are ordered within the
>     VM
>     >       >
>     >       >e.g. A VM can be configured so all binds accept an engine
>     argument
>     >       (in
>     >       >the case of the i915 likely this is a gem context handle) and
>     binds
>     >       >ordered with respect to that engine.
>     >       >
>     >       >This gives UMDs options as the later likely consumes more KMD
>     >       resources
>     >       >so if a different UMD can live with binds being ordered within
>     the VM
>     >       >they can use a mode consuming less resources.
>     >       >
>     >
>     >       I think we need to be careful here if we are looking for some
>     out of
>     >       (submission) order completion of vm_bind/unbind.
>     >       In-order completion means, in a batch of binds and unbinds to be
>     >       completed in-order, user only needs to specify in-fence for the
>     >       first bind/unbind call and the out-fence for the last
>     >       bind/unbind
>     >       call. Also, the VA released by an unbind call can be re-used by
>     >       any subsequent bind call in that in-order batch.
>     >
>     >       These things will break if binding/unbinding were to be allowed
>     >       to go out of order (of submission) and the user needs to be extra
>     >       careful not to run into premature triggering of the out-fence
>     >       and the bind failing
>     >       as VA is still in use etc.
>     >
>     >       Also, VM_BIND binds the provided mapping on the specified
>     address
>     >       space
>     >       (VM). So, the uapi is not engine/context specific.
>     >
>     >       We can however add a 'queue' to the uapi which can be one from
>     the
>     >       pre-defined queues,
>     >       I915_VM_BIND_QUEUE_0
>     >       I915_VM_BIND_QUEUE_1
>     >       ...
>     >       I915_VM_BIND_QUEUE_(N-1)
>     >
>     >       KMD will spawn an async work queue for each queue which will
>     only
>     >       bind the mappings on that queue in the order of submission.
>     >       User can assign the queue to per engine or anything like that.
>     >
>     >       But again here, the user needs to be careful not to deadlock
>     >       these queues with a circular dependency of fences.
>     >
>     >       I prefer adding this later as an extension, based on whether it
>     >       really helps with the implementation.
>     >
>     >     I can tell you right now that having everything on a single
>     in-order
>     >     queue will not get us the perf we want.  What vulkan really wants
>     is one
>     >     of two things:
>     >      1. No implicit ordering of VM_BIND ops.  They just happen in
>     whatever
>     >     their dependencies are resolved and we ensure ordering ourselves
>     by
>     >     having a syncobj in the VkQueue.
>     >      2. The ability to create multiple VM_BIND queues.  We need at
>     least 2
>     >     but I don't see why there needs to be a limit besides the limits
>     the
>     >     i915 API already has on the number of engines.  Vulkan could
>     expose
>     >     multiple sparse binding queues to the client if it's not
>     arbitrarily
>     >     limited.
>
>     Thanks Jason, Lionel.
>
>     Jason, what are you referring to when you say "limits the i915 API
>     already
>     has on the number of engines"? I am not sure if there is such an uapi
>     today.
>
>   There's a limit of something like 64 total engines today based on the
>   number of bits we can cram into the exec flags in execbuffer2.  I think
>   someone had an extended version that allowed more but I ripped it out
>   because no one was using it.  Of course, execbuffer3 might not have that
>   problem at all.
>

Thanks Jason.
Ok, I am not sure which exec flag that is, but yah, execbuffer3 probably
will not have this limitation. So, we need to define a VM_BIND_MAX_QUEUE
and somehow export it to the user (I am thinking of embedding it in
I915_PARAM_HAS_VM_BIND: bits[0]->HAS_VM_BIND, bits[1:3]->'n', meaning 2^n
queues).
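
For clarity, decoding that (still proposed) encoding in userspace would
look something like:

        int value = 0;
        struct drm_i915_getparam gp = {
                .param = I915_PARAM_HAS_VM_BIND,
                .value = &value,
        };

        if (drmIoctl(fd, DRM_IOCTL_I915_GETPARAM, &gp) == 0 && (value & 1)) {
                /* bits[1:3] hold 'n'; the VM supports 2^n bind queues */
                unsigned int num_queues = 1u << ((value >> 1) & 0x7);
                /* ... create up to num_queues VM_BIND queues ... */
        }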

>     I am trying to see how many queues we need and don't want it to be
>     arbitrarily large, unduly blowing up memory usage and complexity in
>     the i915 driver.
>
>   I expect a Vulkan driver to use at most 2 in the vast majority of cases. I
>   could imagine a client wanting to create more than 1 sparse queue in which
>   case, it'll be N+1 but that's unlikely.  As far as complexity goes, once
>   you allow two, I don't think the complexity is going up by allowing N.  As
>   for memory usage, creating more queues means more memory.  That's a
>   trade-off that userspace can make.  Again, the expected number here is 1
>   or 2 in the vast majority of cases so I don't think you need to worry.
>    

Ok, will start with n=3 meaning 8 queues.
That would require us to create 8 workqueues.
We can change 'n' later if required.
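
A hypothetical shape for the per-queue uapi this implies (the queue_idx
field is illustrative, not from the current RFC):

        struct drm_i915_gem_vm_bind {
                __u32 vm_id;            /* VM in VM_BIND mode */
                __u32 queue_idx;        /* 0 .. 2^n - 1; binds on the same
                                         * queue complete in submission order */
                __u64 start;            /* GPU virtual address start */
                __u64 offset;           /* offset into the GEM object */
                __u64 length;           /* mapping length */
                __u32 handle;           /* GEM object handle */
                __u32 flags;
                __u64 extensions;       /* in/out fence extensions, etc. */
        };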

Niranjana

>
>     >     Why?  Because Vulkan has two basic kinds of bind operations and we
>     don't
>     >     want any dependencies between them:
>     >      1. Immediate.  These happen right after BO creation or maybe as
>     part of
>     >     vkBindImageMemory() or VkBindBufferMemory().  These don't happen
>     on a
>     >     queue and we don't want them serialized with anything.  To
>     synchronize
>     >     with submit, we'll have a syncobj in the VkDevice which is
>     signaled by
>     >     all immediate bind operations and make submits wait on it.
>     >      2. Queued (sparse): These happen on a VkQueue which may be the
>     same as
>     >     a render/compute queue or may be its own queue.  It's up to us
>     what we
>     >     want to advertise.  From the Vulkan API PoV, this is like any
>     other
>     >     queue.  Operations on it wait on and signal semaphores.  If we
>     have a
>     >     VM_BIND engine, we'd provide syncobjs to wait and signal just like
>     we do
>     >     in execbuf().
>     >     The important thing is that we don't want one type of operation to
>     block
>     >     on the other.  If immediate binds are blocking on sparse binds,
>     it's
>     >     going to cause over-synchronization issues.
>     >     In terms of the internal implementation, I know that there's going
>     to be
>     >     a lock on the VM and that we can't actually do these things in
>     >     parallel.  That's fine.  Once the dma_fences have signaled and
>     we're
>
>     That's correct. It is like a single VM_BIND engine with multiple queues
>     feeding to it.
>
>   Right.  As long as the queues themselves are independent and can block on
>   dma_fences without holding up other queues, I think we're fine.
>    
>
>     >     unblocked to do the bind operation, I don't care if there's a bit
>     of
>     >     synchronization due to locking.  That's expected.  What we can't
>     afford
>     >     to have is an immediate bind operation suddenly blocking on a
>     sparse
>     >     operation which is blocked on a compute job that's going to run
>     for
>     >     another 5ms.
>
>     As the VM_BIND queue is per VM, VM_BIND on one VM doesn't block the
>     VM_BIND
>     on other VMs. I am not sure about usecases here, but just wanted to
>     clarify.
>
>   Yes, that's what I would expect.
>   --Jason
>    
>
>     Niranjana
>
>     >     For reference, Windows solves this by allowing arbitrarily many
>     paging
>     >     queues (what they call a VM_BIND engine/queue).  That design works
>     >     pretty well and solves the problems in question.  Again, we could
>     just
>     >     make everything out-of-order and require using syncobjs to order
>     things
>     >     as userspace wants. That'd be fine too.
>     >     One more note while I'm here: danvet said something on IRC about
>     VM_BIND
>     >     queues waiting for syncobjs to materialize.  We don't really
>     want/need
>     >     this.  We already have all the machinery in userspace to handle
>     >     wait-before-signal and waiting for syncobj fences to materialize
>     and
>     >     that machinery is on by default.  It would actually take MORE work
>     in
>     >     Mesa to turn it off and take advantage of the kernel being able to
>     wait
>     >     for syncobjs to materialize.  Also, getting that right is
>     ridiculously
>     >     hard and I really don't want to get it wrong in kernel space. 
>     When we
>     >     do memory fences, wait-before-signal will be a thing.  We don't
>     need to
>     >     try and make it a thing for syncobj.
>     >     --Jason
>     >
>     >   Thanks Jason,
>     >
>     >   I missed the bit in the Vulkan spec that we're allowed to have a
>     sparse
>     >   queue that does not implement either graphics or compute operations
>     :
>     >
>     >     "While some implementations may include
>     VK_QUEUE_SPARSE_BINDING_BIT
>     >     support in queue families that also include
>     >
>     >      graphics and compute support, other implementations may only
>     expose a
>     >     VK_QUEUE_SPARSE_BINDING_BIT-only queue
>     >
>     >      family."
>     >
>     >   So it can all be a vm_bind engine that just does bind/unbind
>     >   operations.
>     >
>     >   But yes we need another engine for the immediate/non-sparse
>     operations.
>     >
>     >   -Lionel
>     >
>     >     
>     >
>     >       Daniel, any thoughts?
>     >
>     >       Niranjana
>     >
>     >       >Matt
>     >       >
>     >       >>
>     >       >> Sorry I noticed this late.
>     >       >>
>     >       >>
>     >       >> -Lionel
>     >       >>
>     >       >>

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [Intel-gfx] [RFC v3 3/3] drm/doc/rfc: VM_BIND uapi definition
  2022-06-07 10:27   ` Tvrtko Ursulin
@ 2022-06-07 19:37     ` Niranjana Vishwanathapura
  2022-06-08  7:17       ` Tvrtko Ursulin
  0 siblings, 1 reply; 121+ messages in thread
From: Niranjana Vishwanathapura @ 2022-06-07 19:37 UTC (permalink / raw)
  To: Tvrtko Ursulin
  Cc: intel-gfx, chris.p.wilson, thomas.hellstrom, dri-devel,
	daniel.vetter, christian.koenig

On Tue, Jun 07, 2022 at 11:27:14AM +0100, Tvrtko Ursulin wrote:
>
>On 17/05/2022 19:32, Niranjana Vishwanathapura wrote:
>>VM_BIND and related uapi definitions
>>
>>v2: Ensure proper kernel-doc formatting with cross references.
>>     Also add new uapi and documentation as per review comments
>>     from Daniel.
>>
>>Signed-off-by: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>
>>---
>>  Documentation/gpu/rfc/i915_vm_bind.h | 399 +++++++++++++++++++++++++++
>>  1 file changed, 399 insertions(+)
>>  create mode 100644 Documentation/gpu/rfc/i915_vm_bind.h
>>
>>diff --git a/Documentation/gpu/rfc/i915_vm_bind.h b/Documentation/gpu/rfc/i915_vm_bind.h
>>new file mode 100644
>>index 000000000000..589c0a009107
>>--- /dev/null
>>+++ b/Documentation/gpu/rfc/i915_vm_bind.h
>>@@ -0,0 +1,399 @@
>>+/* SPDX-License-Identifier: MIT */
>>+/*
>>+ * Copyright © 2022 Intel Corporation
>>+ */
>>+
>>+/**
>>+ * DOC: I915_PARAM_HAS_VM_BIND
>>+ *
>>+ * VM_BIND feature availability.
>>+ * See typedef drm_i915_getparam_t param.
>>+ */
>>+#define I915_PARAM_HAS_VM_BIND		57
>>+
>>+/**
>>+ * DOC: I915_VM_CREATE_FLAGS_USE_VM_BIND
>>+ *
>>+ * Flag to opt-in for VM_BIND mode of binding during VM creation.
>>+ * See struct drm_i915_gem_vm_control flags.
>>+ *
>>+ * A VM in VM_BIND mode will not support the older execbuff mode of binding.
>>+ * In VM_BIND mode, the execbuff ioctl will not accept any execlist (i.e., the
>>+ * &drm_i915_gem_execbuffer2.buffer_count must be 0).
>>+ * Also, &drm_i915_gem_execbuffer2.batch_start_offset and
>>+ * &drm_i915_gem_execbuffer2.batch_len must be 0.
>>+ * DRM_I915_GEM_EXECBUFFER_EXT_BATCH_ADDRESSES extension must be provided
>>+ * to pass in the batch buffer addresses.
>>+ *
>>+ * Additionally, I915_EXEC_NO_RELOC, I915_EXEC_HANDLE_LUT and
>>+ * I915_EXEC_BATCH_FIRST of &drm_i915_gem_execbuffer2.flags must be 0
>>+ * (not used) in VM_BIND mode. I915_EXEC_USE_EXTENSIONS flag must always be
>>+ * set (See struct drm_i915_gem_execbuffer_ext_batch_addresses).
>>+ * The buffers_ptr, buffer_count, batch_start_offset and batch_len fields
>>+ * of struct drm_i915_gem_execbuffer2 are also not used and must be 0.
>>+ */
>>+#define I915_VM_CREATE_FLAGS_USE_VM_BIND	(1 << 0)
>>+
>>+/**
>>+ * DOC: I915_CONTEXT_CREATE_FLAGS_LONG_RUNNING
>>+ *
>>+ * Flag to declare context as long running.
>>+ * See struct drm_i915_gem_context_create_ext flags.
>>+ *
>>+ * Usage of dma-fence expects that they complete in a reasonable amount of time.
>>+ * Compute on the other hand can be long running. Hence it is not appropriate
>>+ * for compute contexts to export request completion dma-fence to user.
>>+ * The dma-fence usage will be limited to in-kernel consumption only.
>>+ * Compute contexts need to use user/memory fence.
>>+ *
>>+ * So, long running contexts do not support output fences. Hence,
>>+ * I915_EXEC_FENCE_OUT (See &drm_i915_gem_execbuffer2.flags) and
>>+ * I915_EXEC_FENCE_SIGNAL (See &drm_i915_gem_exec_fence.flags) are expected
>>+ * not to be used.
>>+ *
>>+ * DRM_I915_GEM_WAIT ioctl call is also not supported for objects mapped
>>+ * to long running contexts.
>>+ */
>>+#define I915_CONTEXT_CREATE_FLAGS_LONG_RUNNING   (1u << 2)
>>+
>>+/* VM_BIND related ioctls */
>>+#define DRM_I915_GEM_VM_BIND		0x3d
>>+#define DRM_I915_GEM_VM_UNBIND		0x3e
>>+#define DRM_I915_GEM_WAIT_USER_FENCE	0x3f
>>+
>>+#define DRM_IOCTL_I915_GEM_VM_BIND		DRM_IOWR(DRM_COMMAND_BASE + DRM_I915_GEM_VM_BIND, struct drm_i915_gem_vm_bind)
>>+#define DRM_IOCTL_I915_GEM_VM_UNBIND		DRM_IOWR(DRM_COMMAND_BASE + DRM_I915_GEM_VM_UNBIND, struct drm_i915_gem_vm_bind)
>>+#define DRM_IOCTL_I915_GEM_WAIT_USER_FENCE	DRM_IOWR(DRM_COMMAND_BASE + DRM_I915_GEM_WAIT_USER_FENCE, struct drm_i915_gem_wait_user_fence)
>>+
>>+/**
>>+ * struct drm_i915_gem_vm_bind - VA to object mapping to bind.
>>+ *
>>+ * This structure is passed to VM_BIND ioctl and specifies the mapping of GPU
>>+ * virtual address (VA) range to the section of an object that should be bound
>>+ * in the device page table of the specified address space (VM).
>>+ * The VA range specified must be unique (i.e., not currently bound) and can
>>+ * be mapped to the whole object or a section of the object (partial binding).
>>+ * Multiple VA mappings can be created to the same section of the object
>>+ * (aliasing).
>>+ */
>>+struct drm_i915_gem_vm_bind {
>>+	/** @vm_id: VM (address space) id to bind */
>>+	__u32 vm_id;
>>+
>>+	/** @handle: Object handle */
>>+	__u32 handle;
>>+
>>+	/** @start: Virtual Address start to bind */
>>+	__u64 start;
>>+
>>+	/** @offset: Offset in object to bind */
>>+	__u64 offset;
>>+
>>+	/** @length: Length of mapping to bind */
>>+	__u64 length;
>
>Does it support, or should it, equivalent of EXEC_OBJECT_PAD_TO_SIZE? 
>Or if not userspace is expected to map the remainder of the space to a 
>dummy object? In which case would there be any alignment/padding 
>issues preventing the two bind to be placed next to each other?
>
>I ask because someone from the compute side asked me about a problem 
>with their strategy of dealing with overfetch and I suggested pad to 
>size.
>

Thanks Tvrtko,
I think we shouldn't need it. As VA assignment is completely pushed
to userspace with VM_BIND, no padding should be necessary once the
'start' and 'size' alignment conditions are met.

I will add some documentation on the alignment requirements here.
Generally, 'start' and 'size' should be 4K aligned. But I think
when we have 64K lmem page sizes (dg2 and xehpsdv), they need to
be 64K aligned.
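
A minimal sketch of a bind honoring those alignments, using the
proposed structures from this RFC (illustrative only; 'vm_id' and
'bo_handle' are placeholders):

    struct drm_i915_gem_vm_bind bind = {
        .vm_id  = vm_id,
        .handle = bo_handle,
        .start  = 0x1000000,  /* VA, 64K aligned for 64K lmem pages */
        .offset = 0,          /* offset into the object, same alignment */
        .length = 0x10000,    /* 64K mapping length */
        .flags  = 0,
    };

    ioctl(fd, DRM_IOCTL_I915_GEM_VM_BIND, &bind);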

Niranjana

>Regards,
>
>Tvrtko
>
>>+
>>+	/**
>>+	 * @flags: Supported flags are,
>>+	 *
>>+	 * I915_GEM_VM_BIND_READONLY:
>>+	 * Mapping is read-only.
>>+	 *
>>+	 * I915_GEM_VM_BIND_CAPTURE:
>>+	 * Capture this mapping in the dump upon GPU error.
>>+	 */
>>+	__u64 flags;
>>+#define I915_GEM_VM_BIND_READONLY    (1 << 0)
>>+#define I915_GEM_VM_BIND_CAPTURE     (1 << 1)
>>+
>>+	/** @extensions: 0-terminated chain of extensions for this mapping. */
>>+	__u64 extensions;
>>+};
>>+
>>+/**
>>+ * struct drm_i915_gem_vm_unbind - VA to object mapping to unbind.
>>+ *
>>+ * This structure is passed to VM_UNBIND ioctl and specifies the GPU virtual
>>+ * address (VA) range that should be unbound from the device page table of the
>>+ * specified address space (VM). The specified VA range must match one of the
>>+ * mappings created with the VM_BIND ioctl. TLB is flushed upon unbind
>>+ * completion.
>>+ */
>>+struct drm_i915_gem_vm_unbind {
>>+	/** @vm_id: VM (address space) id to bind */
>>+	__u32 vm_id;
>>+
>>+	/** @rsvd: Reserved for future use; must be zero. */
>>+	__u32 rsvd;
>>+
>>+	/** @start: Virtual Address start to unbind */
>>+	__u64 start;
>>+
>>+	/** @length: Length of mapping to unbind */
>>+	__u64 length;
>>+
>>+	/** @flags: reserved for future usage, currently MBZ */
>>+	__u64 flags;
>>+
>>+	/** @extensions: 0-terminated chain of extensions for this mapping. */
>>+	__u64 extensions;
>>+};
>>+
>>+/**
>>+ * struct drm_i915_vm_bind_fence - An input or output fence for the vm_bind
>>+ * or the vm_unbind work.
>>+ *
>>+ * The vm_bind or vm_unbind async worker will wait for the input fence to signal
>>+ * before starting the binding or unbinding.
>>+ *
>>+ * The vm_bind or vm_unbind async worker will signal the returned output fence
>>+ * after the completion of binding or unbinding.
>>+ */
>>+struct drm_i915_vm_bind_fence {
>>+	/** @handle: User's handle for a drm_syncobj to wait on or signal. */
>>+	__u32 handle;
>>+
>>+	/**
>>+	 * @flags: Supported flags are,
>>+	 *
>>+	 * I915_VM_BIND_FENCE_WAIT:
>>+	 * Wait for the input fence before binding/unbinding
>>+	 *
>>+	 * I915_VM_BIND_FENCE_SIGNAL:
>>+	 * Return bind/unbind completion fence as output
>>+	 */
>>+	__u32 flags;
>>+#define I915_VM_BIND_FENCE_WAIT            (1<<0)
>>+#define I915_VM_BIND_FENCE_SIGNAL          (1<<1)
>>+#define __I915_VM_BIND_FENCE_UNKNOWN_FLAGS (-(I915_VM_BIND_FENCE_SIGNAL << 1))
>>+};
>>+
>>+/**
>>+ * struct drm_i915_vm_bind_ext_timeline_fences - Timeline fences for vm_bind
>>+ * and vm_unbind.
>>+ *
>>+ * This structure describes an array of timeline drm_syncobj and associated
>>+ * points for timeline variants of drm_syncobj. These timeline 'drm_syncobj's
>>+ * can be input or output fences (See struct drm_i915_vm_bind_fence).
>>+ */
>>+struct drm_i915_vm_bind_ext_timeline_fences {
>>+#define I915_VM_BIND_EXT_TIMELINE_FENCES	0
>>+	/** @base: Extension link. See struct i915_user_extension. */
>>+	struct i915_user_extension base;
>>+
>>+	/**
>>+	 * @fence_count: Number of elements in the @handles_ptr & @value_ptr
>>+	 * arrays.
>>+	 */
>>+	__u64 fence_count;
>>+
>>+	/**
>>+	 * @handles_ptr: Pointer to an array of struct drm_i915_vm_bind_fence
>>+	 * of length @fence_count.
>>+	 */
>>+	__u64 handles_ptr;
>>+
>>+	/**
>>+	 * @values_ptr: Pointer to an array of u64 values of length
>>+	 * @fence_count.
>>+	 * Values must be 0 for a binary drm_syncobj. A value of 0 for a
>>+	 * timeline drm_syncobj is invalid as it turns a drm_syncobj into a
>>+	 * binary one.
>>+	 */
>>+	__u64 values_ptr;
>>+};
>>+
>>+/**
>>+ * struct drm_i915_vm_bind_user_fence - An input or output user fence for the
>>+ * vm_bind or the vm_unbind work.
>>+ *
>>+ * The vm_bind or vm_unbind async worker will wait for the input fence (value at
>>+ * @addr to become equal to @val) before starting the binding or unbinding.
>>+ *
>>+ * The vm_bind or vm_unbind async worker will signal the output fence after
>>+ * the completion of binding or unbinding by writing @val to memory location at
>>+ * @addr
>>+ */
>>+struct drm_i915_vm_bind_user_fence {
>>+	/** @addr: User/Memory fence qword aligned process virtual address */
>>+	__u64 addr;
>>+
>>+	/** @val: User/Memory fence value to be written after bind completion */
>>+	__u64 val;
>>+
>>+	/**
>>+	 * @flags: Supported flags are,
>>+	 *
>>+	 * I915_VM_BIND_USER_FENCE_WAIT:
>>+	 * Wait for the input fence before binding/unbinding
>>+	 *
>>+	 * I915_VM_BIND_USER_FENCE_SIGNAL:
>>+	 * Return bind/unbind completion fence as output
>>+	 */
>>+	__u32 flags;
>>+#define I915_VM_BIND_USER_FENCE_WAIT            (1<<0)
>>+#define I915_VM_BIND_USER_FENCE_SIGNAL          (1<<1)
>>+#define __I915_VM_BIND_USER_FENCE_UNKNOWN_FLAGS \
>>+	(-(I915_VM_BIND_USER_FENCE_SIGNAL << 1))
>>+};
>>+
>>+/**
>>+ * struct drm_i915_vm_bind_ext_user_fence - User/memory fences for vm_bind
>>+ * and vm_unbind.
>>+ *
>>+ * These user fences can be input or output fences
>>+ * (See struct drm_i915_vm_bind_user_fence).
>>+ */
>>+struct drm_i915_vm_bind_ext_user_fence {
>>+#define I915_VM_BIND_EXT_USER_FENCES	1
>>+	/** @base: Extension link. See struct i915_user_extension. */
>>+	struct i915_user_extension base;
>>+
>>+	/** @fence_count: Number of elements in the @user_fence_ptr array. */
>>+	__u64 fence_count;
>>+
>>+	/**
>>+	 * @user_fence_ptr: Pointer to an array of
>>+	 * struct drm_i915_vm_bind_user_fence of length @fence_count.
>>+	 */
>>+	__u64 user_fence_ptr;
>>+};
>>+
>>+/**
>>+ * struct drm_i915_gem_execbuffer_ext_batch_addresses - Array of batch buffer
>>+ * gpu virtual addresses.
>>+ *
>>+ * In the execbuff ioctl (See struct drm_i915_gem_execbuffer2), this extension
>>+ * must always be appended in VM_BIND mode and it will be an error to
>>+ * append this extension in the older non-VM_BIND mode.
>>+ */
>>+struct drm_i915_gem_execbuffer_ext_batch_addresses {
>>+#define DRM_I915_GEM_EXECBUFFER_EXT_BATCH_ADDRESSES	1
>>+	/** @base: Extension link. See struct i915_user_extension. */
>>+	struct i915_user_extension base;
>>+
>>+	/** @count: Number of addresses in the addr array. */
>>+	__u32 count;
>>+
>>+	/** @addr: An array of batch gpu virtual addresses. */
>>+	__u64 addr[0];
>>+};
>>+
>>+/**
>>+ * struct drm_i915_gem_execbuffer_ext_user_fence - First level batch completion
>>+ * signaling extension.
>>+ *
>>+ * This extension allows user to attach a user fence (@addr, @value pair) to an
>>+ * execbuf to be signaled by the command streamer after the completion of first
>>+ * level batch, by writing the @value at specified @addr and triggering an
>>+ * interrupt.
>>+ * User can either poll for this user fence to signal or can also wait on it
>>+ * with i915_gem_wait_user_fence ioctl.
>>+ * This is very useful for long running contexts, where waiting on a dma-fence
>>+ * by the user (like the i915_gem_wait ioctl) is not supported.
>>+ */
>>+struct drm_i915_gem_execbuffer_ext_user_fence {
>>+#define DRM_I915_GEM_EXECBUFFER_EXT_USER_FENCE		2
>>+	/** @base: Extension link. See struct i915_user_extension. */
>>+	struct i915_user_extension base;
>>+
>>+	/**
>>+	 * @addr: User/Memory fence qword aligned GPU virtual address.
>>+	 *
>>+	 * Address has to be a valid GPU virtual address at the time of
>>+	 * first level batch completion.
>>+	 */
>>+	__u64 addr;
>>+
>>+	/**
>>+	 * @value: User/Memory fence Value to be written to above address
>>+	 * after first level batch completes.
>>+	 */
>>+	__u64 value;
>>+
>>+	/** @rsvd: Reserved for future extensions, MBZ */
>>+	__u64 rsvd;
>>+};
>>+
>>+/**
>>+ * struct drm_i915_gem_create_ext_vm_private - Extension to make the object
>>+ * private to the specified VM.
>>+ *
>>+ * See struct drm_i915_gem_create_ext.
>>+ */
>>+struct drm_i915_gem_create_ext_vm_private {
>>+#define I915_GEM_CREATE_EXT_VM_PRIVATE		2
>>+	/** @base: Extension link. See struct i915_user_extension. */
>>+	struct i915_user_extension base;
>>+
>>+	/** @vm_id: Id of the VM to which the object is private */
>>+	__u32 vm_id;
>>+};
>>+
>>+/**
>>+ * struct drm_i915_gem_wait_user_fence - Wait on user/memory fence.
>>+ *
>>+ * User/Memory fence can be woken up either by:
>>+ *
>>+ * 1. GPU context indicated by @ctx_id, or,
>>+ * 2. Kernel driver async worker upon I915_UFENCE_WAIT_SOFT.
>>+ *    @ctx_id is ignored when this flag is set.
>>+ *
>>+ * Wakeup condition is,
>>+ * ``((*addr & mask) op (value & mask))``
>>+ *
>>+ * See :ref:`Documentation/driver-api/dma-buf.rst <indefinite_dma_fences>`
>>+ */
>>+struct drm_i915_gem_wait_user_fence {
>>+	/** @extensions: Zero-terminated chain of extensions. */
>>+	__u64 extensions;
>>+
>>+	/** @addr: User/Memory fence address */
>>+	__u64 addr;
>>+
>>+	/** @ctx_id: Id of the Context which will signal the fence. */
>>+	__u32 ctx_id;
>>+
>>+	/** @op: Wakeup condition operator */
>>+	__u16 op;
>>+#define I915_UFENCE_WAIT_EQ      0
>>+#define I915_UFENCE_WAIT_NEQ     1
>>+#define I915_UFENCE_WAIT_GT      2
>>+#define I915_UFENCE_WAIT_GTE     3
>>+#define I915_UFENCE_WAIT_LT      4
>>+#define I915_UFENCE_WAIT_LTE     5
>>+#define I915_UFENCE_WAIT_BEFORE  6
>>+#define I915_UFENCE_WAIT_AFTER   7
>>+
>>+	/**
>>+	 * @flags: Supported flags are,
>>+	 *
>>+	 * I915_UFENCE_WAIT_SOFT:
>>+	 *
>>+	 * To be woken up by i915 driver async worker (not by GPU).
>>+	 *
>>+	 * I915_UFENCE_WAIT_ABSTIME:
>>+	 *
>>+	 * Wait timeout specified as absolute time.
>>+	 */
>>+	__u16 flags;
>>+#define I915_UFENCE_WAIT_SOFT    0x1
>>+#define I915_UFENCE_WAIT_ABSTIME 0x2
>>+
>>+	/** @value: Wakeup value */
>>+	__u64 value;
>>+
>>+	/** @mask: Wakeup mask */
>>+	__u64 mask;
>>+#define I915_UFENCE_WAIT_U8     0xffu
>>+#define I915_UFENCE_WAIT_U16    0xffffu
>>+#define I915_UFENCE_WAIT_U32    0xfffffffful
>>+#define I915_UFENCE_WAIT_U64    0xffffffffffffffffull
>>+
>>+	/**
>>+	 * @timeout: Wait timeout in nanoseconds.
>>+	 *
>>+	 * If I915_UFENCE_WAIT_ABSTIME flag is set, then the timeout is the
>>+	 * absolute time in nsec.
>>+	 */
>>+	__s64 timeout;
>>+};
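
For illustration, a hedged sketch of waiting on such a user fence with
the ioctl proposed above ('fence_addr', 'ctx_id' and 'expected_value'
are placeholders, and the uapi is still an RFC):

    struct drm_i915_gem_wait_user_fence wait = {
        .addr    = fence_addr,            /* qword aligned fence address */
        .ctx_id  = ctx_id,                /* context expected to signal it */
        .op      = I915_UFENCE_WAIT_GTE,  /* wake when *addr >= value */
        .value   = expected_value,
        .mask    = I915_UFENCE_WAIT_U64,
        .timeout = 1000000000,            /* 1 sec, relative nsec */
    };

    ioctl(fd, DRM_IOCTL_I915_GEM_WAIT_USER_FENCE, &wait);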

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [Intel-gfx] [RFC v3 3/3] drm/doc/rfc: VM_BIND uapi definition
  2022-06-07 10:42               ` Tvrtko Ursulin
@ 2022-06-07 21:25                 ` Niranjana Vishwanathapura
  2022-06-08  7:34                   ` Tvrtko Ursulin
  0 siblings, 1 reply; 121+ messages in thread
From: Niranjana Vishwanathapura @ 2022-06-07 21:25 UTC (permalink / raw)
  To: Tvrtko Ursulin
  Cc: Zanoni, Paulo R, intel-gfx, dri-devel, Hellstrom, Thomas, Wilson,
	Chris P, Vetter, Daniel, christian.koenig

On Tue, Jun 07, 2022 at 11:42:08AM +0100, Tvrtko Ursulin wrote:
>
>On 03/06/2022 07:53, Niranjana Vishwanathapura wrote:
>>On Wed, Jun 01, 2022 at 10:08:35PM -0700, Niranjana Vishwanathapura wrote:
>>>On Wed, Jun 01, 2022 at 11:27:17AM +0200, Daniel Vetter wrote:
>>>>On Wed, 1 Jun 2022 at 11:03, Dave Airlie <airlied@gmail.com> wrote:
>>>>>
>>>>>On Tue, 24 May 2022 at 05:20, Niranjana Vishwanathapura
>>>>><niranjana.vishwanathapura@intel.com> wrote:
>>>>>>
>>>>>>On Thu, May 19, 2022 at 04:07:30PM -0700, Zanoni, Paulo R wrote:
>>>>>>>On Tue, 2022-05-17 at 11:32 -0700, Niranjana Vishwanathapura wrote:
>>>>>>>> VM_BIND and related uapi definitions
>>>>>>>>
>>>>>>>> v2: Ensure proper kernel-doc formatting with cross references.
>>>>>>>>     Also add new uapi and documentation as per review comments
>>>>>>>>     from Daniel.
>>>>>>>>
>>>>>>>> Signed-off-by: Niranjana Vishwanathapura 
>>>>>><niranjana.vishwanathapura@intel.com>
>>>>>>>> ---
>>>>>>>>  Documentation/gpu/rfc/i915_vm_bind.h | 399 
>>>>>>+++++++++++++++++++++++++++
>>>>>>>>  1 file changed, 399 insertions(+)
>>>>>>>>  create mode 100644 Documentation/gpu/rfc/i915_vm_bind.h
>>>>>>>>
>>>>>>>> diff --git a/Documentation/gpu/rfc/i915_vm_bind.h 
>>>>>>b/Documentation/gpu/rfc/i915_vm_bind.h
>>>>>>>> new file mode 100644
>>>>>>>> index 000000000000..589c0a009107
>>>>>>>> --- /dev/null
>>>>>>>> +++ b/Documentation/gpu/rfc/i915_vm_bind.h
>>>>>>>> @@ -0,0 +1,399 @@
>>>>>>>> +/* SPDX-License-Identifier: MIT */
>>>>>>>> +/*
>>>>>>>> + * Copyright © 2022 Intel Corporation
>>>>>>>> + */
>>>>>>>> +
>>>>>>>> +/**
>>>>>>>> + * DOC: I915_PARAM_HAS_VM_BIND
>>>>>>>> + *
>>>>>>>> + * VM_BIND feature availability.
>>>>>>>> + * See typedef drm_i915_getparam_t param.
>>>>>>>> + */
>>>>>>>> +#define I915_PARAM_HAS_VM_BIND               57
>>>>>>>> +
>>>>>>>> +/**
>>>>>>>> + * DOC: I915_VM_CREATE_FLAGS_USE_VM_BIND
>>>>>>>> + *
>>>>>>>> + * Flag to opt-in for VM_BIND mode of binding during VM creation.
>>>>>>>> + * See struct drm_i915_gem_vm_control flags.
>>>>>>>> + *
>>>>>>>> + * A VM in VM_BIND mode will not support the older 
>>>>>>execbuff mode of binding.
>>>>>>>> + * In VM_BIND mode, execbuff ioctl will not accept any 
>>>>>>execlist (ie., the
>>>>>>>> + * &drm_i915_gem_execbuffer2.buffer_count must be 0).
>>>>>>>> + * Also, &drm_i915_gem_execbuffer2.batch_start_offset and
>>>>>>>> + * &drm_i915_gem_execbuffer2.batch_len must be 0.
>>>>>>>> + * DRM_I915_GEM_EXECBUFFER_EXT_BATCH_ADDRESSES extension 
>>>>>>must be provided
>>>>>>>> + * to pass in the batch buffer addresses.
>>>>>>>> + *
>>>>>>>> + * Additionally, I915_EXEC_NO_RELOC, I915_EXEC_HANDLE_LUT and
>>>>>>>> + * I915_EXEC_BATCH_FIRST of 
>>>>>>&drm_i915_gem_execbuffer2.flags must be 0
>>>>>>>> + * (not used) in VM_BIND mode. I915_EXEC_USE_EXTENSIONS 
>>>>>>flag must always be
>>>>>>>> + * set (See struct drm_i915_gem_execbuffer_ext_batch_addresses).
>>>>>>>> + * The buffers_ptr, buffer_count, batch_start_offset and 
>>>>>>batch_len fields
>>>>>>>> + * of struct drm_i915_gem_execbuffer2 are also not used 
>>>>>>and must be 0.
>>>>>>>> + */
>>>>>>>
>>>>>>>From that description, it seems we have:
>>>>>>>
>>>>>>>struct drm_i915_gem_execbuffer2 {
>>>>>>>        __u64 buffers_ptr;              -> must be 0 (new)
>>>>>>>        __u32 buffer_count;             -> must be 0 (new)
>>>>>>>        __u32 batch_start_offset;       -> must be 0 (new)
>>>>>>>        __u32 batch_len;                -> must be 0 (new)
>>>>>>>        __u32 DR1;                      -> must be 0 (old)
>>>>>>>        __u32 DR4;                      -> must be 0 (old)
>>>>>>>        __u32 num_cliprects; (fences)   -> must be 0 since 
>>>>>>using extensions
>>>>>>>        __u64 cliprects_ptr; (fences, extensions) -> 
>>>>>>contains an actual pointer!
>>>>>>>        __u64 flags;                    -> some flags must be 0 (new)
>>>>>>>        __u64 rsvd1; (context info)     -> repurposed field (old)
>>>>>>>        __u64 rsvd2;                    -> unused
>>>>>>>};
>>>>>>>
>>>>>>>Based on that, why can't we just get drm_i915_gem_execbuffer3 instead
>>>>>>>of adding even more complexity to an already abused interface? While
>>>>>>>the Vulkan-like extension thing is really nice, I don't think what
>>>>>>>we're doing here is extending the ioctl usage, we're completely
>>>>>>>changing how the base struct should be interpreted based on 
>>>>>>how the VM
>>>>>>>was created (which is an entirely different ioctl).
>>>>>>>
>>>>>>>From Rusty Russel's API Design grading, drm_i915_gem_execbuffer2 is
>>>>>>>already at -6 without these changes. I think after vm_bind we'll need
>>>>>>>to create a -11 entry just to deal with this ioctl.
>>>>>>>
>>>>>>
>>>>>>The only change here is removing the execlist support for VM_BIND
>>>>>>mode (other than natural extensions).
>>>>>>Adding a new execbuffer3 was considered, but I think we need 
>>>>>>to be careful
>>>>>>with that as that goes beyond the VM_BIND support, including 
>>>>>>any future
>>>>>>requirements (as we don't want an execbuffer4 after VM_BIND).
>>>>>
>>>>>Why not? it's not like adding extensions here is really that different
>>>>>than adding new ioctls.
>>>>>
>>>>>I definitely think this deserves an execbuffer3 without even
>>>>>considering future requirements. Just  to burn down the old
>>>>>requirements and pointless fields.
>>>>>
>>>>>Make execbuffer3 be vm bind only, no relocs, no legacy bits, leave the
>>>>>older sw on execbuf2 for ever.
>>>>
>>>>I guess another point in favour of execbuf3 would be that it's less
>>>>midlayer. If we share the entry point then there's quite a few vfuncs
>>>>needed to cleanly split out the vm_bind paths from the legacy
>>>>reloc/softping paths.
>>>>
>>>>If we invert this and do execbuf3, then there's the existing ioctl
>>>>vfunc, and then we share code (where it even makes sense, probably
>>>>request setup/submit need to be shared, anything else is probably
>>>>cleaner to just copypaste) with the usual helper approach.
>>>>
>>>>Also that would guarantee that really none of the old concepts like
>>>>i915_active on the vma or vma open counts and all that stuff leaks
>>>>into the new vm_bind execbuf.
>>>>
>>>>Finally I also think that copypasting would make backporting easier,
>>>>or at least more flexible, since it should make it easier to have the
>>>>upstream vm_bind co-exist with all the other things we have. Without
>>>>huge amounts of conflicts (or at least much less) that pushing a pile
>>>>of vfuncs into the existing code would cause.
>>>>
>>>>So maybe we should do this?
>>>
>>>Thanks Dave, Daniel.
>>>There are a few things that will be common between execbuf2 and
>>>execbuf3, like request setup/submit (as you said), fence handling 
>>>(timeline fences, fence array, composite fences), engine 
>>>selection,
>>>etc. Also, many of the 'flags' will be there in execbuf3 also (but
>>>bit position will differ).
>>>But I guess these should be fine as the suggestion here is to
>>>copy-paste the execbuff code and having a shared code where possible.
>>>Besides, we can stop supporting some older features in execbuff3
>>>(like fence array in favor of newer timeline fences), which will
>>>further reduce common code.
>>>
>>>Ok, I will update this series by adding execbuf3 and send out soon.
>>>
>>
>>Does this sound reasonable?
>>
>>struct drm_i915_gem_execbuffer3 {
>>        __u32 ctx_id;        /* previously execbuffer2.rsvd1 */
>>
>>        __u32 batch_count;
>>        __u64 batch_addr_ptr;    /* Pointer to an array of batch gpu 
>>virtual addresses */
>
>Casual stumble upon..
>
>Alternatively you could embed N pointers to make life a bit easier for 
>both userspace and kernel side. Yes, but then "N batch buffers should 
>be enough for everyone" problem.. :)
>

Thanks Tvrtko,
Yes, hence the batch_addr_ptr.
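
For example (a sketch against the execbuffer3 struct proposed above,
which is not final uapi; 'batch0_va', 'batch1_va' and 'ctx_id' are
placeholders):

    __u64 batches[2] = { batch0_va, batch1_va };

    struct drm_i915_gem_execbuffer3 eb3 = {
        .ctx_id         = ctx_id,
        .batch_count    = 2,
        .batch_addr_ptr = (__u64)(uintptr_t)batches,
    };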

>>
>>        __u64 flags;
>>#define I915_EXEC3_RING_MASK              (0x3f)
>>#define I915_EXEC3_DEFAULT                (0<<0)
>>#define I915_EXEC3_RENDER                 (1<<0)
>>#define I915_EXEC3_BSD                    (2<<0)
>>#define I915_EXEC3_BLT                    (3<<0)
>>#define I915_EXEC3_VEBOX                  (4<<0)
>>
>>#define I915_EXEC3_SECURE               (1<<6)
>>#define I915_EXEC3_IS_PINNED            (1<<7)
>>
>>#define I915_EXEC3_BSD_SHIFT     (8)
>>#define I915_EXEC3_BSD_MASK      (3 << I915_EXEC3_BSD_SHIFT)
>>#define I915_EXEC3_BSD_DEFAULT   (0 << I915_EXEC3_BSD_SHIFT)
>>#define I915_EXEC3_BSD_RING1     (1 << I915_EXEC3_BSD_SHIFT)
>>#define I915_EXEC3_BSD_RING2     (2 << I915_EXEC3_BSD_SHIFT)
>
>I'd suggest legacy engine selection is unwanted, especially not with 
>the convoluted BSD1/2 flags. Can we just require context with engine 
>map and index? Or if default context has to be supported then I'd 
>suggest ...class_instance for that mode.
>

Ok, I will be happy to remove it and only support contexts with
an engine map, if UMDs agree on that.

>>#define I915_EXEC3_FENCE_IN             (1<<10)
>>#define I915_EXEC3_FENCE_OUT            (1<<11)
>>#define I915_EXEC3_FENCE_SUBMIT         (1<<12)
>
>People are likely to object to submit fence since generic mechanism to 
>align submissions was rejected.
>

Ok, again, I can remove it if UMDs are ok with it.

>>
>>        __u64 in_out_fence;        /* previously execbuffer2.rsvd2 */
>
>New ioctl you can afford dedicated fields.
>

Yes, but as I asked below, I am not sure if we need this or whether
the timeline fence array extension we have is good enough.

>In any case I suggest you involve UMD folks in designing it.
>

Yah.
Paulo, Lionel, Jason, Daniel, can you comment on these regarding
what will UMD need in execbuf3 and what can be removed?

Thanks,
Niranjana

>Regards,
>
>Tvrtko
>
>>
>>        __u64 extensions;        /* currently only for 
>>DRM_I915_GEM_EXECBUFFER_EXT_TIMELINE_FENCES */
>>};
>>
>>With this, user can pass in batch addresses and count directly,
>>instead of as an extension (as this rfc series was proposing).
>>
>>I have removed many of the flags which were either legacy or not
>>applicable to VM_BIND mode.
>>I have also removed fence array support (execbuffer2.cliprects_ptr)
>>as we have timeline fence array support. Is that fine?
>>Do we still need FENCE_IN/FENCE_OUT/FENCE_SUBMIT support?
>>
>>Any thing else needs to be added or removed?
>>
>>Niranjana
>>
>>>Niranjana
>>>
>>>>-Daniel
>>>>-- 
>>>>Daniel Vetter
>>>>Software Engineer, Intel Corporation
>>>>http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [Intel-gfx] [RFC v3 1/3] drm/doc/rfc: VM_BIND feature design document
  2022-06-07 18:18                   ` Niranjana Vishwanathapura
  (?)
@ 2022-06-07 21:32                   ` Niranjana Vishwanathapura
  2022-06-08  7:33                     ` Tvrtko Ursulin
  -1 siblings, 1 reply; 121+ messages in thread
From: Niranjana Vishwanathapura @ 2022-06-07 21:32 UTC (permalink / raw)
  To: Jason Ekstrand
  Cc: Intel GFX, Maling list - DRI developers, Thomas Hellstrom,
	Chris Wilson, Daniel Vetter, Christian König

On Tue, Jun 07, 2022 at 11:18:11AM -0700, Niranjana Vishwanathapura wrote:
>On Tue, Jun 07, 2022 at 12:12:03PM -0500, Jason Ekstrand wrote:
>>  On Fri, Jun 3, 2022 at 6:52 PM Niranjana Vishwanathapura
>>  <niranjana.vishwanathapura@intel.com> wrote:
>>
>>    On Fri, Jun 03, 2022 at 10:20:25AM +0300, Lionel Landwerlin wrote:
>>    >   On 02/06/2022 23:35, Jason Ekstrand wrote:
>>    >
>>    >     On Thu, Jun 2, 2022 at 3:11 PM Niranjana Vishwanathapura
>>    >     <niranjana.vishwanathapura@intel.com> wrote:
>>    >
>>    >       On Wed, Jun 01, 2022 at 01:28:36PM -0700, Matthew Brost wrote:
>>    >       >On Wed, Jun 01, 2022 at 05:25:49PM +0300, Lionel Landwerlin
>>    wrote:
>>    >       >> On 17/05/2022 21:32, Niranjana Vishwanathapura wrote:
>>    >       >> > +VM_BIND/UNBIND ioctl will immediately start
>>    binding/unbinding
>>    >       the mapping in an
>>    >       >> > +async worker. The binding and unbinding will work like a
>>    special
>>    >       GPU engine.
>>    >       >> > +The binding and unbinding operations are serialized and
>>    will
>>    >       wait on specified
>>    >       >> > +input fences before the operation and will signal the
>>    output
>>    >       fences upon the
>>    >       >> > +completion of the operation. Due to serialization,
>>    completion of
>>    >       an operation
>>    >       >> > +will also indicate that all previous operations are also
>>    >       complete.
>>    >       >>
>>    >       >> I guess we should avoid saying "will immediately start
>>    >       binding/unbinding" if
>>    >       >> there are fences involved.
>>    >       >>
>>    >       >> And the fact that it's happening in an async worker seem to
>>    imply
>>    >       it's not
>>    >       >> immediate.
>>    >       >>
>>    >
>>    >       Ok, will fix.
>>    >       This was added because in earlier design binding was deferred
>>    until
>>    >       next execbuff.
>>    >       But now it is non-deferred (immediate in that sense). But yah,
>>    this is
>>    >       confusing
>>    >       and will fix it.
>>    >
>>    >       >>
>>    >       >> I have a question on the behavior of the bind operation when
>>    no
>>    >       input fence
>>    >       >> is provided. Let say I do :
>>    >       >>
>>    >       >> VM_BIND (out_fence=fence1)
>>    >       >>
>>    >       >> VM_BIND (out_fence=fence2)
>>    >       >>
>>    >       >> VM_BIND (out_fence=fence3)
>>    >       >>
>>    >       >>
>>    >       >> In what order are the fences going to be signaled?
>>    >       >>
>>    >       >> In the order of VM_BIND ioctls? Or out of order?
>>    >       >>
>>    >       >> Because you wrote "serialized I assume it's : in order
>>    >       >>
>>    >
>>    >       Yes, in the order of VM_BIND/UNBIND ioctls. Note that bind and
>>    unbind
>>    >       will use
>>    >       the same queue and hence are ordered.
>>    >
>>    >       >>
>>    >       >> One thing I didn't realize is that because we only get one
>>    >       "VM_BIND" engine,
>>    >       >> there is a disconnect from the Vulkan specification.
>>    >       >>
>>    >       >> In Vulkan VM_BIND operations are serialized but per engine.
>>    >       >>
>>    >       >> So you could have something like this :
>>    >       >>
>>    >       >> VM_BIND (engine=rcs0, in_fence=fence1, out_fence=fence2)
>>    >       >>
>>    >       >> VM_BIND (engine=ccs0, in_fence=fence3, out_fence=fence4)
>>    >       >>
>>    >       >>
>>    >       >> fence1 is not signaled
>>    >       >>
>>    >       >> fence3 is signaled
>>    >       >>
>>    >       >> So the second VM_BIND will proceed before the first VM_BIND.
>>    >       >>
>>    >       >>
>>    >       >> I guess we can deal with that scenario in userspace by doing
>>    the
>>    >       wait
>>    >       >> ourselves in one thread per engines.
>>    >       >>
>>    >       >> But then it makes the VM_BIND input fences useless.
>>    >       >>
>>    >       >>
>>    >       >> Daniel : what do you think? Should be rework this or just
>>    deal with
>>    >       wait
>>    >       >> fences in userspace?
>>    >       >>
>>    >       >
>>    >       >My opinion is rework this but make the ordering via an engine
>>    param
>>    >       optional.
>>    >       >
>>    >       >e.g. A VM can be configured so all binds are ordered within the
>>    VM
>>    >       >
>>    >       >e.g. A VM can be configured so all binds accept an engine
>>    argument
>>    >       (in
>>    >       >the case of the i915 likely this is a gem context handle) and
>>    binds
>>    >       >ordered with respect to that engine.
>>    >       >
>>    >       >This gives UMDs options as the later likely consumes more KMD
>>    >       resources
>>    >       >so if a different UMD can live with binds being ordered within
>>    the VM
>>    >       >they can use a mode consuming less resources.
>>    >       >
>>    >
>>    >       I think we need to be careful here if we are looking for some
>>    out of
>>    >       (submission) order completion of vm_bind/unbind.
>>    >       In-order completion means, in a batch of binds and unbinds to be
>>    >       completed in-order, user only needs to specify in-fence for the
>>    >       first bind/unbind call and the out-fence for the last
>>    bind/unbind
>>    >       call. Also, the VA released by an unbind call can be re-used by
>>    >       any subsequent bind call in that in-order batch.
>>    >
>>    >       These things will break if binding/unbinding were to be allowed
>>    to
>>    >       go out of order (of submission) and user need to be extra
>>    careful
>>    >       not to run into premature triggering of out-fence and bind
>>    failing
>>    >       as VA is still in use etc.
>>    >
>>    >       Also, VM_BIND binds the provided mapping on the specified
>>    address
>>    >       space
>>    >       (VM). So, the uapi is not engine/context specific.
>>    >
>>    >       We can however add a 'queue' to the uapi which can be one from
>>    the
>>    >       pre-defined queues,
>>    >       I915_VM_BIND_QUEUE_0
>>    >       I915_VM_BIND_QUEUE_1
>>    >       ...
>>    >       I915_VM_BIND_QUEUE_(N-1)
>>    >
>>    >       KMD will spawn an async work queue for each queue which will
>>    only
>>    >       bind the mappings on that queue in the order of submission.
>>    >       User can assign the queue to per engine or anything like that.
>>    >
>>    >       But again here, user need to be careful and not deadlock these
>>    >       queues with circular dependency of fences.
>>    >
>>    >       I prefer adding this later an as extension based on whether it
>>    >       is really helping with the implementation.
>>    >
>>    >     I can tell you right now that having everything on a single
>>    in-order
>>    >     queue will not get us the perf we want.  What vulkan really wants
>>    is one
>>    >     of two things:
>>    >      1. No implicit ordering of VM_BIND ops.  They just happen in
>>    whatever
>>    >     their dependencies are resolved and we ensure ordering ourselves
>>    by
>>    >     having a syncobj in the VkQueue.
>>    >      2. The ability to create multiple VM_BIND queues.  We need at
>>    least 2
>>    >     but I don't see why there needs to be a limit besides the limits
>>    the
>>    >     i915 API already has on the number of engines.  Vulkan could
>>    expose
>>    >     multiple sparse binding queues to the client if it's not
>>    arbitrarily
>>    >     limited.
>>
>>    Thanks Jason, Lionel.
>>
>>    Jason, what are you referring to when you say "limits the i915 API
>>    already
>>    has on the number of engines"? I am not sure if there is such an uapi
>>    today.
>>
>>  There's a limit of something like 64 total engines today based on the
>>  number of bits we can cram into the exec flags in execbuffer2.  I think
>>  someone had an extended version that allowed more but I ripped it out
>>  because no one was using it.  Of course, execbuffer3 might not have that
>>  problem at all.
>>
>
>Thanks Jason.
>Ok, I am not sure which exec flag that is, but yah, execbuffer3 probably
>will not have this limitation. So, we need to define a VM_BIND_MAX_QUEUE
>and somehow export it to user (I am thinking of embedding it in
>I915_PARAM_HAS_VM_BIND. bits[0]->HAS_VM_BIND, bits[1-3]->'n' meaning 2^n
>queues).

Ah, I think you are talking about I915_EXEC_RING_MASK (0x3f), which execbuf3
will also have. So, we can simply define in the vm_bind/unbind structures:

#define I915_VM_BIND_MAX_QUEUE   64
        __u32 queue;

I think that will keep things simple.
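
So, hypothetically (assuming the 'queue' field lands as sketched above),
binds would be ordered only against other binds on the same queue:

    struct drm_i915_gem_vm_bind bind = {
        .vm_id  = vm_id,       /* placeholders, as in the RFC examples */
        .handle = bo_handle,
        .start  = va,
        .length = size,
        .queue  = 1,           /* ordered only against other queue-1 binds */
    };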

Niranjana

>
>>    I am trying to see how many queues we need and don't want it to be
>>    arbitrarily
>>    large and unduely blow up memory usage and complexity in i915 driver.
>>
>>  I expect a Vulkan driver to use at most 2 in the vast majority of cases. I
>>  could imagine a client wanting to create more than 1 sparse queue in which
>>  case, it'll be N+1 but that's unlikely.  As far as complexity goes, once
>>  you allow two, I don't think the complexity is going up by allowing N.  As
>>  for memory usage, creating more queues means more memory.  That's a
>>  trade-off that userspace can make.  Again, the expected number here is 1
>>  or 2 in the vast majority of cases so I don't think you need to worry.
>
>Ok, will start with n=3 meaning 8 queues.
>That would require us create 8 workqueues.
>We can change 'n' later if required.
>
>Niranjana
>
>>
>>    >     Why?  Because Vulkan has two basic kind of bind operations and we
>>    don't
>>    >     want any dependencies between them:
>>    >      1. Immediate.  These happen right after BO creation or maybe as
>>    part of
>>    >     vkBindImageMemory() or VkBindBufferMemory().  These don't happen
>>    on a
>>    >     queue and we don't want them serialized with anything.  To
>>    synchronize
>>    >     with submit, we'll have a syncobj in the VkDevice which is
>>    signaled by
>>    >     all immediate bind operations and make submits wait on it.
>>    >      2. Queued (sparse): These happen on a VkQueue which may be the
>>    same as
>>    >     a render/compute queue or may be its own queue.  It's up to us
>>    what we
>>    >     want to advertise.  From the Vulkan API PoV, this is like any
>>    other
>>    >     queue.  Operations on it wait on and signal semaphores.  If we
>>    have a
>>    >     VM_BIND engine, we'd provide syncobjs to wait and signal just like
>>    we do
>>    >     in execbuf().
>>    >     The important thing is that we don't want one type of operation to
>>    block
>>    >     on the other.  If immediate binds are blocking on sparse binds,
>>    it's
>>    >     going to cause over-synchronization issues.
>>    >     In terms of the internal implementation, I know that there's going
>>    to be
>>    >     a lock on the VM and that we can't actually do these things in
>>    >     parallel.  That's fine.  Once the dma_fences have signaled and
>>    we're
>>
>>    Thats correct. It is like a single VM_BIND engine with multiple queues
>>    feeding to it.
>>
>>  Right.  As long as the queues themselves are independent and can block on
>>  dma_fences without holding up other queues, I think we're fine.
>>
>>    >     unblocked to do the bind operation, I don't care if there's a bit
>>    of
>>    >     synchronization due to locking.  That's expected.  What we can't
>>    afford
>>    >     to have is an immediate bind operation suddenly blocking on a
>>    sparse
>>    >     operation which is blocked on a compute job that's going to run
>>    for
>>    >     another 5ms.
>>
>>    As the VM_BIND queue is per VM, VM_BIND on one VM doesn't block the
>>    VM_BIND
>>    on other VMs. I am not sure about usecases here, but just wanted to
>>    clarify.
>>
>>  Yes, that's what I would expect.
>>  --Jason
>>
>>    Niranjana
>>
>>    >     For reference, Windows solves this by allowing arbitrarily many
>>    paging
>>    >     queues (what they call a VM_BIND engine/queue).  That design works
>>    >     pretty well and solves the problems in question.  Again, we could
>>    just
>>    >     make everything out-of-order and require using syncobjs to order
>>    things
>>    >     as userspace wants. That'd be fine too.
>>    >     One more note while I'm here: danvet said something on IRC about
>>    VM_BIND
>>    >     queues waiting for syncobjs to materialize.  We don't really
>>    want/need
>>    >     this.  We already have all the machinery in userspace to handle
>>    >     wait-before-signal and waiting for syncobj fences to materialize
>>    and
>>    >     that machinery is on by default.  It would actually take MORE work
>>    in
>>    >     Mesa to turn it off and take advantage of the kernel being able to
>>    wait
>>    >     for syncobjs to materialize.  Also, getting that right is
>>    ridiculously
>>    >     hard and I really don't want to get it wrong in kernel space.
>>    >     When we
>>    >     do memory fences, wait-before-signal will be a thing.  We don't
>>    need to
>>    >     try and make it a thing for syncobj.
>>    >     --Jason
>>    >
>>    >   Thanks Jason,
>>    >
>>    >   I missed the bit in the Vulkan spec that we're allowed to have a
>>    sparse
>>    >   queue that does not implement either graphics or compute operations
>>    :
>>    >
>>    >     "While some implementations may include
>>    VK_QUEUE_SPARSE_BINDING_BIT
>>    >     support in queue families that also include
>>    >
>>    >      graphics and compute support, other implementations may only
>>    expose a
>>    >     VK_QUEUE_SPARSE_BINDING_BIT-only queue
>>    >
>>    >      family."
>>    >
>>    >   So it can all be all a vm_bind engine that just does bind/unbind
>>    >   operations.
>>    >
>>    >   But yes we need another engine for the immediate/non-sparse
>>    operations.
>>    >
>>    >   -Lionel
>>    >
>>    >         >
>>    >       Daniel, any thoughts?
>>    >
>>    >       Niranjana
>>    >
>>    >       >Matt
>>    >       >
>>    >       >>
>>    >       >> Sorry I noticed this late.
>>    >       >>
>>    >       >>
>>    >       >> -Lionel
>>    >       >>
>>    >       >>

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [Intel-gfx] [RFC v3 3/3] drm/doc/rfc: VM_BIND uapi definition
  2022-06-03  6:53             ` Niranjana Vishwanathapura
  2022-06-07 10:42               ` Tvrtko Ursulin
@ 2022-06-08  6:40               ` Lionel Landwerlin
  2022-06-08  6:43                 ` Lionel Landwerlin
  2022-06-08  8:36                 ` Tvrtko Ursulin
  2022-06-08  7:12               ` Lionel Landwerlin
  2 siblings, 2 replies; 121+ messages in thread
From: Lionel Landwerlin @ 2022-06-08  6:40 UTC (permalink / raw)
  To: Niranjana Vishwanathapura, Daniel Vetter
  Cc: Zanoni, Paulo R, intel-gfx, dri-devel, Hellstrom, Thomas, Wilson,
	Chris P, Vetter, Daniel, christian.koenig

On 03/06/2022 09:53, Niranjana Vishwanathapura wrote:
> On Wed, Jun 01, 2022 at 10:08:35PM -0700, Niranjana Vishwanathapura 
> wrote:
>> On Wed, Jun 01, 2022 at 11:27:17AM +0200, Daniel Vetter wrote:
>>> On Wed, 1 Jun 2022 at 11:03, Dave Airlie <airlied@gmail.com> wrote:
>>>>
>>>> On Tue, 24 May 2022 at 05:20, Niranjana Vishwanathapura
>>>> <niranjana.vishwanathapura@intel.com> wrote:
>>>>>
>>>>> On Thu, May 19, 2022 at 04:07:30PM -0700, Zanoni, Paulo R wrote:
>>>>> >On Tue, 2022-05-17 at 11:32 -0700, Niranjana Vishwanathapura wrote:
>>>>> >> VM_BIND and related uapi definitions
>>>>> >>
>>>>> >> v2: Ensure proper kernel-doc formatting with cross references.
>>>>> >>     Also add new uapi and documentation as per review comments
>>>>> >>     from Daniel.
>>>>> >>
>>>>> >> Signed-off-by: Niranjana Vishwanathapura 
>>>>> <niranjana.vishwanathapura@intel.com>
>>>>> >> ---
>>>>> >>  Documentation/gpu/rfc/i915_vm_bind.h | 399 
>>>>> +++++++++++++++++++++++++++
>>>>> >>  1 file changed, 399 insertions(+)
>>>>> >>  create mode 100644 Documentation/gpu/rfc/i915_vm_bind.h
>>>>> >>
>>>>> >> diff --git a/Documentation/gpu/rfc/i915_vm_bind.h 
>>>>> b/Documentation/gpu/rfc/i915_vm_bind.h
>>>>> >> new file mode 100644
>>>>> >> index 000000000000..589c0a009107
>>>>> >> --- /dev/null
>>>>> >> +++ b/Documentation/gpu/rfc/i915_vm_bind.h
>>>>> >> @@ -0,0 +1,399 @@
>>>>> >> +/* SPDX-License-Identifier: MIT */
>>>>> >> +/*
>>>>> >> + * Copyright © 2022 Intel Corporation
>>>>> >> + */
>>>>> >> +
>>>>> >> +/**
>>>>> >> + * DOC: I915_PARAM_HAS_VM_BIND
>>>>> >> + *
>>>>> >> + * VM_BIND feature availability.
>>>>> >> + * See typedef drm_i915_getparam_t param.
>>>>> >> + */
>>>>> >> +#define I915_PARAM_HAS_VM_BIND               57
>>>>> >> +
>>>>> >> +/**
>>>>> >> + * DOC: I915_VM_CREATE_FLAGS_USE_VM_BIND
>>>>> >> + *
>>>>> >> + * Flag to opt-in for VM_BIND mode of binding during VM creation.
>>>>> >> + * See struct drm_i915_gem_vm_control flags.
>>>>> >> + *
>>>>> >> + * A VM in VM_BIND mode will not support the older execbuff 
>>>>> mode of binding.
>>>>> >> + * In VM_BIND mode, execbuff ioctl will not accept any 
>>>>> execlist (ie., the
>>>>> >> + * &drm_i915_gem_execbuffer2.buffer_count must be 0).
>>>>> >> + * Also, &drm_i915_gem_execbuffer2.batch_start_offset and
>>>>> >> + * &drm_i915_gem_execbuffer2.batch_len must be 0.
>>>>> >> + * DRM_I915_GEM_EXECBUFFER_EXT_BATCH_ADDRESSES extension must 
>>>>> be provided
>>>>> >> + * to pass in the batch buffer addresses.
>>>>> >> + *
>>>>> >> + * Additionally, I915_EXEC_NO_RELOC, I915_EXEC_HANDLE_LUT and
>>>>> >> + * I915_EXEC_BATCH_FIRST of &drm_i915_gem_execbuffer2.flags 
>>>>> must be 0
>>>>> >> + * (not used) in VM_BIND mode. I915_EXEC_USE_EXTENSIONS flag 
>>>>> must always be
>>>>> >> + * set (See struct drm_i915_gem_execbuffer_ext_batch_addresses).
>>>>> >> + * The buffers_ptr, buffer_count, batch_start_offset and 
>>>>> batch_len fields
>>>>> >> + * of struct drm_i915_gem_execbuffer2 are also not used and 
>>>>> must be 0.
>>>>> >> + */
>>>>> >
>>>>> >From that description, it seems we have:
>>>>> >
>>>>> >struct drm_i915_gem_execbuffer2 {
>>>>> >        __u64 buffers_ptr;              -> must be 0 (new)
>>>>> >        __u32 buffer_count;             -> must be 0 (new)
>>>>> >        __u32 batch_start_offset;       -> must be 0 (new)
>>>>> >        __u32 batch_len;                -> must be 0 (new)
>>>>> >        __u32 DR1;                      -> must be 0 (old)
>>>>> >        __u32 DR4;                      -> must be 0 (old)
>>>>> >        __u32 num_cliprects; (fences)   -> must be 0 since using 
>>>>> extensions
>>>>> >        __u64 cliprects_ptr; (fences, extensions) -> contains an 
>>>>> actual pointer!
>>>>> >        __u64 flags;                    -> some flags must be 0 
>>>>> (new)
>>>>> >        __u64 rsvd1; (context info)     -> repurposed field (old)
>>>>> >        __u64 rsvd2;                    -> unused
>>>>> >};
>>>>> >
>>>>> >Based on that, why can't we just get drm_i915_gem_execbuffer3 
>>>>> instead
>>>>> >of adding even more complexity to an already abused interface? While
>>>>> >the Vulkan-like extension thing is really nice, I don't think what
>>>>> >we're doing here is extending the ioctl usage, we're completely
>>>>> >changing how the base struct should be interpreted based on how 
>>>>> the VM
>>>>> >was created (which is an entirely different ioctl).
>>>>> >
>>>>> >From Rusty Russel's API Design grading, drm_i915_gem_execbuffer2 is
>>>>> >already at -6 without these changes. I think after vm_bind we'll 
>>>>> need
>>>>> >to create a -11 entry just to deal with this ioctl.
>>>>> >
>>>>>
>>>>> The only change here is removing the execlist support for VM_BIND
>>>>> mode (other than natural extensions).
>>>>> Adding a new execbuffer3 was considered, but I think we need to be 
>>>>> careful
>>>>> with that as that goes beyond the VM_BIND support, including any 
>>>>> future
>>>>> requirements (as we don't want an execbuffer4 after VM_BIND).
>>>>
>>>> Why not? it's not like adding extensions here is really that different
>>>> than adding new ioctls.
>>>>
>>>> I definitely think this deserves an execbuffer3 without even
>>>> considering future requirements. Just  to burn down the old
>>>> requirements and pointless fields.
>>>>
>>>> Make execbuffer3 be vm bind only, no relocs, no legacy bits, leave the
>>>> older sw on execbuf2 for ever.
>>>
>>> I guess another point in favour of execbuf3 would be that it's less
>>> midlayer. If we share the entry point then there's quite a few vfuncs
>>> needed to cleanly split out the vm_bind paths from the legacy
>>> reloc/softping paths.
>>>
>>> If we invert this and do execbuf3, then there's the existing ioctl
>>> vfunc, and then we share code (where it even makes sense, probably
>>> request setup/submit need to be shared, anything else is probably
>>> cleaner to just copypaste) with the usual helper approach.
>>>
>>> Also that would guarantee that really none of the old concepts like
>>> i915_active on the vma or vma open counts and all that stuff leaks
>>> into the new vm_bind execbuf.
>>>
>>> Finally I also think that copypasting would make backporting easier,
>>> or at least more flexible, since it should make it easier to have the
>>> upstream vm_bind co-exist with all the other things we have. Without
>>> huge amounts of conflicts (or at least much less) that pushing a pile
>>> of vfuncs into the existing code would cause.
>>>
>>> So maybe we should do this?
>>
>> Thanks Dave, Daniel.
>> There are a few things that will be common between execbuf2 and
>> execbuf3, like request setup/submit (as you said), fence handling 
>> (timeline fences, fence array, composite fences), engine selection,
>> etc. Also, many of the 'flags' will be there in execbuf3 also (but
>> bit position will differ).
>> But I guess these should be fine as the suggestion here is to
>> copy-paste the execbuff code and having a shared code where possible.
>> Besides, we can stop supporting some older features in execbuff3
>> (like fence array in favor of newer timeline fences), which will
>> further reduce common code.
>>
>> Ok, I will update this series by adding execbuf3 and send out soon.
>>
>
> Does this sound reasonable?


Thanks for proposing this. Some comments below.


>
> struct drm_i915_gem_execbuffer3 {
>        __u32 ctx_id;        /* previously execbuffer2.rsvd1 */
>
>        __u32 batch_count;
>        __u64 batch_addr_ptr;    /* Pointer to an array of batch gpu 
> virtual addresses */
>
>        __u64 flags;
> #define I915_EXEC3_RING_MASK              (0x3f)
> #define I915_EXEC3_DEFAULT                (0<<0)
> #define I915_EXEC3_RENDER                 (1<<0)
> #define I915_EXEC3_BSD                    (2<<0)
> #define I915_EXEC3_BLT                    (3<<0)
> #define I915_EXEC3_VEBOX                  (4<<0)


Shouldn't we use the new engine selection uAPI instead?

We can already create an engine map with I915_CONTEXT_PARAM_ENGINES in 
drm_i915_gem_context_create_ext_setparam.

And you can also create virtual engines with the same extension.

It feels like this could be a single u32 with the engine index (in the 
context engine map).
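
For reference, a rough sketch with the existing engine-map uapi (this
is today's context creation interface, not part of this RFC; error
handling omitted):

    I915_DEFINE_CONTEXT_PARAM_ENGINES(engines, 2) = {
        .engines = {
            { I915_ENGINE_CLASS_RENDER, 0 },
            { I915_ENGINE_CLASS_COPY, 0 },
        },
    };

    struct drm_i915_gem_context_create_ext_setparam p_engines = {
        .base  = { .name = I915_CONTEXT_CREATE_EXT_SETPARAM },
        .param = {
            .param = I915_CONTEXT_PARAM_ENGINES,
            .value = (uintptr_t)&engines,
            .size  = sizeof(engines),
        },
    };

    struct drm_i915_gem_context_create_ext create = {
        .flags      = I915_CONTEXT_CREATE_FLAGS_USE_EXTENSIONS,
        .extensions = (uintptr_t)&p_engines,
    };

    ioctl(fd, DRM_IOCTL_I915_GEM_CONTEXT_CREATE_EXT, &create);

    /* execbuf3 could then take just an index into this engine map. */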


>
> #define I915_EXEC3_SECURE               (1<<6)
> #define I915_EXEC3_IS_PINNED            (1<<7)


What's the meaning of PINNED?


>
> #define I915_EXEC3_BSD_SHIFT     (8)
> #define I915_EXEC3_BSD_MASK      (3 << I915_EXEC3_BSD_SHIFT)
> #define I915_EXEC3_BSD_DEFAULT   (0 << I915_EXEC3_BSD_SHIFT)
> #define I915_EXEC3_BSD_RING1     (1 << I915_EXEC3_BSD_SHIFT)
> #define I915_EXEC3_BSD_RING2     (2 << I915_EXEC3_BSD_SHIFT)
>
> #define I915_EXEC3_FENCE_IN             (1<<10)
> #define I915_EXEC3_FENCE_OUT            (1<<11)


For Mesa, as soon as we have DRM_I915_GEM_EXECBUFFER_EXT_TIMELINE_FENCES 
support, we only use that.

So there isn't much point for FENCE_IN/OUT.

Maybe check with other UMDs?


> #define I915_EXEC3_FENCE_SUBMIT         (1<<12)


What's FENCE_SUBMIT?


>
>        __u64 in_out_fence;        /* previously execbuffer2.rsvd2 */
>
>        __u64 extensions;        /* currently only for 
> DRM_I915_GEM_EXECBUFFER_EXT_TIMELINE_FENCES */
> };
>
> With this, user can pass in batch addresses and count directly,
> instead of as an extension (as this rfc series was proposing).
>
> I have removed many of the flags which were either legacy or not
> applicable to VM_BIND mode.
> I have also removed fence array support (execbuffer2.cliprects_ptr)
> as we have timeline fence array support. Is that fine?
> Do we still need FENCE_IN/FENCE_OUT/FENCE_SUBMIT support?
>
> Any thing else needs to be added or removed?
>
> Niranjana
>
>> Niranjana
>>
>>> -Daniel
>>> -- 
>>> Daniel Vetter
>>> Software Engineer, Intel Corporation
>>> http://blog.ffwll.ch



^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [Intel-gfx] [RFC v3 3/3] drm/doc/rfc: VM_BIND uapi definition
  2022-06-08  6:40               ` Lionel Landwerlin
@ 2022-06-08  6:43                 ` Lionel Landwerlin
  2022-06-08  8:36                 ` Tvrtko Ursulin
  1 sibling, 0 replies; 121+ messages in thread
From: Lionel Landwerlin @ 2022-06-08  6:43 UTC (permalink / raw)
  To: Niranjana Vishwanathapura, Daniel Vetter
  Cc: Zanoni, Paulo R, intel-gfx, dri-devel, Hellstrom, Thomas, Wilson,
	Chris P, Vetter, Daniel, christian.koenig

On 08/06/2022 09:40, Lionel Landwerlin wrote:
> On 03/06/2022 09:53, Niranjana Vishwanathapura wrote:
>> On Wed, Jun 01, 2022 at 10:08:35PM -0700, Niranjana Vishwanathapura 
>> wrote:
>>> On Wed, Jun 01, 2022 at 11:27:17AM +0200, Daniel Vetter wrote:
>>>> On Wed, 1 Jun 2022 at 11:03, Dave Airlie <airlied@gmail.com> wrote:
>>>>>
>>>>> On Tue, 24 May 2022 at 05:20, Niranjana Vishwanathapura
>>>>> <niranjana.vishwanathapura@intel.com> wrote:
>>>>>>
>>>>>> On Thu, May 19, 2022 at 04:07:30PM -0700, Zanoni, Paulo R wrote:
>>>>>> >On Tue, 2022-05-17 at 11:32 -0700, Niranjana Vishwanathapura wrote:
>>>>>> >> VM_BIND and related uapi definitions
>>>>>> >>
>>>>>> >> v2: Ensure proper kernel-doc formatting with cross references.
>>>>>> >>     Also add new uapi and documentation as per review comments
>>>>>> >>     from Daniel.
>>>>>> >>
>>>>>> >> Signed-off-by: Niranjana Vishwanathapura 
>>>>>> <niranjana.vishwanathapura@intel.com>
>>>>>> >> ---
>>>>>> >>  Documentation/gpu/rfc/i915_vm_bind.h | 399 
>>>>>> +++++++++++++++++++++++++++
>>>>>> >>  1 file changed, 399 insertions(+)
>>>>>> >>  create mode 100644 Documentation/gpu/rfc/i915_vm_bind.h
>>>>>> >>
>>>>>> >> diff --git a/Documentation/gpu/rfc/i915_vm_bind.h 
>>>>>> b/Documentation/gpu/rfc/i915_vm_bind.h
>>>>>> >> new file mode 100644
>>>>>> >> index 000000000000..589c0a009107
>>>>>> >> --- /dev/null
>>>>>> >> +++ b/Documentation/gpu/rfc/i915_vm_bind.h
>>>>>> >> @@ -0,0 +1,399 @@
>>>>>> >> +/* SPDX-License-Identifier: MIT */
>>>>>> >> +/*
>>>>>> >> + * Copyright © 2022 Intel Corporation
>>>>>> >> + */
>>>>>> >> +
>>>>>> >> +/**
>>>>>> >> + * DOC: I915_PARAM_HAS_VM_BIND
>>>>>> >> + *
>>>>>> >> + * VM_BIND feature availability.
>>>>>> >> + * See typedef drm_i915_getparam_t param.
>>>>>> >> + */
>>>>>> >> +#define I915_PARAM_HAS_VM_BIND 57
>>>>>> >> +
>>>>>> >> +/**
>>>>>> >> + * DOC: I915_VM_CREATE_FLAGS_USE_VM_BIND
>>>>>> >> + *
>>>>>> >> + * Flag to opt-in for VM_BIND mode of binding during VM 
>>>>>> creation.
>>>>>> >> + * See struct drm_i915_gem_vm_control flags.
>>>>>> >> + *
>>>>>> >> + * A VM in VM_BIND mode will not support the older execbuff 
>>>>>> mode of binding.
>>>>>> >> + * In VM_BIND mode, execbuff ioctl will not accept any 
>>>>>> execlist (ie., the
>>>>>> >> + * &drm_i915_gem_execbuffer2.buffer_count must be 0).
>>>>>> >> + * Also, &drm_i915_gem_execbuffer2.batch_start_offset and
>>>>>> >> + * &drm_i915_gem_execbuffer2.batch_len must be 0.
>>>>>> >> + * DRM_I915_GEM_EXECBUFFER_EXT_BATCH_ADDRESSES extension must 
>>>>>> be provided
>>>>>> >> + * to pass in the batch buffer addresses.
>>>>>> >> + *
>>>>>> >> + * Additionally, I915_EXEC_NO_RELOC, I915_EXEC_HANDLE_LUT and
>>>>>> >> + * I915_EXEC_BATCH_FIRST of &drm_i915_gem_execbuffer2.flags 
>>>>>> must be 0
>>>>>> >> + * (not used) in VM_BIND mode. I915_EXEC_USE_EXTENSIONS flag 
>>>>>> must always be
>>>>>> >> + * set (See struct drm_i915_gem_execbuffer_ext_batch_addresses).
>>>>>> >> + * The buffers_ptr, buffer_count, batch_start_offset and 
>>>>>> batch_len fields
>>>>>> >> + * of struct drm_i915_gem_execbuffer2 are also not used and 
>>>>>> must be 0.
>>>>>> >> + */
>>>>>> >
>>>>>> >From that description, it seems we have:
>>>>>> >
>>>>>> >struct drm_i915_gem_execbuffer2 {
>>>>>> >        __u64 buffers_ptr;              -> must be 0 (new)
>>>>>> >        __u32 buffer_count;             -> must be 0 (new)
>>>>>> >        __u32 batch_start_offset;       -> must be 0 (new)
>>>>>> >        __u32 batch_len;                -> must be 0 (new)
>>>>>> >        __u32 DR1;                      -> must be 0 (old)
>>>>>> >        __u32 DR4;                      -> must be 0 (old)
>>>>>> >        __u32 num_cliprects; (fences)   -> must be 0 since using 
>>>>>> extensions
>>>>>> >        __u64 cliprects_ptr; (fences, extensions) -> contains an 
>>>>>> actual pointer!
>>>>>> >        __u64 flags;                    -> some flags must be 0 
>>>>>> (new)
>>>>>> >        __u64 rsvd1; (context info)     -> repurposed field (old)
>>>>>> >        __u64 rsvd2;                    -> unused
>>>>>> >};
>>>>>> >
>>>>>> >Based on that, why can't we just get drm_i915_gem_execbuffer3 
>>>>>> instead
>>>>>> >of adding even more complexity to an already abused interface? 
>>>>>> While
>>>>>> >the Vulkan-like extension thing is really nice, I don't think what
>>>>>> >we're doing here is extending the ioctl usage, we're completely
>>>>>> >changing how the base struct should be interpreted based on how 
>>>>>> the VM
>>>>>> >was created (which is an entirely different ioctl).
>>>>>> >
>>>>>> >From Rusty Russel's API Design grading, drm_i915_gem_execbuffer2 is
>>>>>> >already at -6 without these changes. I think after vm_bind we'll 
>>>>>> need
>>>>>> >to create a -11 entry just to deal with this ioctl.
>>>>>> >
>>>>>>
>>>>>> The only change here is removing the execlist support for VM_BIND
>>>>>> mode (other than natural extensions).
>>>>>> Adding a new execbuffer3 was considered, but I think we need to 
>>>>>> be careful
>>>>>> with that as that goes beyond the VM_BIND support, including any 
>>>>>> future
>>>>>> requirements (as we don't want an execbuffer4 after VM_BIND).
>>>>>
>>>>> Why not? it's not like adding extensions here is really that 
>>>>> different
>>>>> than adding new ioctls.
>>>>>
>>>>> I definitely think this deserves an execbuffer3 without even
>>>>> considering future requirements. Just  to burn down the old
>>>>> requirements and pointless fields.
>>>>>
>>>>> Make execbuffer3 be vm bind only, no relocs, no legacy bits, leave 
>>>>> the
>>>>> older sw on execbuf2 for ever.
>>>>
>>>> I guess another point in favour of execbuf3 would be that it's less
>>>> midlayer. If we share the entry point then there's quite a few vfuncs
>>>> needed to cleanly split out the vm_bind paths from the legacy
>>>> reloc/softpin paths.
>>>>
>>>> If we invert this and do execbuf3, then there's the existing ioctl
>>>> vfunc, and then we share code (where it even makes sense, probably
>>>> request setup/submit need to be shared, anything else is probably
>>>> cleaner to just copypaste) with the usual helper approach.
>>>>
>>>> Also that would guarantee that really none of the old concepts like
>>>> i915_active on the vma or vma open counts and all that stuff leaks
>>>> into the new vm_bind execbuf.
>>>>
>>>> Finally I also think that copypasting would make backporting easier,
>>>> or at least more flexible, since it should make it easier to have the
>>>> upstream vm_bind co-exist with all the other things we have. Without
>>>> huge amounts of conflicts (or at least much less) that pushing a pile
>>>> of vfuncs into the existing code would cause.
>>>>
>>>> So maybe we should do this?
>>>
>>> Thanks Dave, Daniel.
>>> There are a few things that will be common between execbuf2 and
>>> execbuf3, like request setup/submit (as you said), fence handling 
>>> (timeline fences, fence array, composite fences), engine selection,
>>> etc. Also, many of the 'flags' will be there in execbuf3 also (but
>>> bit position will differ).
>>> But I guess these should be fine as the suggestion here is to
>>> copy-paste the execbuf code and have shared code where possible.
>>> Besides, we can stop supporting some older features in execbuf3
>>> (like fence array in favor of newer timeline fences), which will
>>> further reduce common code.
>>>
>>> Ok, I will update this series by adding execbuf3 and send it out soon.
>>>
>>
>> Does this sound reasonable?
>
>
> Thanks for proposing this. Some comments below.
>
>
>>
>> struct drm_i915_gem_execbuffer3 {
>>        __u32 ctx_id;        /* previously execbuffer2.rsvd1 */
>>
>>        __u32 batch_count;
>>        __u64 batch_addr_ptr;    /* Pointer to an array of batch gpu 
>> virtual addresses */
>>
>>        __u64 flags;
>> #define I915_EXEC3_RING_MASK              (0x3f)
>> #define I915_EXEC3_DEFAULT                (0<<0)
>> #define I915_EXEC3_RENDER                 (1<<0)
>> #define I915_EXEC3_BSD                    (2<<0)
>> #define I915_EXEC3_BLT                    (3<<0)
>> #define I915_EXEC3_VEBOX                  (4<<0)
>
>
> Shouldn't we use the new engine selection uAPI instead?
>
> We can already create an engine map with I915_CONTEXT_PARAM_ENGINES in 
> drm_i915_gem_context_create_ext_setparam.
>
> And you can also create virtual engines with the same extension.
>
> It feels like this could be a single u32 with the engine index (in the 
> context engine map).
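
To make that concrete, here is a rough, untested sketch of the UMD side
with the existing engine-map uapi; the single engine-index field in
execbuf3 at the end is an assumption, not settled uapi:

    /* Sketch; assumes <drm/i915_drm.h> and <sys/ioctl.h>.
     * Create a context whose engine map is: index 0 = rcs0, index 1 = vcs0.
     */
    I915_DEFINE_CONTEXT_PARAM_ENGINES(engines, 2) = {
            .engines = {
                    { I915_ENGINE_CLASS_RENDER, 0 },
                    { I915_ENGINE_CLASS_VIDEO, 0 },
            },
    };
    struct drm_i915_gem_context_create_ext_setparam p_engines = {
            .base = { .name = I915_CONTEXT_CREATE_EXT_SETPARAM },
            .param = {
                    .param = I915_CONTEXT_PARAM_ENGINES,
                    .value = (uintptr_t)&engines,
                    .size = sizeof(engines),
            },
    };
    struct drm_i915_gem_context_create_ext create = {
            .flags = I915_CONTEXT_CREATE_FLAGS_USE_EXTENSIONS,
            .extensions = (uintptr_t)&p_engines,
    };
    ioctl(fd, DRM_IOCTL_I915_GEM_CONTEXT_CREATE_EXT, &create);
    /* execbuf3 could then carry e.g. '__u32 engine_idx' selecting index 1
     * (vcs0) from the map above, instead of the legacy ring flags.
     */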
>
>
>>
>> #define I915_EXEC3_SECURE               (1<<6)
>> #define I915_EXEC3_IS_PINNED            (1<<7)
>
>
> What's the meaning of PINNED?
>
>
>>
>> #define I915_EXEC3_BSD_SHIFT     (8)
>> #define I915_EXEC3_BSD_MASK      (3 << I915_EXEC3_BSD_SHIFT)
>> #define I915_EXEC3_BSD_DEFAULT   (0 << I915_EXEC3_BSD_SHIFT)
>> #define I915_EXEC3_BSD_RING1     (1 << I915_EXEC3_BSD_SHIFT)
>> #define I915_EXEC3_BSD_RING2     (2 << I915_EXEC3_BSD_SHIFT)
>>
>> #define I915_EXEC3_FENCE_IN             (1<<10)
>> #define I915_EXEC3_FENCE_OUT            (1<<11)
>
>
> For Mesa, as soon as we have 
> DRM_I915_GEM_EXECBUFFER_EXT_TIMELINE_FENCES support, we only use that.
>
> So there isn't much point for FENCE_IN/OUT.
>
> Maybe check with other UMDs?


Correcting myself a bit here:

     - iris uses I915_EXEC_FENCE_ARRAY

     - anv uses I915_EXEC_FENCE_ARRAY or 
DRM_I915_GEM_EXECBUFFER_EXT_TIMELINE_FENCES


In either case we could easily switch to 
DRM_I915_GEM_EXECBUFFER_EXT_TIMELINE_FENCES all the time.
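
(For illustration, with the proposed struct that would reduce to the
sketch below. drm_i915_gem_execbuffer3 and its ioctl are hypothetical
at this point; the timeline-fences extension is existing uapi.)

    /* Signal point 1 on a timeline syncobj when the batch completes. */
    struct drm_i915_gem_exec_fence sig = {
            .handle = syncobj_handle,
            .flags = I915_EXEC_FENCE_SIGNAL,
    };
    __u64 point = 1;
    struct drm_i915_gem_execbuffer_ext_timeline_fences fences = {
            .base = { .name = DRM_I915_GEM_EXECBUFFER_EXT_TIMELINE_FENCES },
            .fence_count = 1,
            .handles_ptr = (uintptr_t)&sig,
            .values_ptr = (uintptr_t)&point,
    };
    __u64 batch = batch_gpu_va;               /* address VM_BIND'ed beforehand */
    struct drm_i915_gem_execbuffer3 eb = {    /* hypothetical struct */
            .ctx_id = ctx_id,
            .batch_count = 1,
            .batch_addr_ptr = (uintptr_t)&batch,
            .extensions = (uintptr_t)&fences,
    };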


>
>
>> #define I915_EXEC3_FENCE_SUBMIT (1<<12)
>
>
> What's FENCE_SUBMIT?
>
>
>>
>>        __u64 in_out_fence;        /* previously execbuffer2.rsvd2 */
>>
>>        __u64 extensions;        /* currently only for 
>> DRM_I915_GEM_EXECBUFFER_EXT_TIMELINE_FENCES */
>> };
>>
>> With this, user can pass in batch addresses and count directly,
>> instead of as an extension (as this rfc series was proposing).
>>
>> I have removed many of the flags which were either legacy or not
>> applicable to VM_BIND mode.
>> I have also removed fence array support (execbuffer2.cliprects_ptr)
>> as we have timeline fence array support. Is that fine?
>> Do we still need FENCE_IN/FENCE_OUT/FENCE_SUBMIT support?
>>
>> Anything else that needs to be added or removed?
>>
>> Niranjana
>>
>>> Niranjana
>>>
>>>> -Daniel
>>>> -- 
>>>> Daniel Vetter
>>>> Software Engineer, Intel Corporation
>>>> http://blog.ffwll.ch
>
>


^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [Intel-gfx] [RFC v3 3/3] drm/doc/rfc: VM_BIND uapi definition
  2022-06-03  6:53             ` Niranjana Vishwanathapura
  2022-06-07 10:42               ` Tvrtko Ursulin
  2022-06-08  6:40               ` Lionel Landwerlin
@ 2022-06-08  7:12               ` Lionel Landwerlin
  2022-06-08 21:24                   ` Matthew Brost
  2 siblings, 1 reply; 121+ messages in thread
From: Lionel Landwerlin @ 2022-06-08  7:12 UTC (permalink / raw)
  To: Niranjana Vishwanathapura, Daniel Vetter
  Cc: Zanoni, Paulo R, intel-gfx, dri-devel, Hellstrom, Thomas, Wilson,
	Chris P, Vetter, Daniel, christian.koenig

On 03/06/2022 09:53, Niranjana Vishwanathapura wrote:
> On Wed, Jun 01, 2022 at 10:08:35PM -0700, Niranjana Vishwanathapura 
> wrote:
>> On Wed, Jun 01, 2022 at 11:27:17AM +0200, Daniel Vetter wrote:
>>> On Wed, 1 Jun 2022 at 11:03, Dave Airlie <airlied@gmail.com> wrote:
>>>>
>>>> On Tue, 24 May 2022 at 05:20, Niranjana Vishwanathapura
>>>> <niranjana.vishwanathapura@intel.com> wrote:
>>>>>
>>>>> On Thu, May 19, 2022 at 04:07:30PM -0700, Zanoni, Paulo R wrote:
>>>>> >On Tue, 2022-05-17 at 11:32 -0700, Niranjana Vishwanathapura wrote:
>>>>> >> VM_BIND and related uapi definitions
>>>>> >>
>>>>> >> v2: Ensure proper kernel-doc formatting with cross references.
>>>>> >>     Also add new uapi and documentation as per review comments
>>>>> >>     from Daniel.
>>>>> >>
>>>>> >> Signed-off-by: Niranjana Vishwanathapura 
>>>>> <niranjana.vishwanathapura@intel.com>
>>>>> >> ---
>>>>> >>  Documentation/gpu/rfc/i915_vm_bind.h | 399 
>>>>> +++++++++++++++++++++++++++
>>>>> >>  1 file changed, 399 insertions(+)
>>>>> >>  create mode 100644 Documentation/gpu/rfc/i915_vm_bind.h
>>>>> >>
>>>>> >> diff --git a/Documentation/gpu/rfc/i915_vm_bind.h 
>>>>> b/Documentation/gpu/rfc/i915_vm_bind.h
>>>>> >> new file mode 100644
>>>>> >> index 000000000000..589c0a009107
>>>>> >> --- /dev/null
>>>>> >> +++ b/Documentation/gpu/rfc/i915_vm_bind.h
>>>>> >> @@ -0,0 +1,399 @@
>>>>> >> +/* SPDX-License-Identifier: MIT */
>>>>> >> +/*
>>>>> >> + * Copyright © 2022 Intel Corporation
>>>>> >> + */
>>>>> >> +
>>>>> >> +/**
>>>>> >> + * DOC: I915_PARAM_HAS_VM_BIND
>>>>> >> + *
>>>>> >> + * VM_BIND feature availability.
>>>>> >> + * See typedef drm_i915_getparam_t param.
>>>>> >> + */
>>>>> >> +#define I915_PARAM_HAS_VM_BIND               57
>>>>> >> +
>>>>> >> +/**
>>>>> >> + * DOC: I915_VM_CREATE_FLAGS_USE_VM_BIND
>>>>> >> + *
>>>>> >> + * Flag to opt-in for VM_BIND mode of binding during VM creation.
>>>>> >> + * See struct drm_i915_gem_vm_control flags.
>>>>> >> + *
>>>>> >> + * A VM in VM_BIND mode will not support the older execbuff 
>>>>> mode of binding.
>>>>> >> + * In VM_BIND mode, execbuff ioctl will not accept any 
>>>>> execlist (ie., the
>>>>> >> + * &drm_i915_gem_execbuffer2.buffer_count must be 0).
>>>>> >> + * Also, &drm_i915_gem_execbuffer2.batch_start_offset and
>>>>> >> + * &drm_i915_gem_execbuffer2.batch_len must be 0.
>>>>> >> + * DRM_I915_GEM_EXECBUFFER_EXT_BATCH_ADDRESSES extension must 
>>>>> be provided
>>>>> >> + * to pass in the batch buffer addresses.
>>>>> >> + *
>>>>> >> + * Additionally, I915_EXEC_NO_RELOC, I915_EXEC_HANDLE_LUT and
>>>>> >> + * I915_EXEC_BATCH_FIRST of &drm_i915_gem_execbuffer2.flags 
>>>>> must be 0
>>>>> >> + * (not used) in VM_BIND mode. I915_EXEC_USE_EXTENSIONS flag 
>>>>> must always be
>>>>> >> + * set (See struct drm_i915_gem_execbuffer_ext_batch_addresses).
>>>>> >> + * The buffers_ptr, buffer_count, batch_start_offset and 
>>>>> batch_len fields
>>>>> >> + * of struct drm_i915_gem_execbuffer2 are also not used and 
>>>>> must be 0.
>>>>> >> + */
>>>>> >
>>>>> >From that description, it seems we have:
>>>>> >
>>>>> >struct drm_i915_gem_execbuffer2 {
>>>>> >        __u64 buffers_ptr;              -> must be 0 (new)
>>>>> >        __u32 buffer_count;             -> must be 0 (new)
>>>>> >        __u32 batch_start_offset;       -> must be 0 (new)
>>>>> >        __u32 batch_len;                -> must be 0 (new)
>>>>> >        __u32 DR1;                      -> must be 0 (old)
>>>>> >        __u32 DR4;                      -> must be 0 (old)
>>>>> >        __u32 num_cliprects; (fences)   -> must be 0 since using 
>>>>> extensions
>>>>> >        __u64 cliprects_ptr; (fences, extensions) -> contains an 
>>>>> actual pointer!
>>>>> >        __u64 flags;                    -> some flags must be 0 
>>>>> (new)
>>>>> >        __u64 rsvd1; (context info)     -> repurposed field (old)
>>>>> >        __u64 rsvd2;                    -> unused
>>>>> >};
>>>>> >
>>>>> >Based on that, why can't we just get drm_i915_gem_execbuffer3 
>>>>> instead
>>>>> >of adding even more complexity to an already abused interface? While
>>>>> >the Vulkan-like extension thing is really nice, I don't think what
>>>>> >we're doing here is extending the ioctl usage, we're completely
>>>>> >changing how the base struct should be interpreted based on how 
>>>>> the VM
>>>>> >was created (which is an entirely different ioctl).
>>>>> >
>>>>> >From Rusty Russel's API Design grading, drm_i915_gem_execbuffer2 is
>>>>> >already at -6 without these changes. I think after vm_bind we'll 
>>>>> need
>>>>> >to create a -11 entry just to deal with this ioctl.
>>>>> >
>>>>>
>>>>> The only change here is removing the execlist support for VM_BIND
>>>>> mode (other than natural extensions).
>>>>> Adding a new execbuffer3 was considered, but I think we need to be 
>>>>> careful
>>>>> with that as that goes beyond the VM_BIND support, including any 
>>>>> future
>>>>> requirements (as we don't want an execbuffer4 after VM_BIND).
>>>>
>>>> Why not? it's not like adding extensions here is really that different
>>>> than adding new ioctls.
>>>>
>>>> I definitely think this deserves an execbuffer3 without even
>>>> considering future requirements. Just  to burn down the old
>>>> requirements and pointless fields.
>>>>
>>>> Make execbuffer3 be vm bind only, no relocs, no legacy bits, leave the
>>>> older sw on execbuf2 for ever.
>>>
>>> I guess another point in favour of execbuf3 would be that it's less
>>> midlayer. If we share the entry point then there's quite a few vfuncs
>>> needed to cleanly split out the vm_bind paths from the legacy
>>> reloc/softpin paths.
>>>
>>> If we invert this and do execbuf3, then there's the existing ioctl
>>> vfunc, and then we share code (where it even makes sense, probably
>>> request setup/submit need to be shared, anything else is probably
>>> cleaner to just copypaste) with the usual helper approach.
>>>
>>> Also that would guarantee that really none of the old concepts like
>>> i915_active on the vma or vma open counts and all that stuff leaks
>>> into the new vm_bind execbuf.
>>>
>>> Finally I also think that copypasting would make backporting easier,
>>> or at least more flexible, since it should make it easier to have the
>>> upstream vm_bind co-exist with all the other things we have. Without
>>> huge amounts of conflicts (or at least much less) that pushing a pile
>>> of vfuncs into the existing code would cause.
>>>
>>> So maybe we should do this?
>>
>> Thanks Dave, Daniel.
>> There are a few things that will be common between execbuf2 and
>> execbuf3, like request setup/submit (as you said), fence handling 
>> (timeline fences, fence array, composite fences), engine selection,
>> etc. Also, many of the 'flags' will be there in execbuf3 also (but
>> bit position will differ).
>> But I guess these should be fine as the suggestion here is to
>> copy-paste the execbuf code and have shared code where possible.
>> Besides, we can stop supporting some older features in execbuf3
>> (like fence array in favor of newer timeline fences), which will
>> further reduce common code.
>>
>> Ok, I will update this series by adding execbuf3 and send it out soon.
>>
>
> Does this sound reasonable?
>
> struct drm_i915_gem_execbuffer3 {
>        __u32 ctx_id;        /* previously execbuffer2.rsvd1 */
>
>        __u32 batch_count;
>        __u64 batch_addr_ptr;    /* Pointer to an array of batch gpu 
> virtual addresses */


Quick question raised on IRC about the batches: Are multiple batches 
limited to virtual engines?


Thanks,


-Lionel


>
>        __u64 flags;
> #define I915_EXEC3_RING_MASK              (0x3f)
> #define I915_EXEC3_DEFAULT                (0<<0)
> #define I915_EXEC3_RENDER                 (1<<0)
> #define I915_EXEC3_BSD                    (2<<0)
> #define I915_EXEC3_BLT                    (3<<0)
> #define I915_EXEC3_VEBOX                  (4<<0)
>
> #define I915_EXEC3_SECURE               (1<<6)
> #define I915_EXEC3_IS_PINNED            (1<<7)
>
> #define I915_EXEC3_BSD_SHIFT     (8)
> #define I915_EXEC3_BSD_MASK      (3 << I915_EXEC3_BSD_SHIFT)
> #define I915_EXEC3_BSD_DEFAULT   (0 << I915_EXEC3_BSD_SHIFT)
> #define I915_EXEC3_BSD_RING1     (1 << I915_EXEC3_BSD_SHIFT)
> #define I915_EXEC3_BSD_RING2     (2 << I915_EXEC3_BSD_SHIFT)
>
> #define I915_EXEC3_FENCE_IN             (1<<10)
> #define I915_EXEC3_FENCE_OUT            (1<<11)
> #define I915_EXEC3_FENCE_SUBMIT         (1<<12)
>
>        __u64 in_out_fence;        /* previously execbuffer2.rsvd2 */
>
>        __u64 extensions;        /* currently only for 
> DRM_I915_GEM_EXECBUFFER_EXT_TIMELINE_FENCES */
> };
>
> With this, user can pass in batch addresses and count directly,
> instead of as an extension (as this rfc series was proposing).
>
> I have removed many of the flags which were either legacy or not
> applicable to VM_BIND mode.
> I have also removed fence array support (execbuffer2.cliprects_ptr)
> as we have timeline fence array support. Is that fine?
> Do we still need FENCE_IN/FENCE_OUT/FENCE_SUBMIT support?
>
> Anything else that needs to be added or removed?
>
> Niranjana
>
>> Niranjana
>>
>>> -Daniel
>>> -- 
>>> Daniel Vetter
>>> Software Engineer, Intel Corporation
>>> http://blog.ffwll.ch



^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [Intel-gfx] [RFC v3 3/3] drm/doc/rfc: VM_BIND uapi definition
  2022-06-07 19:37     ` Niranjana Vishwanathapura
@ 2022-06-08  7:17       ` Tvrtko Ursulin
  2022-06-08  9:12         ` Matthew Auld
  0 siblings, 1 reply; 121+ messages in thread
From: Tvrtko Ursulin @ 2022-06-08  7:17 UTC (permalink / raw)
  To: Niranjana Vishwanathapura
  Cc: intel-gfx, chris.p.wilson, thomas.hellstrom, dri-devel,
	daniel.vetter, christian.koenig, Matthew Auld


On 07/06/2022 20:37, Niranjana Vishwanathapura wrote:
> On Tue, Jun 07, 2022 at 11:27:14AM +0100, Tvrtko Ursulin wrote:
>>
>> On 17/05/2022 19:32, Niranjana Vishwanathapura wrote:
>>> VM_BIND and related uapi definitions
>>>
>>> v2: Ensure proper kernel-doc formatting with cross references.
>>>     Also add new uapi and documentation as per review comments
>>>     from Daniel.
>>>
>>> Signed-off-by: Niranjana Vishwanathapura 
>>> <niranjana.vishwanathapura@intel.com>
>>> ---
>>>  Documentation/gpu/rfc/i915_vm_bind.h | 399 +++++++++++++++++++++++++++
>>>  1 file changed, 399 insertions(+)
>>>  create mode 100644 Documentation/gpu/rfc/i915_vm_bind.h
>>>
>>> diff --git a/Documentation/gpu/rfc/i915_vm_bind.h 
>>> b/Documentation/gpu/rfc/i915_vm_bind.h
>>> new file mode 100644
>>> index 000000000000..589c0a009107
>>> --- /dev/null
>>> +++ b/Documentation/gpu/rfc/i915_vm_bind.h
>>> @@ -0,0 +1,399 @@
>>> +/* SPDX-License-Identifier: MIT */
>>> +/*
>>> + * Copyright © 2022 Intel Corporation
>>> + */
>>> +
>>> +/**
>>> + * DOC: I915_PARAM_HAS_VM_BIND
>>> + *
>>> + * VM_BIND feature availability.
>>> + * See typedef drm_i915_getparam_t param.
>>> + */
>>> +#define I915_PARAM_HAS_VM_BIND        57
>>> +
>>> +/**
>>> + * DOC: I915_VM_CREATE_FLAGS_USE_VM_BIND
>>> + *
>>> + * Flag to opt-in for VM_BIND mode of binding during VM creation.
>>> + * See struct drm_i915_gem_vm_control flags.
>>> + *
>>> + * A VM in VM_BIND mode will not support the older execbuff mode of 
>>> binding.
>>> + * In VM_BIND mode, execbuff ioctl will not accept any execlist 
>>> (ie., the
>>> + * &drm_i915_gem_execbuffer2.buffer_count must be 0).
>>> + * Also, &drm_i915_gem_execbuffer2.batch_start_offset and
>>> + * &drm_i915_gem_execbuffer2.batch_len must be 0.
>>> + * DRM_I915_GEM_EXECBUFFER_EXT_BATCH_ADDRESSES extension must be 
>>> provided
>>> + * to pass in the batch buffer addresses.
>>> + *
>>> + * Additionally, I915_EXEC_NO_RELOC, I915_EXEC_HANDLE_LUT and
>>> + * I915_EXEC_BATCH_FIRST of &drm_i915_gem_execbuffer2.flags must be 0
>>> + * (not used) in VM_BIND mode. I915_EXEC_USE_EXTENSIONS flag must 
>>> always be
>>> + * set (See struct drm_i915_gem_execbuffer_ext_batch_addresses).
>>> + * The buffers_ptr, buffer_count, batch_start_offset and batch_len 
>>> fields
>>> + * of struct drm_i915_gem_execbuffer2 are also not used and must be 0.
>>> + */
>>> +#define I915_VM_CREATE_FLAGS_USE_VM_BIND    (1 << 0)
>>> +
>>> +/**
>>> + * DOC: I915_CONTEXT_CREATE_FLAGS_LONG_RUNNING
>>> + *
>>> + * Flag to declare context as long running.
>>> + * See struct drm_i915_gem_context_create_ext flags.
>>> + *
>>> + * Usage of dma-fences expects that they complete in a reasonable 
>>> amount of time.
>>> + * Compute on the other hand can be long running. Hence it is not 
>>> appropriate
>>> + * for compute contexts to export request completion dma-fence to user.
>>> + * The dma-fence usage will be limited to in-kernel consumption only.
>>> + * Compute contexts need to use user/memory fence.
>>> + *
>>> + * So, long running contexts do not support output fences. Hence,
>>> + * I915_EXEC_FENCE_OUT (See &drm_i915_gem_execbuffer2.flags) and
>>> + * I915_EXEC_FENCE_SIGNAL (See &drm_i915_gem_exec_fence.flags) are 
>>> expected
>>> + * not to be used.
>>> + *
>>> + * DRM_I915_GEM_WAIT ioctl call is also not supported for objects 
>>> mapped
>>> + * to long running contexts.
>>> + */
>>> +#define I915_CONTEXT_CREATE_FLAGS_LONG_RUNNING   (1u << 2)
>>> +
>>> +/* VM_BIND related ioctls */
>>> +#define DRM_I915_GEM_VM_BIND        0x3d
>>> +#define DRM_I915_GEM_VM_UNBIND        0x3e
>>> +#define DRM_I915_GEM_WAIT_USER_FENCE    0x3f
>>> +
>>> +#define DRM_IOCTL_I915_GEM_VM_BIND        DRM_IOWR(DRM_COMMAND_BASE 
>>> + DRM_I915_GEM_VM_BIND, struct drm_i915_gem_vm_bind)
>>> +#define DRM_IOCTL_I915_GEM_VM_UNBIND        
>>> DRM_IOWR(DRM_COMMAND_BASE + DRM_I915_GEM_VM_UNBIND, struct 
>>> drm_i915_gem_vm_bind)
>>> +#define DRM_IOCTL_I915_GEM_WAIT_USER_FENCE    
>>> DRM_IOWR(DRM_COMMAND_BASE + DRM_I915_GEM_WAIT_USER_FENCE, struct 
>>> drm_i915_gem_wait_user_fence)
>>> +
>>> +/**
>>> + * struct drm_i915_gem_vm_bind - VA to object mapping to bind.
>>> + *
>>> + * This structure is passed to VM_BIND ioctl and specifies the 
>>> mapping of GPU
>>> + * virtual address (VA) range to the section of an object that 
>>> should be bound
>>> + * in the device page table of the specified address space (VM).
>>> + * The VA range specified must be unique (ie., not currently bound) 
>>> and can
>>> + * be mapped to whole object or a section of the object (partial 
>>> binding).
>>> + * Multiple VA mappings can be created to the same section of the 
>>> object
>>> + * (aliasing).
>>> + */
>>> +struct drm_i915_gem_vm_bind {
>>> +    /** @vm_id: VM (address space) id to bind */
>>> +    __u32 vm_id;
>>> +
>>> +    /** @handle: Object handle */
>>> +    __u32 handle;
>>> +
>>> +    /** @start: Virtual Address start to bind */
>>> +    __u64 start;
>>> +
>>> +    /** @offset: Offset in object to bind */
>>> +    __u64 offset;
>>> +
>>> +    /** @length: Length of mapping to bind */
>>> +    __u64 length;
>>
>> Does it support, or should it, the equivalent of EXEC_OBJECT_PAD_TO_SIZE? 
>> Or if not, is userspace expected to map the remainder of the space to a 
>> dummy object? In which case, would there be any alignment/padding 
>> issues preventing the two binds from being placed next to each other?
>>
>> I ask because someone from the compute side asked me about a problem 
>> with their strategy of dealing with overfetch and I suggested pad to 
>> size.
>>
> 
> Thanks Tvrtko,
> I think we shouldn't be needing it. As with VM_BIND VA assignment
> is completely pushed to userspace, no padding should be necessary
> once the 'start' and 'size' alignment conditions are met.
> 
> I will add some documentation on alignment requirement here.
> Generally, 'start' and 'size' should be 4K aligned. But, I think
> when we have 64K lmem page sizes (dg2 and xehpsdv), they need to
> be 64K aligned.

+ Matt

Is aligning to 64k enough for all overfetch issues?

Apparently compute has a situation where a buffer is received by one 
component and another has to apply more alignment to it, to deal with 
overfetch. Since they cannot grow the actual BO, could they VM_BIND a 
scratch area on top instead? Or perhaps none of this is a problem on 
discrete and the original BO should be correctly allocated to start with.
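
In case a sketch helps the discussion, the scratch variant I have in
mind would be roughly the below, against the proposed uapi;
scratch_handle is a UMD-allocated dummy BO and obj_size is assumed to
be page aligned:

    /* Bind the real object, then pad the tail of its VA range up to the
     * next 64K boundary with the dummy BO, so overfetch past the object
     * never lands on an unmapped page.
     */
    __u64 padded = (obj_size + 0xffffull) & ~0xffffull;
    struct drm_i915_gem_vm_bind bind_obj = {
            .vm_id = vm_id,
            .handle = obj_handle,
            .start = va,
            .offset = 0,
            .length = obj_size,
    };
    struct drm_i915_gem_vm_bind bind_pad = {
            .vm_id = vm_id,
            .handle = scratch_handle,
            .start = va + obj_size,
            .offset = 0,
            .length = padded - obj_size,
    };
    ioctl(fd, DRM_IOCTL_I915_GEM_VM_BIND, &bind_obj);
    if (padded > obj_size)
            ioctl(fd, DRM_IOCTL_I915_GEM_VM_BIND, &bind_pad);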

Side question - what about the 2MiB alignment mentioned in 
i915_vma_insert to avoid mixing 4k and 64k PTEs? Does that not apply to 
discrete?

Regards,

Tvrtko

> 
> Niranjana
> 
>> Regards,
>>
>> Tvrtko
>>
>>> +
>>> +    /**
>>> +     * @flags: Supported flags are,
>>> +     *
>>> +     * I915_GEM_VM_BIND_READONLY:
>>> +     * Mapping is read-only.
>>> +     *
>>> +     * I915_GEM_VM_BIND_CAPTURE:
>>> +     * Capture this mapping in the dump upon GPU error.
>>> +     */
>>> +    __u64 flags;
>>> +#define I915_GEM_VM_BIND_READONLY    (1 << 0)
>>> +#define I915_GEM_VM_BIND_CAPTURE     (1 << 1)
>>> +
>>> +    /** @extensions: 0-terminated chain of extensions for this 
>>> mapping. */
>>> +    __u64 extensions;
>>> +};
>>> +
>>> +/**
>>> + * struct drm_i915_gem_vm_unbind - VA to object mapping to unbind.
>>> + *
>>> + * This structure is passed to VM_UNBIND ioctl and specifies the GPU 
>>> virtual
>>> + * address (VA) range that should be unbound from the device page 
>>> table of the
>>> + * specified address space (VM). The specified VA range must match 
>>> one of the
>>> + * mappings created with the VM_BIND ioctl. TLB is flushed upon unbind
>>> + * completion.
>>> + */
>>> +struct drm_i915_gem_vm_unbind {
>>> +    /** @vm_id: VM (address space) id to bind */
>>> +    __u32 vm_id;
>>> +
>>> +    /** @rsvd: Reserved for future use; must be zero. */
>>> +    __u32 rsvd;
>>> +
>>> +    /** @start: Virtual Address start to unbind */
>>> +    __u64 start;
>>> +
>>> +    /** @length: Length of mapping to unbind */
>>> +    __u64 length;
>>> +
>>> +    /** @flags: reserved for future usage, currently MBZ */
>>> +    __u64 flags;
>>> +
>>> +    /** @extensions: 0-terminated chain of extensions for this 
>>> mapping. */
>>> +    __u64 extensions;
>>> +};
>>> +
>>> +/**
>>> + * struct drm_i915_vm_bind_fence - An input or output fence for the 
>>> vm_bind
>>> + * or the vm_unbind work.
>>> + *
>>> + * The vm_bind or vm_unbind async worker will wait for the input fence to 
>>> signal
>>> + * before starting the binding or unbinding.
>>> + *
>>> + * The vm_bind or vm_unbind async worker will signal the returned 
>>> output fence
>>> + * after the completion of binding or unbinding.
>>> + */
>>> +struct drm_i915_vm_bind_fence {
>>> +    /** @handle: User's handle for a drm_syncobj to wait on or 
>>> signal. */
>>> +    __u32 handle;
>>> +
>>> +    /**
>>> +     * @flags: Supported flags are,
>>> +     *
>>> +     * I915_VM_BIND_FENCE_WAIT:
>>> +     * Wait for the input fence before binding/unbinding
>>> +     *
>>> +     * I915_VM_BIND_FENCE_SIGNAL:
>>> +     * Return bind/unbind completion fence as output
>>> +     */
>>> +    __u32 flags;
>>> +#define I915_VM_BIND_FENCE_WAIT            (1<<0)
>>> +#define I915_VM_BIND_FENCE_SIGNAL          (1<<1)
>>> +#define __I915_VM_BIND_FENCE_UNKNOWN_FLAGS 
>>> (-(I915_VM_BIND_FENCE_SIGNAL << 1))
>>> +};
>>> +
>>> +/**
>>> + * struct drm_i915_vm_bind_ext_timeline_fences - Timeline fences for 
>>> vm_bind
>>> + * and vm_unbind.
>>> + *
>>> + * This structure describes an array of timeline drm_syncobj and 
>>> associated
>>> + * points for timeline variants of drm_syncobj. These timeline 
>>> 'drm_syncobj's
>>> + * can be input or output fences (See struct drm_i915_vm_bind_fence).
>>> + */
>>> +struct drm_i915_vm_bind_ext_timeline_fences {
>>> +#define I915_VM_BIND_EXT_TIMELINE_FENCES    0
>>> +    /** @base: Extension link. See struct i915_user_extension. */
>>> +    struct i915_user_extension base;
>>> +
>>> +    /**
>>> +     * @fence_count: Number of elements in the @handles_ptr & 
>>> @value_ptr
>>> +     * arrays.
>>> +     */
>>> +    __u64 fence_count;
>>> +
>>> +    /**
>>> +     * @handles_ptr: Pointer to an array of struct 
>>> drm_i915_vm_bind_fence
>>> +     * of length @fence_count.
>>> +     */
>>> +    __u64 handles_ptr;
>>> +
>>> +    /**
>>> +     * @values_ptr: Pointer to an array of u64 values of length
>>> +     * @fence_count.
>>> +     * Values must be 0 for a binary drm_syncobj. A value of 0 for a
>>> +     * timeline drm_syncobj is invalid as it turns a drm_syncobj into a
>>> +     * binary one.
>>> +     */
>>> +    __u64 values_ptr;
>>> +};
>>> +
>>> +/**
>>> + * struct drm_i915_vm_bind_user_fence - An input or output user 
>>> fence for the
>>> + * vm_bind or the vm_unbind work.
>>> + *
>>> + * The vm_bind or vm_unbind async worker will wait for the input 
>>> fence (value at
>>> + * @addr to become equal to @val) before starting the binding or 
>>> unbinding.
>>> + *
>>> + * The vm_bind or vm_unbind async worker will signal the output 
>>> fence after
>>> + * the completion of binding or unbinding by writing @val to memory 
>>> location at
>>> + * @addr
>>> + */
>>> +struct drm_i915_vm_bind_user_fence {
>>> +    /** @addr: User/Memory fence qword aligned process virtual 
>>> address */
>>> +    __u64 addr;
>>> +
>>> +    /** @val: User/Memory fence value to be written after bind 
>>> completion */
>>> +    __u64 val;
>>> +
>>> +    /**
>>> +     * @flags: Supported flags are,
>>> +     *
>>> +     * I915_VM_BIND_USER_FENCE_WAIT:
>>> +     * Wait for the input fence before binding/unbinding
>>> +     *
>>> +     * I915_VM_BIND_USER_FENCE_SIGNAL:
>>> +     * Return bind/unbind completion fence as output
>>> +     */
>>> +    __u32 flags;
>>> +#define I915_VM_BIND_USER_FENCE_WAIT            (1<<0)
>>> +#define I915_VM_BIND_USER_FENCE_SIGNAL          (1<<1)
>>> +#define __I915_VM_BIND_USER_FENCE_UNKNOWN_FLAGS \
>>> +    (-(I915_VM_BIND_USER_FENCE_SIGNAL << 1))
>>> +};
>>> +
>>> +/**
>>> + * struct drm_i915_vm_bind_ext_user_fence - User/memory fences for 
>>> vm_bind
>>> + * and vm_unbind.
>>> + *
>>> + * These user fences can be input or output fences
>>> + * (See struct drm_i915_vm_bind_user_fence).
>>> + */
>>> +struct drm_i915_vm_bind_ext_user_fence {
>>> +#define I915_VM_BIND_EXT_USER_FENCES    1
>>> +    /** @base: Extension link. See struct i915_user_extension. */
>>> +    struct i915_user_extension base;
>>> +
>>> +    /** @fence_count: Number of elements in the @user_fence_ptr 
>>> array. */
>>> +    __u64 fence_count;
>>> +
>>> +    /**
>>> +     * @user_fence_ptr: Pointer to an array of
>>> +     * struct drm_i915_vm_bind_user_fence of length @fence_count.
>>> +     */
>>> +    __u64 user_fence_ptr;
>>> +};
>>> +
>>> +/**
>>> + * struct drm_i915_gem_execbuffer_ext_batch_addresses - Array of 
>>> batch buffer
>>> + * gpu virtual addresses.
>>> + *
>>> + * In the execbuff ioctl (See struct drm_i915_gem_execbuffer2), this 
>>> extension
>>> + * must always be appended in the VM_BIND mode and it will be an 
>>> error to
>>> + * append this extension in older non-VM_BIND mode.
>>> + */
>>> +struct drm_i915_gem_execbuffer_ext_batch_addresses {
>>> +#define DRM_I915_GEM_EXECBUFFER_EXT_BATCH_ADDRESSES    1
>>> +    /** @base: Extension link. See struct i915_user_extension. */
>>> +    struct i915_user_extension base;
>>> +
>>> +    /** @count: Number of addresses in the addr array. */
>>> +    __u32 count;
>>> +
>>> +    /** @addr: An array of batch gpu virtual addresses. */
>>> +    __u64 addr[0];
>>> +};
>>> +
>>> +/**
>>> + * struct drm_i915_gem_execbuffer_ext_user_fence - First level batch 
>>> completion
>>> + * signaling extension.
>>> + *
>>> + * This extension allows user to attach a user fence (@addr, @value 
>>> pair) to an
>>> + * execbuf to be signaled by the command streamer after the 
>>> completion of first
>>> + * level batch, by writing the @value at specified @addr and 
>>> triggering an
>>> + * interrupt.
>>> + * User can either poll for this user fence to signal or can also 
>>> wait on it
>>> + * with i915_gem_wait_user_fence ioctl.
>>> + * This is very useful for long running contexts where waiting 
>>> on dma-fence
>>> + * by user (like i915_gem_wait ioctl) is not supported.
>>> + */
>>> +struct drm_i915_gem_execbuffer_ext_user_fence {
>>> +#define DRM_I915_GEM_EXECBUFFER_EXT_USER_FENCE        2
>>> +    /** @base: Extension link. See struct i915_user_extension. */
>>> +    struct i915_user_extension base;
>>> +
>>> +    /**
>>> +     * @addr: User/Memory fence qword aligned GPU virtual address.
>>> +     *
>>> +     * Address has to be a valid GPU virtual address at the time of
>>> +     * first level batch completion.
>>> +     */
>>> +    __u64 addr;
>>> +
>>> +    /**
>>> +     * @value: User/Memory fence Value to be written to above address
>>> +     * after first level batch completes.
>>> +     */
>>> +    __u64 value;
>>> +
>>> +    /** @rsvd: Reserved for future extensions, MBZ */
>>> +    __u64 rsvd;
>>> +};
>>> +
>>> +/**
>>> + * struct drm_i915_gem_create_ext_vm_private - Extension to make the 
>>> object
>>> + * private to the specified VM.
>>> + *
>>> + * See struct drm_i915_gem_create_ext.
>>> + */
>>> +struct drm_i915_gem_create_ext_vm_private {
>>> +#define I915_GEM_CREATE_EXT_VM_PRIVATE        2
>>> +    /** @base: Extension link. See struct i915_user_extension. */
>>> +    struct i915_user_extension base;
>>> +
>>> +    /** @vm_id: Id of the VM to which the object is private */
>>> +    __u32 vm_id;
>>> +};
>>> +
>>> +/**
>>> + * struct drm_i915_gem_wait_user_fence - Wait on user/memory fence.
>>> + *
>>> + * User/Memory fence can be woken up either by:
>>> + *
>>> + * 1. GPU context indicated by @ctx_id, or,
>>> + * 2. Kernel driver async worker upon I915_UFENCE_WAIT_SOFT.
>>> + *    @ctx_id is ignored when this flag is set.
>>> + *
>>> + * Wakeup condition is,
>>> + * ``((*addr & mask) op (value & mask))``
>>> + *
>>> + * See :ref:`Documentation/driver-api/dma-buf.rst 
>>> <indefinite_dma_fences>`
>>> + */
>>> +struct drm_i915_gem_wait_user_fence {
>>> +    /** @extensions: Zero-terminated chain of extensions. */
>>> +    __u64 extensions;
>>> +
>>> +    /** @addr: User/Memory fence address */
>>> +    __u64 addr;
>>> +
>>> +    /** @ctx_id: Id of the Context which will signal the fence. */
>>> +    __u32 ctx_id;
>>> +
>>> +    /** @op: Wakeup condition operator */
>>> +    __u16 op;
>>> +#define I915_UFENCE_WAIT_EQ      0
>>> +#define I915_UFENCE_WAIT_NEQ     1
>>> +#define I915_UFENCE_WAIT_GT      2
>>> +#define I915_UFENCE_WAIT_GTE     3
>>> +#define I915_UFENCE_WAIT_LT      4
>>> +#define I915_UFENCE_WAIT_LTE     5
>>> +#define I915_UFENCE_WAIT_BEFORE  6
>>> +#define I915_UFENCE_WAIT_AFTER   7
>>> +
>>> +    /**
>>> +     * @flags: Supported flags are,
>>> +     *
>>> +     * I915_UFENCE_WAIT_SOFT:
>>> +     *
>>> +     * To be woken up by i915 driver async worker (not by GPU).
>>> +     *
>>> +     * I915_UFENCE_WAIT_ABSTIME:
>>> +     *
>>> +     * Wait timeout specified as absolute time.
>>> +     */
>>> +    __u16 flags;
>>> +#define I915_UFENCE_WAIT_SOFT    0x1
>>> +#define I915_UFENCE_WAIT_ABSTIME 0x2
>>> +
>>> +    /** @value: Wakeup value */
>>> +    __u64 value;
>>> +
>>> +    /** @mask: Wakeup mask */
>>> +    __u64 mask;
>>> +#define I915_UFENCE_WAIT_U8     0xffu
>>> +#define I915_UFENCE_WAIT_U16    0xffffu
>>> +#define I915_UFENCE_WAIT_U32    0xfffffffful
>>> +#define I915_UFENCE_WAIT_U64    0xffffffffffffffffull
>>> +
>>> +    /**
>>> +     * @timeout: Wait timeout in nanoseconds.
>>> +     *
>>> +     * If I915_UFENCE_WAIT_ABSTIME flag is set, then the timeout is 
>>> the
>>> +     * absolute time in nsec.
>>> +     */
>>> +    __s64 timeout;
>>> +};
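
(Aside, to check my reading of the wakeup condition above: a typical
wait would then be the sketch below, using only names from the proposed
header, with fence_cpu_addr/ctx_id/target_value as placeholders.)

    /* Sleep until ((*addr & mask) >= (value & mask)), i.e. until the
     * fence qword reaches target_value, with a 1s relative timeout.
     */
    struct drm_i915_gem_wait_user_fence wait = {
            .addr = fence_cpu_addr,     /* qword-aligned process VA */
            .ctx_id = ctx_id,           /* context expected to signal */
            .op = I915_UFENCE_WAIT_GTE,
            .flags = 0,                 /* GPU wakeup, relative timeout */
            .value = target_value,
            .mask = I915_UFENCE_WAIT_U64,
            .timeout = 1000000000,      /* 1s in ns */
    };
    ioctl(fd, DRM_IOCTL_I915_GEM_WAIT_USER_FENCE, &wait);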

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [Intel-gfx] [RFC v3 1/3] drm/doc/rfc: VM_BIND feature design document
  2022-06-07 21:32                   ` Niranjana Vishwanathapura
@ 2022-06-08  7:33                     ` Tvrtko Ursulin
  2022-06-08 21:44                         ` Niranjana Vishwanathapura
  0 siblings, 1 reply; 121+ messages in thread
From: Tvrtko Ursulin @ 2022-06-08  7:33 UTC (permalink / raw)
  To: Niranjana Vishwanathapura, Jason Ekstrand
  Cc: Intel GFX, Chris Wilson, Thomas Hellstrom,
	Maling list - DRI developers, Daniel Vetter,
	Christian König



On 07/06/2022 22:32, Niranjana Vishwanathapura wrote:
> On Tue, Jun 07, 2022 at 11:18:11AM -0700, Niranjana Vishwanathapura wrote:
>> On Tue, Jun 07, 2022 at 12:12:03PM -0500, Jason Ekstrand wrote:
>>>  On Fri, Jun 3, 2022 at 6:52 PM Niranjana Vishwanathapura
>>>  <niranjana.vishwanathapura@intel.com> wrote:
>>>
>>>    On Fri, Jun 03, 2022 at 10:20:25AM +0300, Lionel Landwerlin wrote:
>>>    >   On 02/06/2022 23:35, Jason Ekstrand wrote:
>>>    >
>>>    >     On Thu, Jun 2, 2022 at 3:11 PM Niranjana Vishwanathapura
>>>    >     <niranjana.vishwanathapura@intel.com> wrote:
>>>    >
>>>    >       On Wed, Jun 01, 2022 at 01:28:36PM -0700, Matthew Brost 
>>> wrote:
>>>    >       >On Wed, Jun 01, 2022 at 05:25:49PM +0300, Lionel Landwerlin
>>>    wrote:
>>>    >       >> On 17/05/2022 21:32, Niranjana Vishwanathapura wrote:
>>>    >       >> > +VM_BIND/UNBIND ioctl will immediately start
>>>    binding/unbinding
>>>    >       the mapping in an
>>>    >       >> > +async worker. The binding and unbinding will work 
>>> like a
>>>    special
>>>    >       GPU engine.
>>>    >       >> > +The binding and unbinding operations are serialized and
>>>    will
>>>    >       wait on specified
>>>    >       >> > +input fences before the operation and will signal the
>>>    output
>>>    >       fences upon the
>>>    >       >> > +completion of the operation. Due to serialization,
>>>    completion of
>>>    >       an operation
>>>    >       >> > +will also indicate that all previous operations are 
>>> also
>>>    >       complete.
>>>    >       >>
>>>    >       >> I guess we should avoid saying "will immediately start
>>>    >       binding/unbinding" if
>>>    >       >> there are fences involved.
>>>    >       >>
>>>    >       >> And the fact that it's happening in an async worker 
>>> seem to
>>>    imply
>>>    >       it's not
>>>    >       >> immediate.
>>>    >       >>
>>>    >
>>>    >       Ok, will fix.
>>>    >       This was added because in earlier design binding was deferred
>>>    until
>>>    >       next execbuff.
>>>    >       But now it is non-deferred (immediate in that sense). But 
>>> yah,
>>>    this is
>>>    >       confusing
>>>    >       and will fix it.
>>>    >
>>>    >       >>
>>>    >       >> I have a question on the behavior of the bind operation 
>>> when
>>>    no
>>>    >       input fence
>>>    >       >> is provided. Let say I do :
>>>    >       >>
>>>    >       >> VM_BIND (out_fence=fence1)
>>>    >       >>
>>>    >       >> VM_BIND (out_fence=fence2)
>>>    >       >>
>>>    >       >> VM_BIND (out_fence=fence3)
>>>    >       >>
>>>    >       >>
>>>    >       >> In what order are the fences going to be signaled?
>>>    >       >>
>>>    >       >> In the order of VM_BIND ioctls? Or out of order?
>>>    >       >>
>>>    >       >> Because you wrote "serialized" I assume it's: in order
>>>    >       >>
>>>    >
>>>    >       Yes, in the order of VM_BIND/UNBIND ioctls. Note that bind 
>>> and
>>>    unbind
>>>    >       will use
>>>    >       the same queue and hence are ordered.
>>>    >
>>>    >       >>
>>>    >       >> One thing I didn't realize is that because we only get one
>>>    >       "VM_BIND" engine,
>>>    >       >> there is a disconnect from the Vulkan specification.
>>>    >       >>
>>>    >       >> In Vulkan VM_BIND operations are serialized but per 
>>> engine.
>>>    >       >>
>>>    >       >> So you could have something like this :
>>>    >       >>
>>>    >       >> VM_BIND (engine=rcs0, in_fence=fence1, out_fence=fence2)
>>>    >       >>
>>>    >       >> VM_BIND (engine=ccs0, in_fence=fence3, out_fence=fence4)
>>>    >       >>
>>>    >       >>
>>>    >       >> fence1 is not signaled
>>>    >       >>
>>>    >       >> fence3 is signaled
>>>    >       >>
>>>    >       >> So the second VM_BIND will proceed before the first 
>>> VM_BIND.
>>>    >       >>
>>>    >       >>
>>>    >       >> I guess we can deal with that scenario in userspace by 
>>> doing
>>>    the
>>>    >       wait
>>>    >       >> ourselves in one thread per engines.
>>>    >       >>
>>>    >       >> But then it makes the VM_BIND input fences useless.
>>>    >       >>
>>>    >       >>
>>>    >       >> Daniel : what do you think? Should be rework this or just
>>>    deal with
>>>    >       wait
>>>    >       >> fences in userspace?
>>>    >       >>
>>>    >       >
>>>    >       >My opinion is rework this but make the ordering via an 
>>> engine
>>>    param
>>>    >       optional.
>>>    >       >
>>>    >       >e.g. A VM can be configured so all binds are ordered 
>>> within the
>>>    VM
>>>    >       >
>>>    >       >e.g. A VM can be configured so all binds accept an engine
>>>    argument
>>>    >       (in
>>>    >       >the case of the i915 likely this is a gem context handle) 
>>> and
>>>    binds
>>>    >       >ordered with respect to that engine.
>>>    >       >
>>>    >       >This gives UMDs options as the later likely consumes more 
>>> KMD
>>>    >       resources
>>>    >       >so if a different UMD can live with binds being ordered 
>>> within
>>>    the VM
>>>    >       >they can use a mode consuming less resources.
>>>    >       >
>>>    >
>>>    >       I think we need to be careful here if we are looking for some
>>>    out of
>>>    >       (submission) order completion of vm_bind/unbind.
>>>    >       In-order completion means, in a batch of binds and unbinds 
>>> to be
>>>    >       completed in-order, user only needs to specify in-fence 
>>> for the
>>>    >       first bind/unbind call and the out-fence for the last
>>>    bind/unbind
>>>    >       call. Also, the VA released by an unbind call can be 
>>> re-used by
>>>    >       any subsequent bind call in that in-order batch.
>>>    >
>>>    >       These things will break if binding/unbinding were to be 
>>> allowed
>>>    to
>>>    >       go out of order (of submission) and user need to be extra
>>>    careful
>>>    >       not to run into premature triggering of the out-fence and bind
>>>    failing
>>>    >       as VA is still in use etc.
>>>    >
>>>    >       Also, VM_BIND binds the provided mapping on the specified
>>>    address
>>>    >       space
>>>    >       (VM). So, the uapi is not engine/context specific.
>>>    >
>>>    >       We can however add a 'queue' to the uapi which can be one 
>>> from
>>>    the
>>>    >       pre-defined queues,
>>>    >       I915_VM_BIND_QUEUE_0
>>>    >       I915_VM_BIND_QUEUE_1
>>>    >       ...
>>>    >       I915_VM_BIND_QUEUE_(N-1)
>>>    >
>>>    >       KMD will spawn an async work queue for each queue which will
>>>    only
>>>    >       bind the mappings on that queue in the order of submission.
>>>    >       User can assign the queue to per engine or anything like 
>>> that.
>>>    >
>>>    >       But again here, user need to be careful and not deadlock 
>>> these
>>>    >       queues with circular dependency of fences.
>>>    >
>>>    >       I prefer adding this later as an extension based on 
>>> whether it
>>>    >       is really helping with the implementation.
>>>    >
>>>    >     I can tell you right now that having everything on a single
>>>    in-order
>>>    >     queue will not get us the perf we want.  What vulkan really 
>>> wants
>>>    is one
>>>    >     of two things:
>>>    >      1. No implicit ordering of VM_BIND ops.  They just happen in
>>>    whatever
>>>    >     their dependencies are resolved and we ensure ordering 
>>> ourselves
>>>    by
>>>    >     having a syncobj in the VkQueue.
>>>    >      2. The ability to create multiple VM_BIND queues.  We need at
>>>    least 2
>>>    >     but I don't see why there needs to be a limit besides the 
>>> limits
>>>    the
>>>    >     i915 API already has on the number of engines.  Vulkan could
>>>    expose
>>>    >     multiple sparse binding queues to the client if it's not
>>>    arbitrarily
>>>    >     limited.
>>>
>>>    Thanks Jason, Lionel.
>>>
>>>    Jason, what are you referring to when you say "limits the i915 API
>>>    already
>>>    has on the number of engines"? I am not sure if there is such an uapi
>>>    today.
>>>
>>>  There's a limit of something like 64 total engines today based on the
>>>  number of bits we can cram into the exec flags in execbuffer2.  I think
>>>  someone had an extended version that allowed more but I ripped it out
>>>  because no one was using it.  Of course, execbuffer3 might not have 
>>> that
>>>  problem at all.
>>>
>>
>> Thanks Jason.
>> Ok, I am not sure which exec flag that is, but yah, execbuffer3 probably
>> will not have this limitation. So, we need to define a VM_BIND_MAX_QUEUE
>> and somehow export it to the user (I am thinking of embedding it in
>> I915_PARAM_HAS_VM_BIND: bits[0]->HAS_VM_BIND, bits[1-3]->'n', meaning 2^n
>> queues).
> 
> Ah, I think you are talking about I915_EXEC_RING_MASK (0x3f) which execbuf3
> will also have. So, we can simply define in vm_bind/unbind structures,
> 
> #define I915_VM_BIND_MAX_QUEUE   64
>         __u32 queue;
> 
> I think that will keep things simple.
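
(Spelling the proposed encoding out, userspace discovery would
presumably be the sketch below, assuming the bit layout lands as
described:)

    int value = 0;
    struct drm_i915_getparam gp = {
            .param = I915_PARAM_HAS_VM_BIND,    /* proposed param 57 */
            .value = &value,
    };
    ioctl(fd, DRM_IOCTL_I915_GETPARAM, &gp);
    int has_vm_bind = value & 0x1;                      /* bit 0 */
    unsigned int nqueues = 1u << ((value >> 1) & 0x7);  /* bits 1-3 = n */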

Hmmm? What does the execbuf2 limit have to do with how many engines the 
hardware can have? I suggest not to do that.

The change which added this:

	if (set.num_engines > I915_EXEC_RING_MASK + 1)
		return -EINVAL;

to context creation needs to be undone, so users can create engine 
maps with all hardware engines and let execbuf3 access them all.

Regards,

Tvrtko

> 
> Niranjana
> 
>>
>>>    I am trying to see how many queues we need and don't want it to be
>>>    arbitrarily
>>>    large and unduly blow up memory usage and complexity in the i915 driver.
>>>
>>>  I expect a Vulkan driver to use at most 2 in the vast majority of 
>>> cases. I
>>>  could imagine a client wanting to create more than 1 sparse queue in 
>>> which
>>>  case, it'll be N+1 but that's unlikely.  As far as complexity goes, 
>>> once
>>>  you allow two, I don't think the complexity is going up by allowing 
>>> N.  As
>>>  for memory usage, creating more queues means more memory.  That's a
>>>  trade-off that userspace can make.  Again, the expected number here 
>>> is 1
>>>  or 2 in the vast majority of cases so I don't think you need to worry.
>>
>> Ok, will start with n=3 meaning 8 queues.
>> That would require us create 8 workqueues.
>> We can change 'n' later if required.
>>
>> Niranjana
>>
>>>
>>>    >     Why?  Because Vulkan has two basic kind of bind operations 
>>> and we
>>>    don't
>>>    >     want any dependencies between them:
>>>    >      1. Immediate.  These happen right after BO creation or 
>>> maybe as
>>>    part of
>>>    >     vkBindImageMemory() or vkBindBufferMemory().  These don't 
>>> happen
>>>    on a
>>>    >     queue and we don't want them serialized with anything.  To
>>>    synchronize
>>>    >     with submit, we'll have a syncobj in the VkDevice which is
>>>    signaled by
>>>    >     all immediate bind operations and make submits wait on it.
>>>    >      2. Queued (sparse): These happen on a VkQueue which may be the
>>>    same as
>>>    >     a render/compute queue or may be its own queue.  It's up to us
>>>    what we
>>>    >     want to advertise.  From the Vulkan API PoV, this is like any
>>>    other
>>>    >     queue.  Operations on it wait on and signal semaphores.  If we
>>>    have a
>>>    >     VM_BIND engine, we'd provide syncobjs to wait and signal 
>>> just like
>>>    we do
>>>    >     in execbuf().
>>>    >     The important thing is that we don't want one type of 
>>> operation to
>>>    block
>>>    >     on the other.  If immediate binds are blocking on sparse binds,
>>>    it's
>>>    >     going to cause over-synchronization issues.
>>>    >     In terms of the internal implementation, I know that there's 
>>> going
>>>    to be
>>>    >     a lock on the VM and that we can't actually do these things in
>>>    >     parallel.  That's fine.  Once the dma_fences have signaled and
>>>    we're
>>>
>>>    That's correct. It is like a single VM_BIND engine with multiple 
>>> queues
>>>    feeding to it.
>>>
>>>  Right.  As long as the queues themselves are independent and can 
>>> block on
>>>  dma_fences without holding up other queues, I think we're fine.
>>>
>>>    >     unblocked to do the bind operation, I don't care if there's 
>>> a bit
>>>    of
>>>    >     synchronization due to locking.  That's expected.  What we 
>>> can't
>>>    afford
>>>    >     to have is an immediate bind operation suddenly blocking on a
>>>    sparse
>>>    >     operation which is blocked on a compute job that's going to run
>>>    for
>>>    >     another 5ms.
>>>
>>>    As the VM_BIND queue is per VM, VM_BIND on one VM doesn't block the
>>>    VM_BIND
>>>    on other VMs. I am not sure about usecases here, but just wanted to
>>>    clarify.
>>>
>>>  Yes, that's what I would expect.
>>>  --Jason
>>>
>>>    Niranjana
>>>
>>>    >     For reference, Windows solves this by allowing arbitrarily many
>>>    paging
>>>    >     queues (what they call a VM_BIND engine/queue).  That design 
>>> works
>>>    >     pretty well and solves the problems in question.  Again, we 
>>> could
>>>    just
>>>    >     make everything out-of-order and require using syncobjs to 
>>> order
>>>    things
>>>    >     as userspace wants. That'd be fine too.
>>>    >     One more note while I'm here: danvet said something on IRC 
>>> about
>>>    VM_BIND
>>>    >     queues waiting for syncobjs to materialize.  We don't really
>>>    want/need
>>>    >     this.  We already have all the machinery in userspace to handle
>>>    >     wait-before-signal and waiting for syncobj fences to 
>>> materialize
>>>    and
>>>    >     that machinery is on by default.  It would actually take 
>>> MORE work
>>>    in
>>>    >     Mesa to turn it off and take advantage of the kernel being 
>>> able to
>>>    wait
>>>    >     for syncobjs to materialize.  Also, getting that right is
>>>    ridiculously
>>>    >     hard and I really don't want to get it wrong in kernel 
>>> space. When we
>>>    >     do memory fences, wait-before-signal will be a thing.  We don't
>>>    need to
>>>    >     try and make it a thing for syncobj.
>>>    >     --Jason
>>>    >
>>>    >   Thanks Jason,
>>>    >
>>>    >   I missed the bit in the Vulkan spec that we're allowed to have a
>>>    sparse
>>>    >   queue that does not implement either graphics or compute 
>>> operations
>>>    :
>>>    >
>>>    >     "While some implementations may include
>>>    VK_QUEUE_SPARSE_BINDING_BIT
>>>    >     support in queue families that also include
>>>    >
>>>    >      graphics and compute support, other implementations may only
>>>    expose a
>>>    >     VK_QUEUE_SPARSE_BINDING_BIT-only queue
>>>    >
>>>    >      family."
>>>    >
>>>    >   So it can all be all a vm_bind engine that just does bind/unbind
>>>    >   operations.
>>>    >
>>>    >   But yes we need another engine for the immediate/non-sparse
>>>    operations.
>>>    >
>>>    >   -Lionel
>>>    >
>>>    >         >
>>>    >       Daniel, any thoughts?
>>>    >
>>>    >       Niranjana
>>>    >
>>>    >       >Matt
>>>    >       >
>>>    >       >>
>>>    >       >> Sorry I noticed this late.
>>>    >       >>
>>>    >       >>
>>>    >       >> -Lionel
>>>    >       >>
>>>    >       >>

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [Intel-gfx] [RFC v3 3/3] drm/doc/rfc: VM_BIND uapi definition
  2022-06-07 21:25                 ` Niranjana Vishwanathapura
@ 2022-06-08  7:34                   ` Tvrtko Ursulin
  2022-06-08 19:52                     ` Niranjana Vishwanathapura
  0 siblings, 1 reply; 121+ messages in thread
From: Tvrtko Ursulin @ 2022-06-08  7:34 UTC (permalink / raw)
  To: Niranjana Vishwanathapura
  Cc: Zanoni, Paulo R, intel-gfx, dri-devel, Hellstrom, Thomas, Wilson,
	Chris P, Vetter, Daniel, christian.koenig


On 07/06/2022 22:25, Niranjana Vishwanathapura wrote:
> On Tue, Jun 07, 2022 at 11:42:08AM +0100, Tvrtko Ursulin wrote:
>>
>> On 03/06/2022 07:53, Niranjana Vishwanathapura wrote:
>>> On Wed, Jun 01, 2022 at 10:08:35PM -0700, Niranjana Vishwanathapura 
>>> wrote:
>>>> On Wed, Jun 01, 2022 at 11:27:17AM +0200, Daniel Vetter wrote:
>>>>> On Wed, 1 Jun 2022 at 11:03, Dave Airlie <airlied@gmail.com> wrote:
>>>>>>
>>>>>> On Tue, 24 May 2022 at 05:20, Niranjana Vishwanathapura
>>>>>> <niranjana.vishwanathapura@intel.com> wrote:
>>>>>>>
>>>>>>> On Thu, May 19, 2022 at 04:07:30PM -0700, Zanoni, Paulo R wrote:
>>>>>>>> On Tue, 2022-05-17 at 11:32 -0700, Niranjana Vishwanathapura wrote:
>>>>>>>>> VM_BIND and related uapi definitions
>>>>>>>>>
>>>>>>>>> v2: Ensure proper kernel-doc formatting with cross references.
>>>>>>>>>      Also add new uapi and documentation as per review comments
>>>>>>>>>      from Daniel.
>>>>>>>>>
>>>>>>>>> Signed-off-by: Niranjana Vishwanathapura 
>>>>>>> <niranjana.vishwanathapura@intel.com>
>>>>>>>>> ---
>>>>>>>>>   Documentation/gpu/rfc/i915_vm_bind.h | 399 
>>>>>>> +++++++++++++++++++++++++++
>>>>>>>>>   1 file changed, 399 insertions(+)
>>>>>>>>>   create mode 100644 Documentation/gpu/rfc/i915_vm_bind.h
>>>>>>>>>
>>>>>>>>> diff --git a/Documentation/gpu/rfc/i915_vm_bind.h 
>>>>>>> b/Documentation/gpu/rfc/i915_vm_bind.h
>>>>>>>>> new file mode 100644
>>>>>>>>> index 000000000000..589c0a009107
>>>>>>>>> --- /dev/null
>>>>>>>>> +++ b/Documentation/gpu/rfc/i915_vm_bind.h
>>>>>>>>> @@ -0,0 +1,399 @@
>>>>>>>>> +/* SPDX-License-Identifier: MIT */
>>>>>>>>> +/*
>>>>>>>>> + * Copyright © 2022 Intel Corporation
>>>>>>>>> + */
>>>>>>>>> +
>>>>>>>>> +/**
>>>>>>>>> + * DOC: I915_PARAM_HAS_VM_BIND
>>>>>>>>> + *
>>>>>>>>> + * VM_BIND feature availability.
>>>>>>>>> + * See typedef drm_i915_getparam_t param.
>>>>>>>>> + */
>>>>>>>>> +#define I915_PARAM_HAS_VM_BIND               57
>>>>>>>>> +
>>>>>>>>> +/**
>>>>>>>>> + * DOC: I915_VM_CREATE_FLAGS_USE_VM_BIND
>>>>>>>>> + *
>>>>>>>>> + * Flag to opt-in for VM_BIND mode of binding during VM creation.
>>>>>>>>> + * See struct drm_i915_gem_vm_control flags.
>>>>>>>>> + *
>>>>>>>>> + * A VM in VM_BIND mode will not support the older 
>>>>>>> execbuff mode of binding.
>>>>>>>>> + * In VM_BIND mode, execbuff ioctl will not accept any 
>>>>>>> execlist (ie., the
>>>>>>>>> + * &drm_i915_gem_execbuffer2.buffer_count must be 0).
>>>>>>>>> + * Also, &drm_i915_gem_execbuffer2.batch_start_offset and
>>>>>>>>> + * &drm_i915_gem_execbuffer2.batch_len must be 0.
>>>>>>>>> + * DRM_I915_GEM_EXECBUFFER_EXT_BATCH_ADDRESSES extension 
>>>>>>> must be provided
>>>>>>>>> + * to pass in the batch buffer addresses.
>>>>>>>>> + *
>>>>>>>>> + * Additionally, I915_EXEC_NO_RELOC, I915_EXEC_HANDLE_LUT and
>>>>>>>>> + * I915_EXEC_BATCH_FIRST of 
>>>>>>> &drm_i915_gem_execbuffer2.flags must be 0
>>>>>>>>> + * (not used) in VM_BIND mode. I915_EXEC_USE_EXTENSIONS 
>>>>>>> flag must always be
>>>>>>>>> + * set (See struct drm_i915_gem_execbuffer_ext_batch_addresses).
>>>>>>>>> + * The buffers_ptr, buffer_count, batch_start_offset and 
>>>>>>> batch_len fields
>>>>>>>>> + * of struct drm_i915_gem_execbuffer2 are also not used 
>>>>>>> and must be 0.
>>>>>>>>> + */
>>>>>>>>
>>>>>>>> From that description, it seems we have:
>>>>>>>>
>>>>>>>> struct drm_i915_gem_execbuffer2 {
>>>>>>>>         __u64 buffers_ptr;              -> must be 0 (new)
>>>>>>>>         __u32 buffer_count;             -> must be 0 (new)
>>>>>>>>         __u32 batch_start_offset;       -> must be 0 (new)
>>>>>>>>         __u32 batch_len;                -> must be 0 (new)
>>>>>>>>         __u32 DR1;                      -> must be 0 (old)
>>>>>>>>         __u32 DR4;                      -> must be 0 (old)
>>>>>>>>         __u32 num_cliprects; (fences)   -> must be 0 since 
>>>>>>> using extensions
>>>>>>>>         __u64 cliprects_ptr; (fences, extensions) -> 
>>>>>>> contains an actual pointer!
>>>>>>>>         __u64 flags;                    -> some flags must be 0 
>>>>>>>> (new)
>>>>>>>>         __u64 rsvd1; (context info)     -> repurposed field (old)
>>>>>>>>         __u64 rsvd2;                    -> unused
>>>>>>>> };
>>>>>>>>
>>>>>>>> Based on that, why can't we just get drm_i915_gem_execbuffer3 
>>>>>>>> instead
>>>>>>>> of adding even more complexity to an already abused interface? 
>>>>>>>> While
>>>>>>>> the Vulkan-like extension thing is really nice, I don't think what
>>>>>>>> we're doing here is extending the ioctl usage, we're completely
>>>>>>>> changing how the base struct should be interpreted based on 
>>>>>>> how the VM
>>>>>>>> was created (which is an entirely different ioctl).
>>>>>>>>
>>>>>>>> From Rusty Russel's API Design grading, drm_i915_gem_execbuffer2 is
>>>>>>>> already at -6 without these changes. I think after vm_bind we'll 
>>>>>>>> need
>>>>>>>> to create a -11 entry just to deal with this ioctl.
>>>>>>>>
>>>>>>>
>>>>>>> The only change here is removing the execlist support for VM_BIND
>>>>>>> mode (other than natural extensions).
>>>>>>> Adding a new execbuffer3 was considered, but I think we need to 
>>>>>>> be careful
>>>>>>> with that as that goes beyond the VM_BIND support, including any 
>>>>>>> future
>>>>>>> requirements (as we don't want an execbuffer4 after VM_BIND).
>>>>>>
>>>>>> Why not? it's not like adding extensions here is really that 
>>>>>> different
>>>>>> than adding new ioctls.
>>>>>>
>>>>>> I definitely think this deserves an execbuffer3 without even
>>>>>> considering future requirements. Just to burn down the old
>>>>>> requirements and pointless fields.
>>>>>>
>>>>>> Make execbuffer3 be vm bind only, no relocs, no legacy bits, leave 
>>>>>> the
>>>>>> older sw on execbuf2 for ever.
>>>>>
>>>>> I guess another point in favour of execbuf3 would be that it's less
>>>>> midlayer. If we share the entry point then there's quite a few vfuncs
>>>>> needed to cleanly split out the vm_bind paths from the legacy
>>>>> reloc/softpin paths.
>>>>>
>>>>> If we invert this and do execbuf3, then there's the existing ioctl
>>>>> vfunc, and then we share code (where it even makes sense, probably
>>>>> request setup/submit need to be shared, anything else is probably
>>>>> cleaner to just copypaste) with the usual helper approach.
>>>>>
>>>>> Also that would guarantee that really none of the old concepts like
>>>>> i915_active on the vma or vma open counts and all that stuff leaks
>>>>> into the new vm_bind execbuf.
>>>>>
>>>>> Finally I also think that copypasting would make backporting easier,
>>>>> or at least more flexible, since it should make it easier to have the
>>>>> upstream vm_bind co-exist with all the other things we have. Without
>>>>> huge amounts of conflicts (or at least much less) that pushing a pile
>>>>> of vfuncs into the existing code would cause.
>>>>>
>>>>> So maybe we should do this?
>>>>
>>>> Thanks Dave, Daniel.
>>>> There are a few things that will be common between execbuf2 and
>>>> execbuf3, like request setup/submit (as you said), fence handling 
>>>> (timeline fences, fence array, composite fences), engine selection,
>>>> etc. Also, many of the 'flags' will be there in execbuf3 also (but
>>>> bit position will differ).
>>>> But I guess these should be fine as the suggestion here is to
>>>> copy-paste the execbuf code and have shared code where possible.
>>>> Besides, we can stop supporting some older features in execbuf3
>>>> (like the fence array in favor of the newer timeline fences), which will
>>>> further reduce common code.
>>>>
>>>> Ok, I will update this series by adding execbuf3 and send out soon.
>>>>
>>>
>>> Does this sound reasonable?
>>>
>>> struct drm_i915_gem_execbuffer3 {
>>>        __u32 ctx_id;        /* previously execbuffer2.rsvd1 */
>>>
>>>        __u32 batch_count;
>>>        __u64 batch_addr_ptr;    /* Pointer to an array of batch gpu 
>>> virtual addresses */
>>
>> Casual stumble upon..
>>
>> Alternatively you could embed N pointers to make life a bit easier for 
>> both userspace and kernel side. Yes, but then "N batch buffers should 
>> be enough for everyone" problem.. :)
>>
> 
> Thanks Tvrtko,
> Yes, hence the batch_addr_ptr.

Right, but then userspace has to allocate a separate buffer and the 
kernel has to fetch it with an extra copy_from_user instead of getting 
everything in a single one. Pros and cons of "this many batches should 
be enough for everyone" versus the extra operations.

Hmm.. for the common case of one batch - you could define the uapi to 
say that if batch_count is one then the pointer is the GPU VA of the 
batch itself, not a pointer to a userspace array of GPU VAs?
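
For illustration, the kernel side of that special case could look 
roughly like this (all names hypothetical, not actual i915 code):

static int eb3_get_batch_addresses(struct drm_i915_gem_execbuffer3 *args,
                                   u64 *addrs, u32 max_batches)
{
        if (args->batch_count == 1) {
                /* Single batch: the field holds the GPU VA directly. */
                addrs[0] = args->batch_addr_ptr;
                return 0;
        }

        if (args->batch_count == 0 || args->batch_count > max_batches)
                return -EINVAL;

        /* Multiple batches: the field points to a user array of GPU VAs. */
        if (copy_from_user(addrs, u64_to_user_ptr(args->batch_addr_ptr),
                           args->batch_count * sizeof(u64)))
                return -EFAULT;

        return 0;
}

That would save the extra copy_from_user for the overwhelmingly common 
single-batch submission, at the cost of overloading the meaning of 
batch_addr_ptr.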

Regards,

Tvrtko

>>>        __u64 flags;
>>> #define I915_EXEC3_RING_MASK              (0x3f)
>>> #define I915_EXEC3_DEFAULT                (0<<0)
>>> #define I915_EXEC3_RENDER                 (1<<0)
>>> #define I915_EXEC3_BSD                    (2<<0)
>>> #define I915_EXEC3_BLT                    (3<<0)
>>> #define I915_EXEC3_VEBOX                  (4<<0)
>>>
>>> #define I915_EXEC3_SECURE               (1<<6)
>>> #define I915_EXEC3_IS_PINNED            (1<<7)
>>>
>>> #define I915_EXEC3_BSD_SHIFT     (8)
>>> #define I915_EXEC3_BSD_MASK      (3 << I915_EXEC3_BSD_SHIFT)
>>> #define I915_EXEC3_BSD_DEFAULT   (0 << I915_EXEC3_BSD_SHIFT)
>>> #define I915_EXEC3_BSD_RING1     (1 << I915_EXEC3_BSD_SHIFT)
>>> #define I915_EXEC3_BSD_RING2     (2 << I915_EXEC3_BSD_SHIFT)
>>
>> I'd suggest legacy engine selection is unwanted, especially with the 
>> convoluted BSD1/2 flags. Can we just require a context with an engine 
>> map and an index? Or if the default context has to be supported then 
>> I'd suggest ...class_instance for that mode.
>>
> 
> Ok, I will be happy to remove it and only support contexts with
> engine map, if UMDs agree on that.
> 
>>> #define I915_EXEC3_FENCE_IN             (1<<10)
>>> #define I915_EXEC3_FENCE_OUT            (1<<11)
>>> #define I915_EXEC3_FENCE_SUBMIT         (1<<12)
>>
>> People are likely to object to submit fence since generic mechanism to 
>> align submissions was rejected.
>>
> 
> Ok, again, I can remove it if UMDs are ok with it.
> 
>>>
>>>        __u64 in_out_fence;        /* previously execbuffer2.rsvd2 */
>>
>> New ioctl you can afford dedicated fields.
>>
> 
> Yes, but as I asked below, I am not sure if we need this or whether
> the timeline fence array extension we have is good enough.
> 
>> In any case I suggest you involve UMD folks in designing it.
>>
> 
> Yah.
> Paulo, Lionel, Jason, Daniel, can you comment on these regarding
> what will UMD need in execbuf3 and what can be removed?
> 
> Thanks,
> Niranjana
> 
>> Regards,
>>
>> Tvrtko
>>
>>>
>>>        __u64 extensions;        /* currently only for 
>>> DRM_I915_GEM_EXECBUFFER_EXT_TIMELINE_FENCES */
>>> };
>>>
>>> With this, user can pass in batch addresses and count directly,
>>> instead of as an extension (as this rfc series was proposing).
>>>
>>> I have removed many of the flags which were either legacy or not
>>> applicable to VM_BIND mode.
>>> I have also removed fence array support (execbuffer2.cliprects_ptr)
>>> as we have timeline fence array support. Is that fine?
>>> Do we still need FENCE_IN/FENCE_OUT/FENCE_SUBMIT support?
>>>
>>> Anything else that needs to be added or removed?
>>>
>>> Niranjana
>>>
>>>> Niranjana
>>>>
>>>>> -Daniel
>>>>> -- 
>>>>> Daniel Vetter
>>>>> Software Engineer, Intel Corporation
>>>>> http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [Intel-gfx] [RFC v3 3/3] drm/doc/rfc: VM_BIND uapi definition
  2022-06-08  6:40               ` Lionel Landwerlin
  2022-06-08  6:43                 ` Lionel Landwerlin
@ 2022-06-08  8:36                 ` Tvrtko Ursulin
  2022-06-08  8:45                   ` Lionel Landwerlin
  1 sibling, 1 reply; 121+ messages in thread
From: Tvrtko Ursulin @ 2022-06-08  8:36 UTC (permalink / raw)
  To: Lionel Landwerlin, Niranjana Vishwanathapura, Daniel Vetter
  Cc: Zanoni, Paulo R, intel-gfx, dri-devel, Hellstrom, Thomas, Wilson,
	Chris P, Vetter, Daniel, christian.koenig


On 08/06/2022 07:40, Lionel Landwerlin wrote:
> On 03/06/2022 09:53, Niranjana Vishwanathapura wrote:
>> On Wed, Jun 01, 2022 at 10:08:35PM -0700, Niranjana Vishwanathapura 
>> wrote:
>>> On Wed, Jun 01, 2022 at 11:27:17AM +0200, Daniel Vetter wrote:
>>>> On Wed, 1 Jun 2022 at 11:03, Dave Airlie <airlied@gmail.com> wrote:
>>>>>
>>>>> On Tue, 24 May 2022 at 05:20, Niranjana Vishwanathapura
>>>>> <niranjana.vishwanathapura@intel.com> wrote:
>>>>>>
>>>>>> On Thu, May 19, 2022 at 04:07:30PM -0700, Zanoni, Paulo R wrote:
>>>>>> >On Tue, 2022-05-17 at 11:32 -0700, Niranjana Vishwanathapura wrote:
>>>>>> >> VM_BIND and related uapi definitions
>>>>>> >>
>>>>>> >> v2: Ensure proper kernel-doc formatting with cross references.
>>>>>> >>     Also add new uapi and documentation as per review comments
>>>>>> >>     from Daniel.
>>>>>> >>
>>>>>> >> Signed-off-by: Niranjana Vishwanathapura 
>>>>>> <niranjana.vishwanathapura@intel.com>
>>>>>> >> ---
>>>>>> >>  Documentation/gpu/rfc/i915_vm_bind.h | 399 
>>>>>> +++++++++++++++++++++++++++
>>>>>> >>  1 file changed, 399 insertions(+)
>>>>>> >>  create mode 100644 Documentation/gpu/rfc/i915_vm_bind.h
>>>>>> >>
>>>>>> >> diff --git a/Documentation/gpu/rfc/i915_vm_bind.h 
>>>>>> b/Documentation/gpu/rfc/i915_vm_bind.h
>>>>>> >> new file mode 100644
>>>>>> >> index 000000000000..589c0a009107
>>>>>> >> --- /dev/null
>>>>>> >> +++ b/Documentation/gpu/rfc/i915_vm_bind.h
>>>>>> >> @@ -0,0 +1,399 @@
>>>>>> >> +/* SPDX-License-Identifier: MIT */
>>>>>> >> +/*
>>>>>> >> + * Copyright © 2022 Intel Corporation
>>>>>> >> + */
>>>>>> >> +
>>>>>> >> +/**
>>>>>> >> + * DOC: I915_PARAM_HAS_VM_BIND
>>>>>> >> + *
>>>>>> >> + * VM_BIND feature availability.
>>>>>> >> + * See typedef drm_i915_getparam_t param.
>>>>>> >> + */
>>>>>> >> +#define I915_PARAM_HAS_VM_BIND               57
>>>>>> >> +
>>>>>> >> +/**
>>>>>> >> + * DOC: I915_VM_CREATE_FLAGS_USE_VM_BIND
>>>>>> >> + *
>>>>>> >> + * Flag to opt-in for VM_BIND mode of binding during VM creation.
>>>>>> >> + * See struct drm_i915_gem_vm_control flags.
>>>>>> >> + *
>>>>>> >> + * A VM in VM_BIND mode will not support the older execbuff 
>>>>>> mode of binding.
>>>>>> >> + * In VM_BIND mode, execbuff ioctl will not accept any 
>>>>>> execlist (ie., the
>>>>>> >> + * &drm_i915_gem_execbuffer2.buffer_count must be 0).
>>>>>> >> + * Also, &drm_i915_gem_execbuffer2.batch_start_offset and
>>>>>> >> + * &drm_i915_gem_execbuffer2.batch_len must be 0.
>>>>>> >> + * DRM_I915_GEM_EXECBUFFER_EXT_BATCH_ADDRESSES extension must 
>>>>>> be provided
>>>>>> >> + * to pass in the batch buffer addresses.
>>>>>> >> + *
>>>>>> >> + * Additionally, I915_EXEC_NO_RELOC, I915_EXEC_HANDLE_LUT and
>>>>>> >> + * I915_EXEC_BATCH_FIRST of &drm_i915_gem_execbuffer2.flags 
>>>>>> must be 0
>>>>>> >> + * (not used) in VM_BIND mode. I915_EXEC_USE_EXTENSIONS flag 
>>>>>> must always be
>>>>>> >> + * set (See struct drm_i915_gem_execbuffer_ext_batch_addresses).
>>>>>> >> + * The buffers_ptr, buffer_count, batch_start_offset and 
>>>>>> batch_len fields
>>>>>> >> + * of struct drm_i915_gem_execbuffer2 are also not used and 
>>>>>> must be 0.
>>>>>> >> + */
>>>>>> >
>>>>>> >From that description, it seems we have:
>>>>>> >
>>>>>> >struct drm_i915_gem_execbuffer2 {
>>>>>> >        __u64 buffers_ptr;              -> must be 0 (new)
>>>>>> >        __u32 buffer_count;             -> must be 0 (new)
>>>>>> >        __u32 batch_start_offset;       -> must be 0 (new)
>>>>>> >        __u32 batch_len;                -> must be 0 (new)
>>>>>> >        __u32 DR1;                      -> must be 0 (old)
>>>>>> >        __u32 DR4;                      -> must be 0 (old)
>>>>>> >        __u32 num_cliprects; (fences)   -> must be 0 since using 
>>>>>> extensions
>>>>>> >        __u64 cliprects_ptr; (fences, extensions) -> contains an 
>>>>>> actual pointer!
>>>>>> >        __u64 flags;                    -> some flags must be 0 
>>>>>> (new)
>>>>>> >        __u64 rsvd1; (context info)     -> repurposed field (old)
>>>>>> >        __u64 rsvd2;                    -> unused
>>>>>> >};
>>>>>> >
>>>>>> >Based on that, why can't we just get drm_i915_gem_execbuffer3 
>>>>>> instead
>>>>>> >of adding even more complexity to an already abused interface? While
>>>>>> >the Vulkan-like extension thing is really nice, I don't think what
>>>>>> >we're doing here is extending the ioctl usage, we're completely
>>>>>> >changing how the base struct should be interpreted based on how 
>>>>>> the VM
>>>>>> >was created (which is an entirely different ioctl).
>>>>>> >
>>>>>> >From Rusty Russel's API Design grading, drm_i915_gem_execbuffer2 is
>>>>>> >already at -6 without these changes. I think after vm_bind we'll 
>>>>>> need
>>>>>> >to create a -11 entry just to deal with this ioctl.
>>>>>> >
>>>>>>
>>>>>> The only change here is removing the execlist support for VM_BIND
>>>>>> mode (other than natural extensions).
>>>>>> Adding a new execbuffer3 was considered, but I think we need to be 
>>>>>> careful
>>>>>> with that as that goes beyond the VM_BIND support, including any 
>>>>>> future
>>>>>> requirements (as we don't want an execbuffer4 after VM_BIND).
>>>>>
>>>>> Why not? it's not like adding extensions here is really that different
>>>>> than adding new ioctls.
>>>>>
>>>>> I definitely think this deserves an execbuffer3 without even
>>>>> considering future requirements. Just to burn down the old
>>>>> requirements and pointless fields.
>>>>>
>>>>> Make execbuffer3 be vm bind only, no relocs, no legacy bits, leave the
>>>>> older sw on execbuf2 for ever.
>>>>
>>>> I guess another point in favour of execbuf3 would be that it's less
>>>> midlayer. If we share the entry point then there's quite a few vfuncs
>>>> needed to cleanly split out the vm_bind paths from the legacy
>>>> reloc/softpin paths.
>>>>
>>>> If we invert this and do execbuf3, then there's the existing ioctl
>>>> vfunc, and then we share code (where it even makes sense, probably
>>>> request setup/submit need to be shared, anything else is probably
>>>> cleaner to just copypaste) with the usual helper approach.
>>>>
>>>> Also that would guarantee that really none of the old concepts like
>>>> i915_active on the vma or vma open counts and all that stuff leaks
>>>> into the new vm_bind execbuf.
>>>>
>>>> Finally I also think that copypasting would make backporting easier,
>>>> or at least more flexible, since it should make it easier to have the
>>>> upstream vm_bind co-exist with all the other things we have. Without
>>>> huge amounts of conflicts (or at least much less) that pushing a pile
>>>> of vfuncs into the existing code would cause.
>>>>
>>>> So maybe we should do this?
>>>
>>> Thanks Dave, Daniel.
>>> There are a few things that will be common between execbuf2 and
>>> execbuf3, like request setup/submit (as you said), fence handling 
>>> (timeline fences, fence array, composite fences), engine selection,
>>> etc. Also, many of the 'flags' will be there in execbuf3 also (but
>>> bit position will differ).
>>> But I guess these should be fine as the suggestion here is to
>>> copy-paste the execbuf code and have shared code where possible.
>>> Besides, we can stop supporting some older features in execbuf3
>>> (like the fence array in favor of the newer timeline fences), which will
>>> further reduce common code.
>>>
>>> Ok, I will update this series by adding execbuf3 and send out soon.
>>>
>>
>> Does this sound reasonable?
> 
> 
> Thanks for proposing this. Some comments below.
> 
> 
>>
>> struct drm_i915_gem_execbuffer3 {
>>        __u32 ctx_id;        /* previously execbuffer2.rsvd1 */
>>
>>        __u32 batch_count;
>>        __u64 batch_addr_ptr;    /* Pointer to an array of batch gpu 
>> virtual addresses */
>>
>>        __u64 flags;
>> #define I915_EXEC3_RING_MASK              (0x3f)
>> #define I915_EXEC3_DEFAULT                (0<<0)
>> #define I915_EXEC3_RENDER                 (1<<0)
>> #define I915_EXEC3_BSD                    (2<<0)
>> #define I915_EXEC3_BLT                    (3<<0)
>> #define I915_EXEC3_VEBOX                  (4<<0)
> 
> 
> Shouldn't we use the new engine selection uAPI instead?
> 
> We can already create an engine map with I915_CONTEXT_PARAM_ENGINES in 
> drm_i915_gem_context_create_ext_setparam.
> 
> And you can also create virtual engines with the same extension.
> 
> It feels like this could be a single u32 with the engine index (in the 
> context engine map).

Yes I said the same yesterday.

Also note that as you can no longer set engines on a default context, 
the question is whether userspace cares to use execbuf3 with it (the 
default context).

If it does, it will need an alternative engine selection for that case. 
I was proposing class:instance rather than the legacy cumbersome flags.

If it does not, I mean if the decision is to only allow execbuf3 with 
engine maps, then it leaves the default context a waste of kernel memory 
in the execbuf3 future. :( Don't know what to do there..
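
For reference, the engine map Lionel mentions is already there in the 
existing uapi; userspace sets it up roughly like this (sketch against 
i915_drm.h, error handling omitted):

#include <stdint.h>
#include <sys/ioctl.h>
#include <drm/i915_drm.h>

static uint32_t create_ctx_with_engine_map(int fd)
{
        /* Two fixed engines; an execbuf3 engine index would then just
         * pick a slot from this map (0 = render, 1 = copy here). */
        I915_DEFINE_CONTEXT_PARAM_ENGINES(engines, 2) = {
                .engines = {
                        { I915_ENGINE_CLASS_RENDER, 0 },
                        { I915_ENGINE_CLASS_COPY, 0 },
                },
        };
        struct drm_i915_gem_context_create_ext_setparam p_engines = {
                .base = { .name = I915_CONTEXT_CREATE_EXT_SETPARAM },
                .param = {
                        .param = I915_CONTEXT_PARAM_ENGINES,
                        .value = (uintptr_t)&engines,
                        .size = sizeof(engines),
                },
        };
        struct drm_i915_gem_context_create_ext create = {
                .flags = I915_CONTEXT_CREATE_FLAGS_USE_EXTENSIONS,
                .extensions = (uintptr_t)&p_engines,
        };

        ioctl(fd, DRM_IOCTL_I915_GEM_CONTEXT_CREATE_EXT, &create);
        return create.ctx_id;
}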

Regards,

Tvrtko

> 
> 
>>
>> #define I915_EXEC3_SECURE               (1<<6)
>> #define I915_EXEC3_IS_PINNED            (1<<7)
> 
> 
> What's the meaning of PINNED?
> 
> 
>>
>> #define I915_EXEC3_BSD_SHIFT     (8)
>> #define I915_EXEC3_BSD_MASK      (3 << I915_EXEC3_BSD_SHIFT)
>> #define I915_EXEC3_BSD_DEFAULT   (0 << I915_EXEC3_BSD_SHIFT)
>> #define I915_EXEC3_BSD_RING1     (1 << I915_EXEC3_BSD_SHIFT)
>> #define I915_EXEC3_BSD_RING2     (2 << I915_EXEC3_BSD_SHIFT)
>>
>> #define I915_EXEC3_FENCE_IN             (1<<10)
>> #define I915_EXEC3_FENCE_OUT            (1<<11)
> 
> 
> For Mesa, as soon as we have DRM_I915_GEM_EXECBUFFER_EXT_TIMELINE_FENCES 
> support, we only use that.
> 
> So there isn't much point for FENCE_IN/OUT.
> 
> Maybe check with other UMDs?
> 
> 
>> #define I915_EXEC3_FENCE_SUBMIT         (1<<12)
> 
> 
> What's FENCE_SUBMIT?
> 
> 
>>
>>        __u64 in_out_fence;        /* previously execbuffer2.rsvd2 */
>>
>>        __u64 extensions;        /* currently only for 
>> DRM_I915_GEM_EXECBUFFER_EXT_TIMELINE_FENCES */
>> };
>>
>> With this, user can pass in batch addresses and count directly,
>> instead of as an extension (as this rfc series was proposing).
>>
>> I have removed many of the flags which were either legacy or not
>> applicable to VM_BIND mode.
>> I have also removed fence array support (execbuffer2.cliprects_ptr)
>> as we have timeline fence array support. Is that fine?
>> Do we still need FENCE_IN/FENCE_OUT/FENCE_SUBMIT support?
>>
>> Anything else that needs to be added or removed?
>>
>> Niranjana
>>
>>> Niranjana
>>>
>>>> -Daniel
>>>> -- 
>>>> Daniel Vetter
>>>> Software Engineer, Intel Corporation
>>>> http://blog.ffwll.ch
> 
> 

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [Intel-gfx] [RFC v3 3/3] drm/doc/rfc: VM_BIND uapi definition
  2022-06-08  8:36                 ` Tvrtko Ursulin
@ 2022-06-08  8:45                   ` Lionel Landwerlin
  2022-06-08  8:54                     ` Tvrtko Ursulin
  0 siblings, 1 reply; 121+ messages in thread
From: Lionel Landwerlin @ 2022-06-08  8:45 UTC (permalink / raw)
  To: Tvrtko Ursulin, Niranjana Vishwanathapura, Daniel Vetter
  Cc: Zanoni, Paulo R, intel-gfx, dri-devel, Hellstrom, Thomas, Wilson,
	Chris P, Vetter, Daniel, christian.koenig

On 08/06/2022 11:36, Tvrtko Ursulin wrote:
>
> On 08/06/2022 07:40, Lionel Landwerlin wrote:
>> On 03/06/2022 09:53, Niranjana Vishwanathapura wrote:
>>> On Wed, Jun 01, 2022 at 10:08:35PM -0700, Niranjana Vishwanathapura 
>>> wrote:
>>>> On Wed, Jun 01, 2022 at 11:27:17AM +0200, Daniel Vetter wrote:
>>>>> On Wed, 1 Jun 2022 at 11:03, Dave Airlie <airlied@gmail.com> wrote:
>>>>>>
>>>>>> On Tue, 24 May 2022 at 05:20, Niranjana Vishwanathapura
>>>>>> <niranjana.vishwanathapura@intel.com> wrote:
>>>>>>>
>>>>>>> On Thu, May 19, 2022 at 04:07:30PM -0700, Zanoni, Paulo R wrote:
>>>>>>> >On Tue, 2022-05-17 at 11:32 -0700, Niranjana Vishwanathapura 
>>>>>>> wrote:
>>>>>>> >> VM_BIND and related uapi definitions
>>>>>>> >>
>>>>>>> >> v2: Ensure proper kernel-doc formatting with cross references.
>>>>>>> >>     Also add new uapi and documentation as per review comments
>>>>>>> >>     from Daniel.
>>>>>>> >>
>>>>>>> >> Signed-off-by: Niranjana Vishwanathapura 
>>>>>>> <niranjana.vishwanathapura@intel.com>
>>>>>>> >> ---
>>>>>>> >>  Documentation/gpu/rfc/i915_vm_bind.h | 399 
>>>>>>> +++++++++++++++++++++++++++
>>>>>>> >>  1 file changed, 399 insertions(+)
>>>>>>> >>  create mode 100644 Documentation/gpu/rfc/i915_vm_bind.h
>>>>>>> >>
>>>>>>> >> diff --git a/Documentation/gpu/rfc/i915_vm_bind.h 
>>>>>>> b/Documentation/gpu/rfc/i915_vm_bind.h
>>>>>>> >> new file mode 100644
>>>>>>> >> index 000000000000..589c0a009107
>>>>>>> >> --- /dev/null
>>>>>>> >> +++ b/Documentation/gpu/rfc/i915_vm_bind.h
>>>>>>> >> @@ -0,0 +1,399 @@
>>>>>>> >> +/* SPDX-License-Identifier: MIT */
>>>>>>> >> +/*
>>>>>>> >> + * Copyright © 2022 Intel Corporation
>>>>>>> >> + */
>>>>>>> >> +
>>>>>>> >> +/**
>>>>>>> >> + * DOC: I915_PARAM_HAS_VM_BIND
>>>>>>> >> + *
>>>>>>> >> + * VM_BIND feature availability.
>>>>>>> >> + * See typedef drm_i915_getparam_t param.
>>>>>>> >> + */
>>>>>>> >> +#define I915_PARAM_HAS_VM_BIND 57
>>>>>>> >> +
>>>>>>> >> +/**
>>>>>>> >> + * DOC: I915_VM_CREATE_FLAGS_USE_VM_BIND
>>>>>>> >> + *
>>>>>>> >> + * Flag to opt-in for VM_BIND mode of binding during VM 
>>>>>>> creation.
>>>>>>> >> + * See struct drm_i915_gem_vm_control flags.
>>>>>>> >> + *
>>>>>>> >> + * A VM in VM_BIND mode will not support the older execbuff 
>>>>>>> mode of binding.
>>>>>>> >> + * In VM_BIND mode, execbuff ioctl will not accept any 
>>>>>>> execlist (ie., the
>>>>>>> >> + * &drm_i915_gem_execbuffer2.buffer_count must be 0).
>>>>>>> >> + * Also, &drm_i915_gem_execbuffer2.batch_start_offset and
>>>>>>> >> + * &drm_i915_gem_execbuffer2.batch_len must be 0.
>>>>>>> >> + * DRM_I915_GEM_EXECBUFFER_EXT_BATCH_ADDRESSES extension 
>>>>>>> must be provided
>>>>>>> >> + * to pass in the batch buffer addresses.
>>>>>>> >> + *
>>>>>>> >> + * Additionally, I915_EXEC_NO_RELOC, I915_EXEC_HANDLE_LUT and
>>>>>>> >> + * I915_EXEC_BATCH_FIRST of &drm_i915_gem_execbuffer2.flags 
>>>>>>> must be 0
>>>>>>> >> + * (not used) in VM_BIND mode. I915_EXEC_USE_EXTENSIONS flag 
>>>>>>> must always be
>>>>>>> >> + * set (See struct 
>>>>>>> drm_i915_gem_execbuffer_ext_batch_addresses).
>>>>>>> >> + * The buffers_ptr, buffer_count, batch_start_offset and 
>>>>>>> batch_len fields
>>>>>>> >> + * of struct drm_i915_gem_execbuffer2 are also not used and 
>>>>>>> must be 0.
>>>>>>> >> + */
>>>>>>> >
>>>>>>> >From that description, it seems we have:
>>>>>>> >
>>>>>>> >struct drm_i915_gem_execbuffer2 {
>>>>>>> >        __u64 buffers_ptr;              -> must be 0 (new)
>>>>>>> >        __u32 buffer_count;             -> must be 0 (new)
>>>>>>> >        __u32 batch_start_offset;       -> must be 0 (new)
>>>>>>> >        __u32 batch_len;                -> must be 0 (new)
>>>>>>> >        __u32 DR1;                      -> must be 0 (old)
>>>>>>> >        __u32 DR4;                      -> must be 0 (old)
>>>>>>> >        __u32 num_cliprects; (fences)   -> must be 0 since 
>>>>>>> using extensions
>>>>>>> >        __u64 cliprects_ptr; (fences, extensions) -> contains 
>>>>>>> an actual pointer!
>>>>>>> >        __u64 flags;                    -> some flags must be 0 
>>>>>>> (new)
>>>>>>> >        __u64 rsvd1; (context info)     -> repurposed field (old)
>>>>>>> >        __u64 rsvd2;                    -> unused
>>>>>>> >};
>>>>>>> >
>>>>>>> >Based on that, why can't we just get drm_i915_gem_execbuffer3 
>>>>>>> instead
>>>>>>> >of adding even more complexity to an already abused interface? 
>>>>>>> While
>>>>>>> >the Vulkan-like extension thing is really nice, I don't think what
>>>>>>> >we're doing here is extending the ioctl usage, we're completely
>>>>>>> >changing how the base struct should be interpreted based on how 
>>>>>>> the VM
>>>>>>> >was created (which is an entirely different ioctl).
>>>>>>> >
>>>>>>> >From Rusty Russel's API Design grading, 
>>>>>>> drm_i915_gem_execbuffer2 is
>>>>>>> >already at -6 without these changes. I think after vm_bind 
>>>>>>> we'll need
>>>>>>> >to create a -11 entry just to deal with this ioctl.
>>>>>>> >
>>>>>>>
>>>>>>> The only change here is removing the execlist support for VM_BIND
>>>>>>> mode (other than natural extensions).
>>>>>>> Adding a new execbuffer3 was considered, but I think we need to 
>>>>>>> be careful
>>>>>>> with that as that goes beyond the VM_BIND support, including any 
>>>>>>> future
>>>>>>> requirements (as we don't want an execbuffer4 after VM_BIND).
>>>>>>
>>>>>> Why not? it's not like adding extensions here is really that 
>>>>>> different
>>>>>> than adding new ioctls.
>>>>>>
>>>>>> I definitely think this deserves an execbuffer3 without even
>>>>>> considering future requirements. Just to burn down the old
>>>>>> requirements and pointless fields.
>>>>>>
>>>>>> Make execbuffer3 be vm bind only, no relocs, no legacy bits, 
>>>>>> leave the
>>>>>> older sw on execbuf2 for ever.
>>>>>
>>>>> I guess another point in favour of execbuf3 would be that it's less
>>>>> midlayer. If we share the entry point then there's quite a few vfuncs
>>>>> needed to cleanly split out the vm_bind paths from the legacy
>>>>> reloc/softpin paths.
>>>>>
>>>>> If we invert this and do execbuf3, then there's the existing ioctl
>>>>> vfunc, and then we share code (where it even makes sense, probably
>>>>> request setup/submit need to be shared, anything else is probably
>>>>> cleaner to just copypaste) with the usual helper approach.
>>>>>
>>>>> Also that would guarantee that really none of the old concepts like
>>>>> i915_active on the vma or vma open counts and all that stuff leaks
>>>>> into the new vm_bind execbuf.
>>>>>
>>>>> Finally I also think that copypasting would make backporting easier,
>>>>> or at least more flexible, since it should make it easier to have the
>>>>> upstream vm_bind co-exist with all the other things we have. Without
>>>>> huge amounts of conflicts (or at least much less) that pushing a pile
>>>>> of vfuncs into the existing code would cause.
>>>>>
>>>>> So maybe we should do this?
>>>>
>>>> Thanks Dave, Daniel.
>>>> There are a few things that will be common between execbuf2 and
>>>> execbuf3, like request setup/submit (as you said), fence handling 
>>>> (timeline fences, fence array, composite fences), engine selection,
>>>> etc. Also, many of the 'flags' will be there in execbuf3 also (but
>>>> bit position will differ).
>>>> But I guess these should be fine as the suggestion here is to
>>>> copy-paste the execbuf code and have shared code where possible.
>>>> Besides, we can stop supporting some older features in execbuf3
>>>> (like the fence array in favor of the newer timeline fences), which will
>>>> further reduce common code.
>>>>
>>>> Ok, I will update this series by adding execbuf3 and send out soon.
>>>>
>>>
>>> Does this sound reasonable?
>>
>>
>> Thanks for proposing this. Some comments below.
>>
>>
>>>
>>> struct drm_i915_gem_execbuffer3 {
>>>        __u32 ctx_id;        /* previously execbuffer2.rsvd1 */
>>>
>>>        __u32 batch_count;
>>>        __u64 batch_addr_ptr;    /* Pointer to an array of batch gpu 
>>> virtual addresses */
>>>
>>>        __u64 flags;
>>> #define I915_EXEC3_RING_MASK              (0x3f)
>>> #define I915_EXEC3_DEFAULT                (0<<0)
>>> #define I915_EXEC3_RENDER                 (1<<0)
>>> #define I915_EXEC3_BSD                    (2<<0)
>>> #define I915_EXEC3_BLT                    (3<<0)
>>> #define I915_EXEC3_VEBOX                  (4<<0)
>>
>>
>> Shouldn't we use the new engine selection uAPI instead?
>>
>> We can already create an engine map with I915_CONTEXT_PARAM_ENGINES 
>> in drm_i915_gem_context_create_ext_setparam.
>>
>> And you can also create virtual engines with the same extension.
>>
>> It feels like this could be a single u32 with the engine index (in 
>> the context engine map).
>
> Yes I said the same yesterday.
>
> Also note that as you can no longer set engines on a default 
> context, the question is whether userspace cares to use execbuf3 with 
> it (the default context).
>
> If it does, it will need an alternative engine selection for that 
> case. I was proposing class:instance rather than the legacy cumbersome 
> flags.
>
> If it does not, I mean if the decision is to only allow execbuf3 with 
> engine maps, then it leaves the default context a waste of kernel 
> memory in the execbuf3 future. :( Don't know what to do there..
>
> Regards,
>
> Tvrtko


Thanks Tvrtko, I only saw your reply after responding.


Both Iris & Anv create a context with engines (if the kernel supports it): 
https://gitlab.freedesktop.org/mesa/mesa/-/blob/main/src/intel/common/intel_gem.c#L73

I think we should be fine with just a single engine id and we don't care 
about the default context.
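
Purely as a strawman, the struct could then shrink to something like 
this (hypothetical, not a concrete proposal):

/* Hypothetical shape with engine selection as a plain index into the
 * context engine map, instead of the legacy ring/BSD flag soup: */
struct drm_i915_gem_execbuffer3 {
        __u32 ctx_id;           /* context created with an engine map */
        __u32 engine_id;        /* index into the context engine map */

        __u32 batch_count;
        __u32 pad;              /* MBZ */
        __u64 batch_addr_ptr;   /* array of batch GPU VAs */

        __u64 flags;            /* no ring selection bits needed */
        __u64 extensions;       /* e.g. timeline fences */
};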


-Lionel


>
>>
>>
>>>
>>> #define I915_EXEC3_SECURE               (1<<6)
>>> #define I915_EXEC3_IS_PINNED            (1<<7)
>>
>>
>> What's the meaning of PINNED?
>>
>>
>>>
>>> #define I915_EXEC3_BSD_SHIFT     (8)
>>> #define I915_EXEC3_BSD_MASK      (3 << I915_EXEC3_BSD_SHIFT)
>>> #define I915_EXEC3_BSD_DEFAULT   (0 << I915_EXEC3_BSD_SHIFT)
>>> #define I915_EXEC3_BSD_RING1     (1 << I915_EXEC3_BSD_SHIFT)
>>> #define I915_EXEC3_BSD_RING2     (2 << I915_EXEC3_BSD_SHIFT)
>>>
>>> #define I915_EXEC3_FENCE_IN             (1<<10)
>>> #define I915_EXEC3_FENCE_OUT            (1<<11)
>>
>>
>> For Mesa, as soon as we have 
>> DRM_I915_GEM_EXECBUFFER_EXT_TIMELINE_FENCES support, we only use that.
>>
>> So there isn't much point for FENCE_IN/OUT.
>>
>> Maybe check with other UMDs?
>>
>>
>>> #define I915_EXEC3_FENCE_SUBMIT (1<<12)
>>
>>
>> What's FENCE_SUBMIT?
>>
>>
>>>
>>>        __u64 in_out_fence;        /* previously execbuffer2.rsvd2 */
>>>
>>>        __u64 extensions;        /* currently only for 
>>> DRM_I915_GEM_EXECBUFFER_EXT_TIMELINE_FENCES */
>>> };
>>>
>>> With this, user can pass in batch addresses and count directly,
>>> instead of as an extension (as this rfc series was proposing).
>>>
>>> I have removed many of the flags which were either legacy or not
>>> applicable to VM_BIND mode.
>>> I have also removed fence array support (execbuffer2.cliprects_ptr)
>>> as we have timeline fence array support. Is that fine?
>>> Do we still need FENCE_IN/FENCE_OUT/FENCE_SUBMIT support?
>>>
>>> Anything else that needs to be added or removed?
>>>
>>> Niranjana
>>>
>>>> Niranjana
>>>>
>>>>> -Daniel
>>>>> -- 
>>>>> Daniel Vetter
>>>>> Software Engineer, Intel Corporation
>>>>> http://blog.ffwll.ch
>>
>>


^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [Intel-gfx] [RFC v3 3/3] drm/doc/rfc: VM_BIND uapi definition
  2022-06-08  8:45                   ` Lionel Landwerlin
@ 2022-06-08  8:54                     ` Tvrtko Ursulin
  2022-06-08 20:45                         ` Niranjana Vishwanathapura
  0 siblings, 1 reply; 121+ messages in thread
From: Tvrtko Ursulin @ 2022-06-08  8:54 UTC (permalink / raw)
  To: Lionel Landwerlin, Niranjana Vishwanathapura, Daniel Vetter
  Cc: Zanoni, Paulo R, intel-gfx, dri-devel, Hellstrom, Thomas, Wilson,
	Chris P, Vetter, Daniel, christian.koenig


On 08/06/2022 09:45, Lionel Landwerlin wrote:
> On 08/06/2022 11:36, Tvrtko Ursulin wrote:
>>
>> On 08/06/2022 07:40, Lionel Landwerlin wrote:
>>> On 03/06/2022 09:53, Niranjana Vishwanathapura wrote:
>>>> On Wed, Jun 01, 2022 at 10:08:35PM -0700, Niranjana Vishwanathapura 
>>>> wrote:
>>>>> On Wed, Jun 01, 2022 at 11:27:17AM +0200, Daniel Vetter wrote:
>>>>>> On Wed, 1 Jun 2022 at 11:03, Dave Airlie <airlied@gmail.com> wrote:
>>>>>>>
>>>>>>> On Tue, 24 May 2022 at 05:20, Niranjana Vishwanathapura
>>>>>>> <niranjana.vishwanathapura@intel.com> wrote:
>>>>>>>>
>>>>>>>> On Thu, May 19, 2022 at 04:07:30PM -0700, Zanoni, Paulo R wrote:
>>>>>>>> >On Tue, 2022-05-17 at 11:32 -0700, Niranjana Vishwanathapura 
>>>>>>>> wrote:
>>>>>>>> >> VM_BIND and related uapi definitions
>>>>>>>> >>
>>>>>>>> >> v2: Ensure proper kernel-doc formatting with cross references.
>>>>>>>> >>     Also add new uapi and documentation as per review comments
>>>>>>>> >>     from Daniel.
>>>>>>>> >>
>>>>>>>> >> Signed-off-by: Niranjana Vishwanathapura 
>>>>>>>> <niranjana.vishwanathapura@intel.com>
>>>>>>>> >> ---
>>>>>>>> >>  Documentation/gpu/rfc/i915_vm_bind.h | 399 
>>>>>>>> +++++++++++++++++++++++++++
>>>>>>>> >>  1 file changed, 399 insertions(+)
>>>>>>>> >>  create mode 100644 Documentation/gpu/rfc/i915_vm_bind.h
>>>>>>>> >>
>>>>>>>> >> diff --git a/Documentation/gpu/rfc/i915_vm_bind.h 
>>>>>>>> b/Documentation/gpu/rfc/i915_vm_bind.h
>>>>>>>> >> new file mode 100644
>>>>>>>> >> index 000000000000..589c0a009107
>>>>>>>> >> --- /dev/null
>>>>>>>> >> +++ b/Documentation/gpu/rfc/i915_vm_bind.h
>>>>>>>> >> @@ -0,0 +1,399 @@
>>>>>>>> >> +/* SPDX-License-Identifier: MIT */
>>>>>>>> >> +/*
>>>>>>>> >> + * Copyright © 2022 Intel Corporation
>>>>>>>> >> + */
>>>>>>>> >> +
>>>>>>>> >> +/**
>>>>>>>> >> + * DOC: I915_PARAM_HAS_VM_BIND
>>>>>>>> >> + *
>>>>>>>> >> + * VM_BIND feature availability.
>>>>>>>> >> + * See typedef drm_i915_getparam_t param.
>>>>>>>> >> + */
>>>>>>>> >> +#define I915_PARAM_HAS_VM_BIND 57
>>>>>>>> >> +
>>>>>>>> >> +/**
>>>>>>>> >> + * DOC: I915_VM_CREATE_FLAGS_USE_VM_BIND
>>>>>>>> >> + *
>>>>>>>> >> + * Flag to opt-in for VM_BIND mode of binding during VM 
>>>>>>>> creation.
>>>>>>>> >> + * See struct drm_i915_gem_vm_control flags.
>>>>>>>> >> + *
>>>>>>>> >> + * A VM in VM_BIND mode will not support the older execbuff 
>>>>>>>> mode of binding.
>>>>>>>> >> + * In VM_BIND mode, execbuff ioctl will not accept any 
>>>>>>>> execlist (ie., the
>>>>>>>> >> + * &drm_i915_gem_execbuffer2.buffer_count must be 0).
>>>>>>>> >> + * Also, &drm_i915_gem_execbuffer2.batch_start_offset and
>>>>>>>> >> + * &drm_i915_gem_execbuffer2.batch_len must be 0.
>>>>>>>> >> + * DRM_I915_GEM_EXECBUFFER_EXT_BATCH_ADDRESSES extension 
>>>>>>>> must be provided
>>>>>>>> >> + * to pass in the batch buffer addresses.
>>>>>>>> >> + *
>>>>>>>> >> + * Additionally, I915_EXEC_NO_RELOC, I915_EXEC_HANDLE_LUT and
>>>>>>>> >> + * I915_EXEC_BATCH_FIRST of &drm_i915_gem_execbuffer2.flags 
>>>>>>>> must be 0
>>>>>>>> >> + * (not used) in VM_BIND mode. I915_EXEC_USE_EXTENSIONS flag 
>>>>>>>> must always be
>>>>>>>> >> + * set (See struct 
>>>>>>>> drm_i915_gem_execbuffer_ext_batch_addresses).
>>>>>>>> >> + * The buffers_ptr, buffer_count, batch_start_offset and 
>>>>>>>> batch_len fields
>>>>>>>> >> + * of struct drm_i915_gem_execbuffer2 are also not used and 
>>>>>>>> must be 0.
>>>>>>>> >> + */
>>>>>>>> >
>>>>>>>> >From that description, it seems we have:
>>>>>>>> >
>>>>>>>> >struct drm_i915_gem_execbuffer2 {
>>>>>>>> >        __u64 buffers_ptr;              -> must be 0 (new)
>>>>>>>> >        __u32 buffer_count;             -> must be 0 (new)
>>>>>>>> >        __u32 batch_start_offset;       -> must be 0 (new)
>>>>>>>> >        __u32 batch_len;                -> must be 0 (new)
>>>>>>>> >        __u32 DR1;                      -> must be 0 (old)
>>>>>>>> >        __u32 DR4;                      -> must be 0 (old)
>>>>>>>> >        __u32 num_cliprects; (fences)   -> must be 0 since 
>>>>>>>> using extensions
>>>>>>>> >        __u64 cliprects_ptr; (fences, extensions) -> contains 
>>>>>>>> an actual pointer!
>>>>>>>> >        __u64 flags;                    -> some flags must be 0 
>>>>>>>> (new)
>>>>>>>> >        __u64 rsvd1; (context info)     -> repurposed field (old)
>>>>>>>> >        __u64 rsvd2;                    -> unused
>>>>>>>> >};
>>>>>>>> >
>>>>>>>> >Based on that, why can't we just get drm_i915_gem_execbuffer3 
>>>>>>>> instead
>>>>>>>> >of adding even more complexity to an already abused interface? 
>>>>>>>> While
>>>>>>>> >the Vulkan-like extension thing is really nice, I don't think what
>>>>>>>> >we're doing here is extending the ioctl usage, we're completely
>>>>>>>> >changing how the base struct should be interpreted based on how 
>>>>>>>> the VM
>>>>>>>> >was created (which is an entirely different ioctl).
>>>>>>>> >
>>>>>>>> >From Rusty Russel's API Design grading, 
>>>>>>>> drm_i915_gem_execbuffer2 is
>>>>>>>> >already at -6 without these changes. I think after vm_bind 
>>>>>>>> we'll need
>>>>>>>> >to create a -11 entry just to deal with this ioctl.
>>>>>>>> >
>>>>>>>>
>>>>>>>> The only change here is removing the execlist support for VM_BIND
>>>>>>>> mode (other than natural extensions).
>>>>>>>> Adding a new execbuffer3 was considered, but I think we need to 
>>>>>>>> be careful
>>>>>>>> with that as that goes beyond the VM_BIND support, including any 
>>>>>>>> future
>>>>>>>> requirements (as we don't want an execbuffer4 after VM_BIND).
>>>>>>>
>>>>>>> Why not? it's not like adding extensions here is really that 
>>>>>>> different
>>>>>>> than adding new ioctls.
>>>>>>>
>>>>>>> I definitely think this deserves an execbuffer3 without even
>>>>>>> considering future requirements. Just to burn down the old
>>>>>>> requirements and pointless fields.
>>>>>>>
>>>>>>> Make execbuffer3 be vm bind only, no relocs, no legacy bits, 
>>>>>>> leave the
>>>>>>> older sw on execbuf2 for ever.
>>>>>>
>>>>>> I guess another point in favour of execbuf3 would be that it's less
>>>>>> midlayer. If we share the entry point then there's quite a few vfuncs
>>>>>> needed to cleanly split out the vm_bind paths from the legacy
>>>>>> reloc/softpin paths.
>>>>>>
>>>>>> If we invert this and do execbuf3, then there's the existing ioctl
>>>>>> vfunc, and then we share code (where it even makes sense, probably
>>>>>> request setup/submit need to be shared, anything else is probably
>>>>>> cleaner to just copypaste) with the usual helper approach.
>>>>>>
>>>>>> Also that would guarantee that really none of the old concepts like
>>>>>> i915_active on the vma or vma open counts and all that stuff leaks
>>>>>> into the new vm_bind execbuf.
>>>>>>
>>>>>> Finally I also think that copypasting would make backporting easier,
>>>>>> or at least more flexible, since it should make it easier to have the
>>>>>> upstream vm_bind co-exist with all the other things we have. Without
>>>>>> huge amounts of conflicts (or at least much less) that pushing a pile
>>>>>> of vfuncs into the existing code would cause.
>>>>>>
>>>>>> So maybe we should do this?
>>>>>
>>>>> Thanks Dave, Daniel.
>>>>> There are a few things that will be common between execbuf2 and
>>>>> execbuf3, like request setup/submit (as you said), fence handling 
>>>>> (timeline fences, fence array, composite fences), engine selection,
>>>>> etc. Also, many of the 'flags' will be there in execbuf3 also (but
>>>>> bit position will differ).
>>>>> But I guess these should be fine as the suggestion here is to
>>>>> copy-paste the execbuf code and have shared code where possible.
>>>>> Besides, we can stop supporting some older features in execbuf3
>>>>> (like the fence array in favor of the newer timeline fences), which will
>>>>> further reduce common code.
>>>>>
>>>>> Ok, I will update this series by adding execbuf3 and send out soon.
>>>>>
>>>>
>>>> Does this sound reasonable?
>>>
>>>
>>> Thanks for proposing this. Some comments below.
>>>
>>>
>>>>
>>>> struct drm_i915_gem_execbuffer3 {
>>>>        __u32 ctx_id;        /* previously execbuffer2.rsvd1 */
>>>>
>>>>        __u32 batch_count;
>>>>        __u64 batch_addr_ptr;    /* Pointer to an array of batch gpu 
>>>> virtual addresses */
>>>>
>>>>        __u64 flags;
>>>> #define I915_EXEC3_RING_MASK              (0x3f)
>>>> #define I915_EXEC3_DEFAULT                (0<<0)
>>>> #define I915_EXEC3_RENDER                 (1<<0)
>>>> #define I915_EXEC3_BSD                    (2<<0)
>>>> #define I915_EXEC3_BLT                    (3<<0)
>>>> #define I915_EXEC3_VEBOX                  (4<<0)
>>>
>>>
>>> Shouldn't we use the new engine selection uAPI instead?
>>>
>>> We can already create an engine map with I915_CONTEXT_PARAM_ENGINES 
>>> in drm_i915_gem_context_create_ext_setparam.
>>>
>>> And you can also create virtual engines with the same extension.
>>>
>>> It feels like this could be a single u32 with the engine index (in 
>>> the context engine map).
>>
>> Yes I said the same yesterday.
>>
>> Also note that as you can no longer set engines on a default 
>> context, the question is whether userspace cares to use execbuf3 with 
>> it (the default context).
>>
>> If it does, it will need an alternative engine selection for that 
>> case. I was proposing class:instance rather than the legacy cumbersome 
>> flags.
>>
>> If it does not, I mean if the decision is to only allow execbuf3 with 
>> engine maps, then it leaves the default context a waste of kernel 
>> memory in the execbuf3 future. :( Don't know what to do there..
>>
>> Regards,
>>
>> Tvrtko
> 
> 
> Thanks Tvrtko, I only saw your reply after responding.
> 
> 
> Both Iris & Anv create a context with engines (if the kernel supports it): 
> https://gitlab.freedesktop.org/mesa/mesa/-/blob/main/src/intel/common/intel_gem.c#L73 
> 
> 
> I think we should be fine with just a single engine id and we don't care 
> about the default context.

I wonder if in this case we could stop creating the default context 
starting from a future "gen"? Otherwise, with engine-map-only execbuf3 
and execbuf3-only userspace, it would serve no purpose apart from 
wasting kernel memory.

Regards,

Tvrtko

> 
> 
> -Lionel
> 
> 
>>
>>>
>>>
>>>>
>>>> #define I915_EXEC3_SECURE               (1<<6)
>>>> #define I915_EXEC3_IS_PINNED            (1<<7)
>>>
>>>
>>> What's the meaning of PINNED?
>>>
>>>
>>>>
>>>> #define I915_EXEC3_BSD_SHIFT     (8)
>>>> #define I915_EXEC3_BSD_MASK      (3 << I915_EXEC3_BSD_SHIFT)
>>>> #define I915_EXEC3_BSD_DEFAULT   (0 << I915_EXEC3_BSD_SHIFT)
>>>> #define I915_EXEC3_BSD_RING1     (1 << I915_EXEC3_BSD_SHIFT)
>>>> #define I915_EXEC3_BSD_RING2     (2 << I915_EXEC3_BSD_SHIFT)
>>>>
>>>> #define I915_EXEC3_FENCE_IN             (1<<10)
>>>> #define I915_EXEC3_FENCE_OUT            (1<<11)
>>>
>>>
>>> For Mesa, as soon as we have 
>>> DRM_I915_GEM_EXECBUFFER_EXT_TIMELINE_FENCES support, we only use that.
>>>
>>> So there isn't much point for FENCE_IN/OUT.
>>>
>>> Maybe check with other UMDs?
>>>
>>>
>>>> #define I915_EXEC3_FENCE_SUBMIT (1<<12)
>>>
>>>
>>> What's FENCE_SUBMIT?
>>>
>>>
>>>>
>>>>        __u64 in_out_fence;        /* previously execbuffer2.rsvd2 */
>>>>
>>>>        __u64 extensions;        /* currently only for 
>>>> DRM_I915_GEM_EXECBUFFER_EXT_TIMELINE_FENCES */
>>>> };
>>>>
>>>> With this, user can pass in batch addresses and count directly,
>>>> instead of as an extension (as this rfc series was proposing).
>>>>
>>>> I have removed many of the flags which were either legacy or not
>>>> applicable to VM_BIND mode.
>>>> I have also removed fence array support (execbuffer2.cliprects_ptr)
>>>> as we have timeline fence array support. Is that fine?
>>>> Do we still need FENCE_IN/FENCE_OUT/FENCE_SUBMIT support?
>>>>
>>>> Anything else that needs to be added or removed?
>>>>
>>>> Niranjana
>>>>
>>>>> Niranjana
>>>>>
>>>>>> -Daniel
>>>>>> -- 
>>>>>> Daniel Vetter
>>>>>> Software Engineer, Intel Corporation
>>>>>> http://blog.ffwll.ch
>>>
>>>
> 

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [Intel-gfx] [RFC v3 3/3] drm/doc/rfc: VM_BIND uapi definition
  2022-06-08  7:17       ` Tvrtko Ursulin
@ 2022-06-08  9:12         ` Matthew Auld
  2022-06-08 21:32             ` Niranjana Vishwanathapura
  0 siblings, 1 reply; 121+ messages in thread
From: Matthew Auld @ 2022-06-08  9:12 UTC (permalink / raw)
  To: Tvrtko Ursulin, Niranjana Vishwanathapura
  Cc: intel-gfx, chris.p.wilson, thomas.hellstrom, dri-devel,
	daniel.vetter, christian.koenig

On 08/06/2022 08:17, Tvrtko Ursulin wrote:
> 
> On 07/06/2022 20:37, Niranjana Vishwanathapura wrote:
>> On Tue, Jun 07, 2022 at 11:27:14AM +0100, Tvrtko Ursulin wrote:
>>>
>>> On 17/05/2022 19:32, Niranjana Vishwanathapura wrote:
>>>> VM_BIND and related uapi definitions
>>>>
>>>> v2: Ensure proper kernel-doc formatting with cross references.
>>>>     Also add new uapi and documentation as per review comments
>>>>     from Daniel.
>>>>
>>>> Signed-off-by: Niranjana Vishwanathapura 
>>>> <niranjana.vishwanathapura@intel.com>
>>>> ---
>>>>  Documentation/gpu/rfc/i915_vm_bind.h | 399 +++++++++++++++++++++++++++
>>>>  1 file changed, 399 insertions(+)
>>>>  create mode 100644 Documentation/gpu/rfc/i915_vm_bind.h
>>>>
>>>> diff --git a/Documentation/gpu/rfc/i915_vm_bind.h 
>>>> b/Documentation/gpu/rfc/i915_vm_bind.h
>>>> new file mode 100644
>>>> index 000000000000..589c0a009107
>>>> --- /dev/null
>>>> +++ b/Documentation/gpu/rfc/i915_vm_bind.h
>>>> @@ -0,0 +1,399 @@
>>>> +/* SPDX-License-Identifier: MIT */
>>>> +/*
>>>> + * Copyright © 2022 Intel Corporation
>>>> + */
>>>> +
>>>> +/**
>>>> + * DOC: I915_PARAM_HAS_VM_BIND
>>>> + *
>>>> + * VM_BIND feature availability.
>>>> + * See typedef drm_i915_getparam_t param.
>>>> + */
>>>> +#define I915_PARAM_HAS_VM_BIND        57
>>>> +
>>>> +/**
>>>> + * DOC: I915_VM_CREATE_FLAGS_USE_VM_BIND
>>>> + *
>>>> + * Flag to opt-in for VM_BIND mode of binding during VM creation.
>>>> + * See struct drm_i915_gem_vm_control flags.
>>>> + *
>>>> + * A VM in VM_BIND mode will not support the older execbuff mode of 
>>>> binding.
>>>> + * In VM_BIND mode, execbuff ioctl will not accept any execlist 
>>>> (ie., the
>>>> + * &drm_i915_gem_execbuffer2.buffer_count must be 0).
>>>> + * Also, &drm_i915_gem_execbuffer2.batch_start_offset and
>>>> + * &drm_i915_gem_execbuffer2.batch_len must be 0.
>>>> + * DRM_I915_GEM_EXECBUFFER_EXT_BATCH_ADDRESSES extension must be 
>>>> provided
>>>> + * to pass in the batch buffer addresses.
>>>> + *
>>>> + * Additionally, I915_EXEC_NO_RELOC, I915_EXEC_HANDLE_LUT and
>>>> + * I915_EXEC_BATCH_FIRST of &drm_i915_gem_execbuffer2.flags must be 0
>>>> + * (not used) in VM_BIND mode. I915_EXEC_USE_EXTENSIONS flag must 
>>>> always be
>>>> + * set (See struct drm_i915_gem_execbuffer_ext_batch_addresses).
>>>> + * The buffers_ptr, buffer_count, batch_start_offset and batch_len 
>>>> fields
>>>> + * of struct drm_i915_gem_execbuffer2 are also not used and must be 0.
>>>> + */
>>>> +#define I915_VM_CREATE_FLAGS_USE_VM_BIND    (1 << 0)
>>>> +
>>>> +/**
>>>> + * DOC: I915_CONTEXT_CREATE_FLAGS_LONG_RUNNING
>>>> + *
>>>> + * Flag to declare context as long running.
>>>> + * See struct drm_i915_gem_context_create_ext flags.
>>>> + *
>>>> + * Usage of dma-fence expects that they complete in reasonable 
>>>> amount of time.
>>>> + * Compute on the other hand can be long running. Hence it is not 
>>>> appropriate
>>>> + * for compute contexts to export request completion dma-fence to 
>>>> user.
>>>> + * The dma-fence usage will be limited to in-kernel consumption only.
>>>> + * Compute contexts need to use user/memory fence.
>>>> + *
>>>> + * So, long running contexts do not support output fences. Hence,
>>>> + * I915_EXEC_FENCE_OUT (See &drm_i915_gem_execbuffer2.flags and
>>>> + * I915_EXEC_FENCE_SIGNAL (See &drm_i915_gem_exec_fence.flags) are 
>>>> expected
>>>> + * to be not used.
>>>> + *
>>>> + * DRM_I915_GEM_WAIT ioctl call is also not supported for objects 
>>>> mapped
>>>> + * to long running contexts.
>>>> + */
>>>> +#define I915_CONTEXT_CREATE_FLAGS_LONG_RUNNING   (1u << 2)
>>>> +
>>>> +/* VM_BIND related ioctls */
>>>> +#define DRM_I915_GEM_VM_BIND        0x3d
>>>> +#define DRM_I915_GEM_VM_UNBIND        0x3e
>>>> +#define DRM_I915_GEM_WAIT_USER_FENCE    0x3f
>>>> +
>>>> +#define DRM_IOCTL_I915_GEM_VM_BIND        DRM_IOWR(DRM_COMMAND_BASE 
>>>> + DRM_I915_GEM_VM_BIND, struct drm_i915_gem_vm_bind)
>>>> +#define DRM_IOCTL_I915_GEM_VM_UNBIND DRM_IOWR(DRM_COMMAND_BASE + 
>>>> DRM_I915_GEM_VM_UNBIND, struct drm_i915_gem_vm_bind)
>>>> +#define DRM_IOCTL_I915_GEM_WAIT_USER_FENCE 
>>>> DRM_IOWR(DRM_COMMAND_BASE + DRM_I915_GEM_WAIT_USER_FENCE, struct 
>>>> drm_i915_gem_wait_user_fence)
>>>> +
>>>> +/**
>>>> + * struct drm_i915_gem_vm_bind - VA to object mapping to bind.
>>>> + *
>>>> + * This structure is passed to VM_BIND ioctl and specifies the 
>>>> mapping of GPU
>>>> + * virtual address (VA) range to the section of an object that 
>>>> should be bound
>>>> + * in the device page table of the specified address space (VM).
>>>> + * The VA range specified must be unique (ie., not currently bound) 
>>>> and can
>>>> + * be mapped to whole object or a section of the object (partial 
>>>> binding).
>>>> + * Multiple VA mappings can be created to the same section of the 
>>>> object
>>>> + * (aliasing).
>>>> + */
>>>> +struct drm_i915_gem_vm_bind {
>>>> +    /** @vm_id: VM (address space) id to bind */
>>>> +    __u32 vm_id;
>>>> +
>>>> +    /** @handle: Object handle */
>>>> +    __u32 handle;
>>>> +
>>>> +    /** @start: Virtual Address start to bind */
>>>> +    __u64 start;
>>>> +
>>>> +    /** @offset: Offset in object to bind */
>>>> +    __u64 offset;
>>>> +
>>>> +    /** @length: Length of mapping to bind */
>>>> +    __u64 length;
>>>
>>> Does it support, or should it, equivalent of EXEC_OBJECT_PAD_TO_SIZE? 
>>> Or if not userspace is expected to map the remainder of the space to 
>>> a dummy object? In which case would there be any alignment/padding 
>>> issues preventing the two bind to be placed next to each other?
>>>
>>> I ask because someone from the compute side asked me about a problem 
>>> with their strategy of dealing with overfetch and I suggested pad to 
>>> size.
>>>
>>
>> Thanks Tvrtko,
>> I think we shouldn't be needing it. As with VM_BIND VA assignment
>> is completely pushed to userspace, no padding should be necessary
>> once the 'start' and 'size' alignment conditions are met.
>>
>> I will add some documentation on alignment requirement here.
>> Generally, 'start' and 'size' should be 4K aligned. But, I think
>> when we have 64K lmem page sizes (dg2 and xehpsdv), they need to
>> be 64K aligned.
> 
> + Matt
> 
> Align to 64k is enough for all overfetch issues?
> 
> Apparently compute has a situation where a buffer is received by one 
> component and another has to apply more alignment to it, to deal with 
> overfetch. Since they cannot grow the actual BO, would they want to 
> VM_BIND a scratch area on top? Or perhaps none of this is a problem on 
> discrete and the original BO should be correctly allocated to start with.
> 
> Side question - what about the align to 2MiB mentioned in 
> i915_vma_insert to avoid mixing 4k and 64k PTEs? That does not apply to 
> discrete?

Not sure about the overfetch thing, but yeah dg2 & xehpsdv both require 
a minimum of 64K pages underneath for local memory, and the BO size will 
also be rounded up accordingly. And yeah the complication arises due to 
not being able to mix 4K + 64K GTT pages within the same page-table 
(existed since even gen8). Note that 4K here is what we typically get 
for system memory.

Originally we had a memory coloring scheme to track the "color" of each 
page-table, which basically ensures that userspace can't do something 
nasty like mixing page sizes. The advantage of that scheme is that we 
would only require 64K GTT alignment and no extra padding, but is 
perhaps a little complex.

The merged solution is just to align and pad the vma (i.e. 
vma->node.size and not vma->size) to 2M, which is dead simple 
implementation-wise, but does potentially waste some GTT space and some 
of the local memory used for the actual page-table. For the alignment 
the kernel just validates that the GTT address is aligned to 2M in 
vma_insert(), and then for the padding it just inflates it to 2M, if 
userspace hasn't already.

See the kernel-doc for @size: 
https://dri.freedesktop.org/docs/drm/gpu/driver-uapi.html?#c.drm_i915_gem_create_ext
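
A userspace-side sketch of that rule, just to make it concrete 
(illustrative helper, not a library API):

#include <stdint.h>

#define SZ_2M   (1ull << 21)

static inline uint64_t align_u64(uint64_t x, uint64_t a)
{
        return (x + a - 1) & ~(a - 1);
}

/* For a local-memory object on dg2/xehpsdv: place the GTT address at
 * a 2M boundary and inflate the node to a 2M multiple, so 64K PTEs
 * never end up sharing a page-table with 4K PTEs. The BO size itself
 * is already rounded up to 64K by the kernel for lmem. */
static void place_lmem_bo(uint64_t wanted_addr, uint64_t bo_size,
                          uint64_t *gtt_addr, uint64_t *node_size)
{
        *gtt_addr = align_u64(wanted_addr, SZ_2M);
        *node_size = align_u64(bo_size, SZ_2M);
}

With VM_BIND the resulting *gtt_addr would then be what userspace 
passes as 'start' in the proposed drm_i915_gem_vm_bind above.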

> 
> Regards,
> 
> Tvrtko
> 
>>
>> Niranjana
>>
>>> Regards,
>>>
>>> Tvrtko
>>>
>>>> +
>>>> +    /**
>>>> +     * @flags: Supported flags are,
>>>> +     *
>>>> +     * I915_GEM_VM_BIND_READONLY:
>>>> +     * Mapping is read-only.
>>>> +     *
>>>> +     * I915_GEM_VM_BIND_CAPTURE:
>>>> +     * Capture this mapping in the dump upon GPU error.
>>>> +     */
>>>> +    __u64 flags;
>>>> +#define I915_GEM_VM_BIND_READONLY    (1 << 0)
>>>> +#define I915_GEM_VM_BIND_CAPTURE     (1 << 1)
>>>> +
>>>> +    /** @extensions: 0-terminated chain of extensions for this 
>>>> mapping. */
>>>> +    __u64 extensions;
>>>> +};
>>>> +
>>>> +/**
>>>> + * struct drm_i915_gem_vm_unbind - VA to object mapping to unbind.
>>>> + *
>>>> + * This structure is passed to VM_UNBIND ioctl and specifies the 
>>>> GPU virtual
>>>> + * address (VA) range that should be unbound from the device page 
>>>> table of the
>>>> + * specified address space (VM). The specified VA range must match 
>>>> one of the
>>>> + * mappings created with the VM_BIND ioctl. TLB is flushed upon unbind
>>>> + * completion.
>>>> + */
>>>> +struct drm_i915_gem_vm_unbind {
>>>> +    /** @vm_id: VM (address space) id to bind */
>>>> +    __u32 vm_id;
>>>> +
>>>> +    /** @rsvd: Reserved for future use; must be zero. */
>>>> +    __u32 rsvd;
>>>> +
>>>> +    /** @start: Virtual Address start to unbind */
>>>> +    __u64 start;
>>>> +
>>>> +    /** @length: Length of mapping to unbind */
>>>> +    __u64 length;
>>>> +
>>>> +    /** @flags: reserved for future usage, currently MBZ */
>>>> +    __u64 flags;
>>>> +
>>>> +    /** @extensions: 0-terminated chain of extensions for this 
>>>> mapping. */
>>>> +    __u64 extensions;
>>>> +};
>>>> +
>>>> +/**
>>>> + * struct drm_i915_vm_bind_fence - An input or output fence for the 
>>>> vm_bind
>>>> + * or the vm_unbind work.
>>>> + *
>>>> + * The vm_bind or vm_unbind async worker will wait for input fence 
>>>> to signal
>>>> + * before starting the binding or unbinding.
>>>> + *
>>>> + * The vm_bind or vm_unbind async worker will signal the returned 
>>>> output fence
>>>> + * after the completion of binding or unbinding.
>>>> + */
>>>> +struct drm_i915_vm_bind_fence {
>>>> +    /** @handle: User's handle for a drm_syncobj to wait on or 
>>>> signal. */
>>>> +    __u32 handle;
>>>> +
>>>> +    /**
>>>> +     * @flags: Supported flags are,
>>>> +     *
>>>> +     * I915_VM_BIND_FENCE_WAIT:
>>>> +     * Wait for the input fence before binding/unbinding
>>>> +     *
>>>> +     * I915_VM_BIND_FENCE_SIGNAL:
>>>> +     * Return bind/unbind completion fence as output
>>>> +     */
>>>> +    __u32 flags;
>>>> +#define I915_VM_BIND_FENCE_WAIT            (1<<0)
>>>> +#define I915_VM_BIND_FENCE_SIGNAL          (1<<1)
>>>> +#define __I915_VM_BIND_FENCE_UNKNOWN_FLAGS 
>>>> (-(I915_VM_BIND_FENCE_SIGNAL << 1))
>>>> +};
>>>> +
>>>> +/**
>>>> + * struct drm_i915_vm_bind_ext_timeline_fences - Timeline fences 
>>>> for vm_bind
>>>> + * and vm_unbind.
>>>> + *
>>>> + * This structure describes an array of timeline drm_syncobj and 
>>>> associated
>>>> + * points for timeline variants of drm_syncobj. These timeline 
>>>> 'drm_syncobj's
>>>> + * can be input or output fences (See struct drm_i915_vm_bind_fence).
>>>> + */
>>>> +struct drm_i915_vm_bind_ext_timeline_fences {
>>>> +#define I915_VM_BIND_EXT_timeline_FENCES    0
>>>> +    /** @base: Extension link. See struct i915_user_extension. */
>>>> +    struct i915_user_extension base;
>>>> +
>>>> +    /**
>>>> +     * @fence_count: Number of elements in the @handles_ptr & 
>>>> @value_ptr
>>>> +     * arrays.
>>>> +     */
>>>> +    __u64 fence_count;
>>>> +
>>>> +    /**
>>>> +     * @handles_ptr: Pointer to an array of struct 
>>>> drm_i915_vm_bind_fence
>>>> +     * of length @fence_count.
>>>> +     */
>>>> +    __u64 handles_ptr;
>>>> +
>>>> +    /**
>>>> +     * @values_ptr: Pointer to an array of u64 values of length
>>>> +     * @fence_count.
>>>> +     * Values must be 0 for a binary drm_syncobj. A Value of 0 for a
>>>> +     * timeline drm_syncobj is invalid as it turns a drm_syncobj 
>>>> into a
>>>> +     * binary one.
>>>> +     */
>>>> +    __u64 values_ptr;
>>>> +};
>>>> +
>>>> +/**
>>>> + * struct drm_i915_vm_bind_user_fence - An input or output user 
>>>> fence for the
>>>> + * vm_bind or the vm_unbind work.
>>>> + *
>>>> + * The vm_bind or vm_unbind async worker will wait for the input 
>>>> fence (value at
>>>> + * @addr to become equal to @val) before starting the binding or 
>>>> unbinding.
>>>> + *
>>>> + * The vm_bind or vm_unbind async worker will signal the output 
>>>> fence after
>>>> + * the completion of binding or unbinding by writing @val to memory 
>>>> location at
>>>> + * @addr
>>>> + */
>>>> +struct drm_i915_vm_bind_user_fence {
>>>> +    /** @addr: User/Memory fence qword aligned process virtual 
>>>> address */
>>>> +    __u64 addr;
>>>> +
>>>> +    /** @val: User/Memory fence value to be written after bind 
>>>> completion */
>>>> +    __u64 val;
>>>> +
>>>> +    /**
>>>> +     * @flags: Supported flags are,
>>>> +     *
>>>> +     * I915_VM_BIND_USER_FENCE_WAIT:
>>>> +     * Wait for the input fence before binding/unbinding
>>>> +     *
>>>> +     * I915_VM_BIND_USER_FENCE_SIGNAL:
>>>> +     * Return bind/unbind completion fence as output
>>>> +     */
>>>> +    __u32 flags;
>>>> +#define I915_VM_BIND_USER_FENCE_WAIT            (1<<0)
>>>> +#define I915_VM_BIND_USER_FENCE_SIGNAL          (1<<1)
>>>> +#define __I915_VM_BIND_USER_FENCE_UNKNOWN_FLAGS \
>>>> +    (-(I915_VM_BIND_USER_FENCE_SIGNAL << 1))
>>>> +};
>>>> +
>>>> +/**
>>>> + * struct drm_i915_vm_bind_ext_user_fence - User/memory fences for
>>>> + * vm_bind and vm_unbind.
>>>> + *
>>>> + * These user fences can be input or output fences
>>>> + * (See struct drm_i915_vm_bind_user_fence).
>>>> + */
>>>> +struct drm_i915_vm_bind_ext_user_fence {
>>>> +#define I915_VM_BIND_EXT_USER_FENCES    1
>>>> +    /** @base: Extension link. See struct i915_user_extension. */
>>>> +    struct i915_user_extension base;
>>>> +
>>>> +    /** @fence_count: Number of elements in the @user_fence_ptr array. */
>>>> +    __u64 fence_count;
>>>> +
>>>> +    /**
>>>> +     * @user_fence_ptr: Pointer to an array of
>>>> +     * struct drm_i915_vm_bind_user_fence of length @fence_count.
>>>> +     */
>>>> +    __u64 user_fence_ptr;
>>>> +};
>>>> +
>>>> +/**
>>>> + * struct drm_i915_gem_execbuffer_ext_batch_addresses - Array of batch
>>>> + * buffer gpu virtual addresses.
>>>> + *
>>>> + * In the execbuff ioctl (See struct drm_i915_gem_execbuffer2), this
>>>> + * extension must always be appended in the VM_BIND mode and it will be
>>>> + * an error to append this extension in the older non-VM_BIND mode.
>>>> + */
>>>> +struct drm_i915_gem_execbuffer_ext_batch_addresses {
>>>> +#define DRM_I915_GEM_EXECBUFFER_EXT_BATCH_ADDRESSES    1
>>>> +    /** @base: Extension link. See struct i915_user_extension. */
>>>> +    struct i915_user_extension base;
>>>> +
>>>> +    /** @count: Number of addresses in the addr array. */
>>>> +    __u32 count;
>>>> +
>>>> +    /** @addr: An array of batch gpu virtual addresses. */
>>>> +    __u64 addr[0];
>>>> +};
>>>> +
>>>> +/**
>>>> + * struct drm_i915_gem_execbuffer_ext_user_fence - First level batch
>>>> + * completion signaling extension.
>>>> + *
>>>> + * This extension allows the user to attach a user fence (@addr, @value
>>>> + * pair) to an execbuf, to be signaled by the command streamer after the
>>>> + * completion of the first level batch, by writing the @value at the
>>>> + * specified @addr and triggering an interrupt.
>>>> + * The user can either poll for this user fence to signal or wait on it
>>>> + * with the i915_gem_wait_user_fence ioctl.
>>>> + * This is very useful for long running contexts where waiting on a
>>>> + * dma-fence by the user (like the i915_gem_wait ioctl) is not supported.
>>>> + */
>>>> +struct drm_i915_gem_execbuffer_ext_user_fence {
>>>> +#define DRM_I915_GEM_EXECBUFFER_EXT_USER_FENCE        2
>>>> +    /** @base: Extension link. See struct i915_user_extension. */
>>>> +    struct i915_user_extension base;
>>>> +
>>>> +    /**
>>>> +     * @addr: User/Memory fence qword aligned GPU virtual address.
>>>> +     *
>>>> +     * Address has to be a valid GPU virtual address at the time of
>>>> +     * first level batch completion.
>>>> +     */
>>>> +    __u64 addr;
>>>> +
>>>> +    /**
>>>> +     * @value: User/Memory fence value to be written to the above address
>>>> +     * after the first level batch completes.
>>>> +     */
>>>> +    __u64 value;
>>>> +
>>>> +    /** @rsvd: Reserved for future extensions, MBZ */
>>>> +    __u64 rsvd;
>>>> +};
>>>> +
>>>> +/**
>>>> + * struct drm_i915_gem_create_ext_vm_private - Extension to make the object
>>>> + * private to the specified VM.
>>>> + *
>>>> + * See struct drm_i915_gem_create_ext.
>>>> + */
>>>> +struct drm_i915_gem_create_ext_vm_private {
>>>> +#define I915_GEM_CREATE_EXT_VM_PRIVATE        2
>>>> +    /** @base: Extension link. See struct i915_user_extension. */
>>>> +    struct i915_user_extension base;
>>>> +
>>>> +    /** @vm_id: Id of the VM to which the object is private */
>>>> +    __u32 vm_id;
>>>> +};
>>>> +
>>>> +/**
>>>> + * struct drm_i915_gem_wait_user_fence - Wait on user/memory fence.
>>>> + *
>>>> + * User/Memory fence can be woken up either by:
>>>> + *
>>>> + * 1. GPU context indicated by @ctx_id, or,
>>>> + * 2. Kernel driver async worker upon I915_UFENCE_WAIT_SOFT.
>>>> + *    @ctx_id is ignored when this flag is set.
>>>> + *
>>>> + * Wakeup condition is:
>>>> + * ``((*addr & mask) op (value & mask))``
>>>> + *
>>>> + * See :ref:`Documentation/driver-api/dma-buf.rst <indefinite_dma_fences>`
>>>> + */
>>>> +struct drm_i915_gem_wait_user_fence {
>>>> +    /** @extensions: Zero-terminated chain of extensions. */
>>>> +    __u64 extensions;
>>>> +
>>>> +    /** @addr: User/Memory fence address */
>>>> +    __u64 addr;
>>>> +
>>>> +    /** @ctx_id: Id of the context which will signal the fence. */
>>>> +    __u32 ctx_id;
>>>> +
>>>> +    /** @op: Wakeup condition operator */
>>>> +    __u16 op;
>>>> +#define I915_UFENCE_WAIT_EQ      0
>>>> +#define I915_UFENCE_WAIT_NEQ     1
>>>> +#define I915_UFENCE_WAIT_GT      2
>>>> +#define I915_UFENCE_WAIT_GTE     3
>>>> +#define I915_UFENCE_WAIT_LT      4
>>>> +#define I915_UFENCE_WAIT_LTE     5
>>>> +#define I915_UFENCE_WAIT_BEFORE  6
>>>> +#define I915_UFENCE_WAIT_AFTER   7
>>>> +
>>>> +    /**
>>>> +     * @flags: Supported flags are,
>>>> +     *
>>>> +     * I915_UFENCE_WAIT_SOFT:
>>>> +     *
>>>> +     * To be woken up by i915 driver async worker (not by GPU).
>>>> +     *
>>>> +     * I915_UFENCE_WAIT_ABSTIME:
>>>> +     *
>>>> +     * Wait timeout specified as absolute time.
>>>> +     */
>>>> +    __u16 flags;
>>>> +#define I915_UFENCE_WAIT_SOFT    0x1
>>>> +#define I915_UFENCE_WAIT_ABSTIME 0x2
>>>> +
>>>> +    /** @value: Wakeup value */
>>>> +    __u64 value;
>>>> +
>>>> +    /** @mask: Wakeup mask */
>>>> +    __u64 mask;
>>>> +#define I915_UFENCE_WAIT_U8     0xffu
>>>> +#define I915_UFENCE_WAIT_U16    0xffffu
>>>> +#define I915_UFENCE_WAIT_U32    0xfffffffful
>>>> +#define I915_UFENCE_WAIT_U64    0xffffffffffffffffull
>>>> +
>>>> +    /**
>>>> +     * @timeout: Wait timeout in nanoseconds.
>>>> +     *
>>>> +     * If the I915_UFENCE_WAIT_ABSTIME flag is set, then the timeout is
>>>> +     * the absolute time in nsec.
>>>> +     */
>>>> +    __s64 timeout;
>>>> +};
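
For illustration, here is a minimal userspace sketch of waiting on such a
user/memory fence via the proposed ioctl. The struct and flag names come from
the RFC header above; the DRM_IOCTL_I915_GEM_WAIT_USER_FENCE ioctl number is
an assumption, since the RFC does not define one yet:

#include <stdint.h>
#include <sys/ioctl.h>

static int wait_fence_gte_u64(int drm_fd, uint64_t *fence_cpu_va,
			      uint64_t wanted)
{
	struct drm_i915_gem_wait_user_fence wait = {
		.addr = (uint64_t)(uintptr_t)fence_cpu_va, /* qword aligned */
		.op = I915_UFENCE_WAIT_GTE,
		/* Woken by the driver async worker; @ctx_id is ignored. */
		.flags = I915_UFENCE_WAIT_SOFT,
		.value = wanted,
		.mask = I915_UFENCE_WAIT_U64,
		.timeout = 1000000000, /* 1s, relative (ABSTIME not set) */
	};

	/* Returns once ((*addr & mask) >= (value & mask)) or on timeout. */
	return ioctl(drm_fd, DRM_IOCTL_I915_GEM_WAIT_USER_FENCE, &wait);
}

Similarly, a sketch of making an object private to a single VM by chaining
the proposed I915_GEM_CREATE_EXT_VM_PRIVATE extension to the existing
gem_create_ext ioctl, assuming vm_id was returned earlier by
DRM_IOCTL_I915_GEM_VM_CREATE:

	struct drm_i915_gem_create_ext_vm_private vm_priv = {
		.base = { .name = I915_GEM_CREATE_EXT_VM_PRIVATE },
		.vm_id = vm_id,
	};
	struct drm_i915_gem_create_ext create = {
		.size = 4096,
		.extensions = (uint64_t)(uintptr_t)&vm_priv,
	};

	/* On success, create.handle is usable only within vm_id. */
	ioctl(drm_fd, DRM_IOCTL_I915_GEM_CREATE_EXT, &create);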

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [RFC v3 2/3] drm/i915: Update i915 uapi documentation
  2022-05-17 18:32   ` Niranjana Vishwanathapura
@ 2022-06-08 11:24     ` Matthew Auld
  -1 siblings, 0 replies; 121+ messages in thread
From: Matthew Auld @ 2022-06-08 11:24 UTC (permalink / raw)
  To: Niranjana Vishwanathapura
  Cc: Matthew Brost, Intel Graphics Development, ML dri-devel,
	Thomas Hellström, Chris Wilson, Jason Ekstrand,
	Daniel Vetter, Christian König

On Tue, 17 May 2022 at 19:32, Niranjana Vishwanathapura
<niranjana.vishwanathapura@intel.com> wrote:
>
> Add some missing i915 uapi documentation which the new
> i915 VM_BIND feature documentation will refer to.
>
> Signed-off-by: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>
> ---
>  include/uapi/drm/i915_drm.h | 153 +++++++++++++++++++++++++++---------
>  1 file changed, 116 insertions(+), 37 deletions(-)
>
> diff --git a/include/uapi/drm/i915_drm.h b/include/uapi/drm/i915_drm.h
> index a2def7b27009..8c834a31b56f 100644
> --- a/include/uapi/drm/i915_drm.h
> +++ b/include/uapi/drm/i915_drm.h
> @@ -751,9 +751,16 @@ typedef struct drm_i915_irq_wait {
>
>  /* Must be kept compact -- no holes and well documented */
>
> +/**
> + * typedef drm_i915_getparam_t - Driver parameter query structure.

This one looks funny in the rendered html for some reason, since it
doesn't seem to emit the @param and @value, I guess it doesn't really
understand typedef <struct> ?

Maybe make this "struct drm_i915_getparam - Driver parameter query structure." ?

> + */
>  typedef struct drm_i915_getparam {
> +       /** @param: Driver parameter to query. */
>         __s32 param;
> -       /*
> +
> +       /**
> +        * @value: Address of memory where queried value should be put.
> +        *
>          * WARNING: Using pointers instead of fixed-size u64 means we need to write
>          * compat32 code. Don't repeat this mistake.
>          */
> @@ -1239,76 +1246,114 @@ struct drm_i915_gem_exec_object2 {
>         __u64 rsvd2;
>  };
>
> +/**
> + * struct drm_i915_gem_exec_fence - An input or output fence for the execbuff

s/execbuff/execbuf/, at least that seems to be what we use elsewhere, AFAICT.

> + * ioctl.
> + *
> + * The request will wait for input fence to signal before submission.
> + *
> + * The returned output fence will be signaled after the completion of the
> + * request.
> + */
>  struct drm_i915_gem_exec_fence {
> -       /**
> -        * User's handle for a drm_syncobj to wait on or signal.
> -        */
> +       /** @handle: User's handle for a drm_syncobj to wait on or signal. */
>         __u32 handle;
>
> +       /**
> +        * @flags: Supported flags are,

are:

> +        *
> +        * I915_EXEC_FENCE_WAIT:
> +        * Wait for the input fence before request submission.
> +        *
> +        * I915_EXEC_FENCE_SIGNAL:
> +        * Return request completion fence as output
> +        */
> +       __u32 flags;
>  #define I915_EXEC_FENCE_WAIT            (1<<0)
>  #define I915_EXEC_FENCE_SIGNAL          (1<<1)
>  #define __I915_EXEC_FENCE_UNKNOWN_FLAGS (-(I915_EXEC_FENCE_SIGNAL << 1))
> -       __u32 flags;
>  };
>
> -/*
> - * See drm_i915_gem_execbuffer_ext_timeline_fences.
> - */
> -#define DRM_I915_GEM_EXECBUFFER_EXT_TIMELINE_FENCES 0
> -
> -/*
> +/**
> + * struct drm_i915_gem_execbuffer_ext_timeline_fences - Timeline fences
> + * for execbuff.
> + *
>   * This structure describes an array of drm_syncobj and associated points for
>   * timeline variants of drm_syncobj. It is invalid to append this structure to
>   * the execbuf if I915_EXEC_FENCE_ARRAY is set.
>   */
>  struct drm_i915_gem_execbuffer_ext_timeline_fences {
> +#define DRM_I915_GEM_EXECBUFFER_EXT_TIMELINE_FENCES 0
> +       /** @base: Extension link. See struct i915_user_extension. */
>         struct i915_user_extension base;
>
>         /**
> -        * Number of element in the handles_ptr & value_ptr arrays.
> +        * @fence_count: Number of element in the @handles_ptr & @value_ptr

s/element/elements/

> +        * arrays.
>          */
>         __u64 fence_count;
>
>         /**
> -        * Pointer to an array of struct drm_i915_gem_exec_fence of length
> -        * fence_count.
> +        * @handles_ptr: Pointer to an array of struct drm_i915_gem_exec_fence
> +        * of length @fence_count.
>          */
>         __u64 handles_ptr;
>
>         /**
> -        * Pointer to an array of u64 values of length fence_count. Values
> -        * must be 0 for a binary drm_syncobj. A Value of 0 for a timeline
> -        * drm_syncobj is invalid as it turns a drm_syncobj into a binary one.
> +        * @values_ptr: Pointer to an array of u64 values of length
> +        * @fence_count.
> +        * Values must be 0 for a binary drm_syncobj. A Value of 0 for a
> +        * timeline drm_syncobj is invalid as it turns a drm_syncobj into a
> +        * binary one.
>          */
>         __u64 values_ptr;
>  };
>
> +/**
> + * struct drm_i915_gem_execbuffer2 - Structure for execbuff submission
> + */
>  struct drm_i915_gem_execbuffer2 {
> -       /**
> -        * List of gem_exec_object2 structs
> -        */
> +       /** @buffers_ptr: Pointer to a list of gem_exec_object2 structs */
>         __u64 buffers_ptr;
> +
> +       /** @buffer_count: Number of elements in @buffers_ptr array */
>         __u32 buffer_count;
>
> -       /** Offset in the batchbuffer to start execution from. */
> +       /**
> +        * @batch_start_offset: Offset in the batchbuffer to start execution
> +        * from.
> +        */
>         __u32 batch_start_offset;
> -       /** Bytes used in batchbuffer from batch_start_offset */
> +
> +       /** @batch_len: Bytes used in batchbuffer from batch_start_offset */

"Length in bytes of the batchbuffer, otherwise assumed to be the
object size if zero, starting from the @batch_start_offset."

>         __u32 batch_len;
> +
> +       /** @DR1: deprecated */
>         __u32 DR1;
> +
> +       /** @DR4: deprecated */
>         __u32 DR4;
> +
> +       /** @num_cliprects: See @cliprects_ptr */
>         __u32 num_cliprects;
> +
>         /**
> -        * This is a struct drm_clip_rect *cliprects if I915_EXEC_FENCE_ARRAY
> -        * & I915_EXEC_USE_EXTENSIONS are not set.
> +        * @cliprects_ptr: Kernel clipping was a DRI1 misfeature.
> +        *
> +        * It is invalid to use this field if I915_EXEC_FENCE_ARRAY or
> +        * I915_EXEC_USE_EXTENSIONS flags are not set.
>          *
>          * If I915_EXEC_FENCE_ARRAY is set, then this is a pointer to an array
> -        * of struct drm_i915_gem_exec_fence and num_cliprects is the length
> -        * of the array.
> +        * of &drm_i915_gem_exec_fence and @num_cliprects is the length of the
> +        * array.
>          *
>          * If I915_EXEC_USE_EXTENSIONS is set, then this is a pointer to a
> -        * single struct i915_user_extension and num_cliprects is 0.
> +        * single &i915_user_extension and num_cliprects is 0.
>          */
>         __u64 cliprects_ptr;
> +
> +       /** @flags: Execbuff flags */

s/Execbuff/Execbuf/

Could maybe document the I915_EXEC_* also, or maybe not ;)

> +       __u64 flags;
>  #define I915_EXEC_RING_MASK              (0x3f)
>  #define I915_EXEC_DEFAULT                (0<<0)
>  #define I915_EXEC_RENDER                 (1<<0)
> @@ -1326,10 +1371,6 @@ struct drm_i915_gem_execbuffer2 {
>  #define I915_EXEC_CONSTANTS_REL_GENERAL (0<<6) /* default */
>  #define I915_EXEC_CONSTANTS_ABSOLUTE   (1<<6)
>  #define I915_EXEC_CONSTANTS_REL_SURFACE (2<<6) /* gen4/5 only */
> -       __u64 flags;
> -       __u64 rsvd1; /* now used for context info */
> -       __u64 rsvd2;
> -};
>
>  /** Resets the SO write offset registers for transform feedback on gen7. */
>  #define I915_EXEC_GEN7_SOL_RESET       (1<<8)
> @@ -1432,9 +1473,23 @@ struct drm_i915_gem_execbuffer2 {
>   * drm_i915_gem_execbuffer_ext enum.
>   */
>  #define I915_EXEC_USE_EXTENSIONS       (1 << 21)
> -
>  #define __I915_EXEC_UNKNOWN_FLAGS (-(I915_EXEC_USE_EXTENSIONS << 1))
>
> +       /** @rsvd1: Context id */
> +       __u64 rsvd1;
> +
> +       /**
> +        * @rsvd2: in and out sync_file file descriptors.
> +        *
> +        * When I915_EXEC_FENCE_IN or I915_EXEC_FENCE_SUBMIT flag is set, the
> +        * lower 32 bits of this field will have the in sync_file fd (input).
> +        *
> +        * When I915_EXEC_FENCE_OUT flag is set, the upper 32 bits of this
> +        * field will have the out sync_file fd (output).
> +        */
> +       __u64 rsvd2;
> +};
> +
>  #define I915_EXEC_CONTEXT_ID_MASK      (0xffffffff)
>  #define i915_execbuffer2_set_context_id(eb2, context) \
>         (eb2).rsvd1 = context & I915_EXEC_CONTEXT_ID_MASK
> @@ -1814,13 +1869,32 @@ struct drm_i915_gem_context_create {
>         __u32 pad;
>  };
>
> +/**
> + * struct drm_i915_gem_context_create_ext - Structure for creating contexts.
> + */
>  struct drm_i915_gem_context_create_ext {
> -       __u32 ctx_id; /* output: id of new context*/
> +       /** @ctx_id: Id of the created context (output) */
> +       __u32 ctx_id;
> +
> +       /**
> +        * @flags: Supported flags are,

are:

> +        *
> +        * I915_CONTEXT_CREATE_FLAGS_USE_EXTENSIONS:
> +        *
> +        * Extensions may be appended to this structure and driver must check
> +        * for those.

Maybe add "See @extensions.", and then....

> +        *
> +        * I915_CONTEXT_CREATE_FLAGS_SINGLE_TIMELINE
> +        *
> +        * Created context will have single timeline.
> +        */
>         __u32 flags;
>  #define I915_CONTEXT_CREATE_FLAGS_USE_EXTENSIONS       (1u << 0)
>  #define I915_CONTEXT_CREATE_FLAGS_SINGLE_TIMELINE      (1u << 1)
>  #define I915_CONTEXT_CREATE_FLAGS_UNKNOWN \
>         (-(I915_CONTEXT_CREATE_FLAGS_SINGLE_TIMELINE << 1))
> +
> +       /** @extensions: Zero-terminated chain of extensions. */

...here perhaps list the extensions, and maybe also move the #define
for each here? See for example @extensions in drm_i915_gem_create_ext.

Reviewed-by: Matthew Auld <matthew.auld@intel.com>

>         __u64 extensions;
>  };
>
> @@ -2387,7 +2461,9 @@ struct drm_i915_gem_context_destroy {
>         __u32 pad;
>  };
>
> -/*
> +/**
> + * struct drm_i915_gem_vm_control - Structure to create or destroy VM.
> + *
>   * DRM_I915_GEM_VM_CREATE -
>   *
>   * Create a new virtual memory address space (ppGTT) for use within a context
> @@ -2397,20 +2473,23 @@ struct drm_i915_gem_context_destroy {
>   * The id of new VM (bound to the fd) for use with I915_CONTEXT_PARAM_VM is
>   * returned in the outparam @id.
>   *
> - * No flags are defined, with all bits reserved and must be zero.
> - *
>   * An extension chain maybe provided, starting with @extensions, and terminated
>   * by the @next_extension being 0. Currently, no extensions are defined.
>   *
>   * DRM_I915_GEM_VM_DESTROY -
>   *
> - * Destroys a previously created VM id, specified in @id.
> + * Destroys a previously created VM id, specified in @vm_id.
>   *
>   * No extensions or flags are allowed currently, and so must be zero.
>   */
>  struct drm_i915_gem_vm_control {
> +       /** @extensions: Zero-terminated chain of extensions. */
>         __u64 extensions;
> +
> +       /** @flags: reserved for future usage, currently MBZ */
>         __u32 flags;
> +
> +       /** @vm_id: Id of the VM created or to be destroyed */
>         __u32 vm_id;
>  };
>
> --
> 2.21.0.rc0.32.g243a4c7e27
>

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [Intel-gfx] [RFC v3 3/3] drm/doc/rfc: VM_BIND uapi definition
  2022-06-08  7:34                   ` Tvrtko Ursulin
@ 2022-06-08 19:52                     ` Niranjana Vishwanathapura
  0 siblings, 0 replies; 121+ messages in thread
From: Niranjana Vishwanathapura @ 2022-06-08 19:52 UTC (permalink / raw)
  To: Tvrtko Ursulin
  Cc: Zanoni, Paulo R, intel-gfx, dri-devel, Hellstrom, Thomas, Wilson,
	Chris P, Vetter, Daniel, christian.koenig

On Wed, Jun 08, 2022 at 08:34:36AM +0100, Tvrtko Ursulin wrote:
>
>On 07/06/2022 22:25, Niranjana Vishwanathapura wrote:
>>On Tue, Jun 07, 2022 at 11:42:08AM +0100, Tvrtko Ursulin wrote:
>>>
>>>On 03/06/2022 07:53, Niranjana Vishwanathapura wrote:
>>>>On Wed, Jun 01, 2022 at 10:08:35PM -0700, Niranjana 
>>>>Vishwanathapura wrote:
>>>>>On Wed, Jun 01, 2022 at 11:27:17AM +0200, Daniel Vetter wrote:
>>>>>>On Wed, 1 Jun 2022 at 11:03, Dave Airlie <airlied@gmail.com> wrote:
>>>>>>>
>>>>>>>On Tue, 24 May 2022 at 05:20, Niranjana Vishwanathapura
>>>>>>><niranjana.vishwanathapura@intel.com> wrote:
>>>>>>>>
>>>>>>>>On Thu, May 19, 2022 at 04:07:30PM -0700, Zanoni, Paulo R wrote:
>>>>>>>>>On Tue, 2022-05-17 at 11:32 -0700, Niranjana Vishwanathapura wrote:
>>>>>>>>>>VM_BIND and related uapi definitions
>>>>>>>>>>
>>>>>>>>>>v2: Ensure proper kernel-doc formatting with cross references.
>>>>>>>>>>     Also add new uapi and documentation as per review comments
>>>>>>>>>>     from Daniel.
>>>>>>>>>>
>>>>>>>>>>Signed-off-by: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>
>>>>>>>>>>---
>>>>>>>>>>  Documentation/gpu/rfc/i915_vm_bind.h | 399 +++++++++++++++++++++++++++
>>>>>>>>>>  1 file changed, 399 insertions(+)
>>>>>>>>>>  create mode 100644 Documentation/gpu/rfc/i915_vm_bind.h
>>>>>>>>>>
>>>>>>>>>>diff --git a/Documentation/gpu/rfc/i915_vm_bind.h b/Documentation/gpu/rfc/i915_vm_bind.h
>>>>>>>>>>new file mode 100644
>>>>>>>>>>index 000000000000..589c0a009107
>>>>>>>>>>--- /dev/null
>>>>>>>>>>+++ b/Documentation/gpu/rfc/i915_vm_bind.h
>>>>>>>>>>@@ -0,0 +1,399 @@
>>>>>>>>>>+/* SPDX-License-Identifier: MIT */
>>>>>>>>>>+/*
>>>>>>>>>>+ * Copyright © 2022 Intel Corporation
>>>>>>>>>>+ */
>>>>>>>>>>+
>>>>>>>>>>+/**
>>>>>>>>>>+ * DOC: I915_PARAM_HAS_VM_BIND
>>>>>>>>>>+ *
>>>>>>>>>>+ * VM_BIND feature availability.
>>>>>>>>>>+ * See typedef drm_i915_getparam_t param.
>>>>>>>>>>+ */
>>>>>>>>>>+#define I915_PARAM_HAS_VM_BIND               57
>>>>>>>>>>+
>>>>>>>>>>+/**
>>>>>>>>>>+ * DOC: I915_VM_CREATE_FLAGS_USE_VM_BIND
>>>>>>>>>>+ *
>>>>>>>>>>+ * Flag to opt-in for VM_BIND mode of binding during VM creation.
>>>>>>>>>>+ * See struct drm_i915_gem_vm_control flags.
>>>>>>>>>>+ *
>>>>>>>>>>+ * A VM in VM_BIND mode will not support the older execbuff mode of binding.
>>>>>>>>>>+ * In VM_BIND mode, execbuff ioctl will not accept any execlist (ie., the
>>>>>>>>>>+ * &drm_i915_gem_execbuffer2.buffer_count must be 0).
>>>>>>>>>>+ * Also, &drm_i915_gem_execbuffer2.batch_start_offset and
>>>>>>>>>>+ * &drm_i915_gem_execbuffer2.batch_len must be 0.
>>>>>>>>>>+ * DRM_I915_GEM_EXECBUFFER_EXT_BATCH_ADDRESSES extension must be provided
>>>>>>>>>>+ * to pass in the batch buffer addresses.
>>>>>>>>>>+ *
>>>>>>>>>>+ * Additionally, I915_EXEC_NO_RELOC, I915_EXEC_HANDLE_LUT and
>>>>>>>>>>+ * I915_EXEC_BATCH_FIRST of &drm_i915_gem_execbuffer2.flags must be 0
>>>>>>>>>>+ * (not used) in VM_BIND mode. I915_EXEC_USE_EXTENSIONS flag must always be
>>>>>>>>>>+ * set (See struct drm_i915_gem_execbuffer_ext_batch_addresses).
>>>>>>>>>>+ * The buffers_ptr, buffer_count, batch_start_offset and batch_len fields
>>>>>>>>>>+ * of struct drm_i915_gem_execbuffer2 are also not used and must be 0.
>>>>>>>>>>+ */
>>>>>>>>>
>>>>>>>>>From that description, it seems we have:
>>>>>>>>>
>>>>>>>>>struct drm_i915_gem_execbuffer2 {
>>>>>>>>>        __u64 buffers_ptr;              -> must be 0 (new)
>>>>>>>>>        __u32 buffer_count;             -> must be 0 (new)
>>>>>>>>>        __u32 batch_start_offset;       -> must be 0 (new)
>>>>>>>>>        __u32 batch_len;                -> must be 0 (new)
>>>>>>>>>        __u32 DR1;                      -> must be 0 (old)
>>>>>>>>>        __u32 DR4;                      -> must be 0 (old)
>>>>>>>>>        __u32 num_cliprects; (fences)   -> must be 0 since using extensions
>>>>>>>>>        __u64 cliprects_ptr; (fences, extensions) -> contains an actual pointer!
>>>>>>>>>        __u64 flags;                    -> some flags must be 0 (new)
>>>>>>>>>        __u64 rsvd1; (context info)     -> repurposed field (old)
>>>>>>>>>        __u64 rsvd2;                    -> unused
>>>>>>>>>};
>>>>>>>>>
>>>>>>>>>Based on that, why can't we just get drm_i915_gem_execbuffer3 instead
>>>>>>>>>of adding even more complexity to an already abused interface? While
>>>>>>>>>the Vulkan-like extension thing is really nice, I don't think what
>>>>>>>>>we're doing here is extending the ioctl usage, we're completely
>>>>>>>>>changing how the base struct should be interpreted based on how the VM
>>>>>>>>>was created (which is an entirely different ioctl).
>>>>>>>>>
>>>>>>>>>From Rusty Russell's API Design grading, drm_i915_gem_execbuffer2 is
>>>>>>>>>already at -6 without these changes. I think after vm_bind we'll need
>>>>>>>>>to create a -11 entry just to deal with this ioctl.
>>>>>>>>>
>>>>>>>>
>>>>>>>>The only change here is removing the execlist support for VM_BIND
>>>>>>>>mode (other than natural extensions).
>>>>>>>>Adding a new execbuffer3 was considered, but I think we need to be
>>>>>>>>careful with that as that goes beyond the VM_BIND support, including
>>>>>>>>any future requirements (as we don't want an execbuffer4 after VM_BIND).
>>>>>>>
>>>>>>>Why not? it's not like adding extensions here is really 
>>>>>>>that different
>>>>>>>than adding new ioctls.
>>>>>>>
>>>>>>>I definitely think this deserves an execbuffer3 without even
>>>>>>>considering future requirements. Just  to burn down the old
>>>>>>>requirements and pointless fields.
>>>>>>>
>>>>>>>Make execbuffer3 be vm bind only, no relocs, no legacy 
>>>>>>>bits, leave the
>>>>>>>older sw on execbuf2 for ever.
>>>>>>
>>>>>>I guess another point in favour of execbuf3 would be that it's less
>>>>>>midlayer. If we share the entry point then there's quite a few vfuncs
>>>>>>needed to cleanly split out the vm_bind paths from the legacy
>>>>>>reloc/softping paths.
>>>>>>
>>>>>>If we invert this and do execbuf3, then there's the existing ioctl
>>>>>>vfunc, and then we share code (where it even makes sense, probably
>>>>>>request setup/submit need to be shared, anything else is probably
>>>>>>cleaner to just copypaste) with the usual helper approach.
>>>>>>
>>>>>>Also that would guarantee that really none of the old concepts like
>>>>>>i915_active on the vma or vma open counts and all that stuff leaks
>>>>>>into the new vm_bind execbuf.
>>>>>>
>>>>>>Finally I also think that copypasting would make backporting easier,
>>>>>>or at least more flexible, since it should make it easier to have the
>>>>>>upstream vm_bind co-exist with all the other things we have. Without
>>>>>>huge amounts of conflicts (or at least much less) that pushing a pile
>>>>>>of vfuncs into the existing code would cause.
>>>>>>
>>>>>>So maybe we should do this?
>>>>>
>>>>>Thanks Dave, Daniel.
>>>>>There are a few things that will be common between execbuf2 and
>>>>>execbuf3, like request setup/submit (as you said), fence 
>>>>>handling (timeline fences, fence array, composite fences), 
>>>>>engine selection,
>>>>>etc. Also, many of the 'flags' will be there in execbuf3 also (but
>>>>>bit position will differ).
>>>>>But I guess these should be fine as the suggestion here is to
>>>>>copy-paste the execbuff code and having a shared code where possible.
>>>>>Besides, we can stop supporting some older feature in execbuff3
>>>>>(like fence array in favor of newer timeline fences), which will
>>>>>further reduce common code.
>>>>>
>>>>>Ok, I will update this series by adding execbuf3 and send out soon.
>>>>>
>>>>
>>>>Does this sound reasonable?
>>>>
>>>>struct drm_i915_gem_execbuffer3 {
>>>>       __u32 ctx_id;        /* previously execbuffer2.rsvd1 */
>>>>
>>>>       __u32 batch_count;
>>>>       __u64 batch_addr_ptr;    /* Pointer to an array of batch gpu virtual addresses */
>>>
>>>Casual stumble upon..
>>>
>>>Alternatively you could embed N pointers to make life a bit easier 
>>>for both userspace and kernel side. Yes, but then "N batch buffers 
>>>should be enough for everyone" problem.. :)
>>>
>>
>>Thanks Tvrtko,
>>Yes, hence the batch_addr_ptr.
>
>Right, but then userspace has to allocate a separate buffer and kernel 
>has to access it separately from a single copy_from_user. Pros and 
>cons of "this many batches should be enough for everyone" versus the 
>extra operations.
>
>Hmm.. for the common case of one batch - you could define the uapi to 
>say if batch_count is one then pointer is GPU VA to the batch itself, 
>not a pointer to userspace array of GPU VA?
>

Yah, we can do that, i.e., batch_addr_ptr is the batch VA when batch_count
is 1. Otherwise, it is a pointer to an array of batch VAs.

The other option is to move multi-batch support to an extension and only
have batch_addr here (i.e., support for one batch only).

I like the former one better (the one you suggested).
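
A sketch of that convention from the userspace side (the eb3 field names
follow the execbuf3 proposal in this thread and are not final uapi):

	/* batch_va: array of GPU virtual addresses, batch_count entries */
	if (batch_count == 1)
		eb3.batch_address = batch_va[0]; /* the batch VA itself */
	else
		eb3.batch_address = (uint64_t)(uintptr_t)batch_va; /* VA array */
	eb3.batch_count = batch_count;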

Niranjana

>Regards,
>
>Tvrtko
>
>>>>       __u64 flags;
>>>>#define I915_EXEC3_RING_MASK              (0x3f)
>>>>#define I915_EXEC3_DEFAULT                (0<<0)
>>>>#define I915_EXEC3_RENDER                 (1<<0)
>>>>#define I915_EXEC3_BSD                    (2<<0)
>>>>#define I915_EXEC3_BLT                    (3<<0)
>>>>#define I915_EXEC3_VEBOX                  (4<<0)
>>>>
>>>>#define I915_EXEC3_SECURE               (1<<6)
>>>>#define I915_EXEC3_IS_PINNED            (1<<7)
>>>>
>>>>#define I915_EXEC3_BSD_SHIFT     (8)
>>>>#define I915_EXEC3_BSD_MASK      (3 << I915_EXEC3_BSD_SHIFT)
>>>>#define I915_EXEC3_BSD_DEFAULT   (0 << I915_EXEC3_BSD_SHIFT)
>>>>#define I915_EXEC3_BSD_RING1     (1 << I915_EXEC3_BSD_SHIFT)
>>>>#define I915_EXEC3_BSD_RING2     (2 << I915_EXEC3_BSD_SHIFT)
>>>
>>>I'd suggest legacy engine selection is unwanted, especially not 
>>>with the convoluted BSD1/2 flags. Can we just require context with 
>>>engine map and index? Or if default context has to be supported 
>>>then I'd suggest ...class_instance for that mode.
>>>
>>
>>Ok, I will be happy to remove it and only support contexts with
>>engine map, if UMDs agree on that.
>>
>>>>#define I915_EXEC3_FENCE_IN             (1<<10)
>>>>#define I915_EXEC3_FENCE_OUT            (1<<11)
>>>>#define I915_EXEC3_FENCE_SUBMIT         (1<<12)
>>>
>>>People are likely to object to submit fence since generic 
>>>mechanism to align submissions was rejected.
>>>
>>
>>Ok, again, I can remove it if UMDs are ok with it.
>>
>>>>
>>>>       __u64 in_out_fence;        /* previously execbuffer2.rsvd2 */
>>>
>>>New ioctl you can afford dedicated fields.
>>>
>>
>>Yes, but as I asked below, I am not sure if we need this or the
>>timeline fence arry extension we have is good enough.
>>
>>>In any case I suggest you involve UMD folks in designing it.
>>>
>>
>>Yah.
>>Paulo, Lionel, Jason, Daniel, can you comment on these regarding
>>what will UMD need in execbuf3 and what can be removed?
>>
>>Thanks,
>>Niranjana
>>
>>>Regards,
>>>
>>>Tvrtko
>>>
>>>>
>>>>       __u64 extensions;        /* currently only for DRM_I915_GEM_EXECBUFFER_EXT_TIMELINE_FENCES */
>>>>};
>>>>
>>>>With this, user can pass in batch addresses and count directly,
>>>>instead of as an extension (as this rfc series was proposing).
>>>>
>>>>I have removed many of the flags which were either legacy or not
>>>>applicable to BM_BIND mode.
>>>>I have also removed fence array support (execbuffer2.cliprects_ptr)
>>>>as we have timeline fence array support. Is that fine?
>>>>Do we still need FENCE_IN/FENCE_OUT/FENCE_SUBMIT support?
>>>>
>>>>Any thing else needs to be added or removed?
>>>>
>>>>Niranjana
>>>>
>>>>>Niranjana
>>>>>
>>>>>>-Daniel
>>>>>>-- 
>>>>>>Daniel Vetter
>>>>>>Software Engineer, Intel Corporation
>>>>>>http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [Intel-gfx] [RFC v3 3/3] drm/doc/rfc: VM_BIND uapi definition
  2022-06-08  8:54                     ` Tvrtko Ursulin
@ 2022-06-08 20:45                         ` Niranjana Vishwanathapura
  0 siblings, 0 replies; 121+ messages in thread
From: Niranjana Vishwanathapura @ 2022-06-08 20:45 UTC (permalink / raw)
  To: Tvrtko Ursulin
  Cc: Wilson, Chris P, Zanoni, Paulo R, intel-gfx, dri-devel,
	Hellstrom, Thomas, Lionel Landwerlin, Vetter, Daniel,
	christian.koenig

On Wed, Jun 08, 2022 at 09:54:24AM +0100, Tvrtko Ursulin wrote:
>
>On 08/06/2022 09:45, Lionel Landwerlin wrote:
>>On 08/06/2022 11:36, Tvrtko Ursulin wrote:
>>>
>>>On 08/06/2022 07:40, Lionel Landwerlin wrote:
>>>>On 03/06/2022 09:53, Niranjana Vishwanathapura wrote:
>>>>>On Wed, Jun 01, 2022 at 10:08:35PM -0700, Niranjana 
>>>>>Vishwanathapura wrote:
>>>>>>On Wed, Jun 01, 2022 at 11:27:17AM +0200, Daniel Vetter wrote:
>>>>>>>On Wed, 1 Jun 2022 at 11:03, Dave Airlie <airlied@gmail.com> wrote:
>>>>>>>>
>>>>>>>>On Tue, 24 May 2022 at 05:20, Niranjana Vishwanathapura
>>>>>>>><niranjana.vishwanathapura@intel.com> wrote:
>>>>>>>>>
>>>>>>>>>On Thu, May 19, 2022 at 04:07:30PM -0700, Zanoni, Paulo R wrote:
>>>>>>>>>>On Tue, 2022-05-17 at 11:32 -0700, Niranjana 
>>>>>>>>>Vishwanathapura wrote:
>>>>>>>>>>> VM_BIND and related uapi definitions
>>>>>>>>>>>
>>>>>>>>>>> v2: Ensure proper kernel-doc formatting with cross references.
>>>>>>>>>>>     Also add new uapi and documentation as per review comments
>>>>>>>>>>>     from Daniel.
>>>>>>>>>>>
>>>>>>>>>>> Signed-off-by: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>
>>>>>>>>>>> ---
>>>>>>>>>>>  Documentation/gpu/rfc/i915_vm_bind.h | 399 +++++++++++++++++++++++++++
>>>>>>>>>>>  1 file changed, 399 insertions(+)
>>>>>>>>>>>  create mode 100644 Documentation/gpu/rfc/i915_vm_bind.h
>>>>>>>>>>>
>>>>>>>>>>> diff --git a/Documentation/gpu/rfc/i915_vm_bind.h b/Documentation/gpu/rfc/i915_vm_bind.h
>>>>>>>>>>> new file mode 100644
>>>>>>>>>>> index 000000000000..589c0a009107
>>>>>>>>>>> --- /dev/null
>>>>>>>>>>> +++ b/Documentation/gpu/rfc/i915_vm_bind.h
>>>>>>>>>>> @@ -0,0 +1,399 @@
>>>>>>>>>>> +/* SPDX-License-Identifier: MIT */
>>>>>>>>>>> +/*
>>>>>>>>>>> + * Copyright © 2022 Intel Corporation
>>>>>>>>>>> + */
>>>>>>>>>>> +
>>>>>>>>>>> +/**
>>>>>>>>>>> + * DOC: I915_PARAM_HAS_VM_BIND
>>>>>>>>>>> + *
>>>>>>>>>>> + * VM_BIND feature availability.
>>>>>>>>>>> + * See typedef drm_i915_getparam_t param.
>>>>>>>>>>> + */
>>>>>>>>>>> +#define I915_PARAM_HAS_VM_BIND 57
>>>>>>>>>>> +
>>>>>>>>>>> +/**
>>>>>>>>>>> + * DOC: I915_VM_CREATE_FLAGS_USE_VM_BIND
>>>>>>>>>>> + *
>>>>>>>>>>> + * Flag to opt-in for VM_BIND mode of binding during VM creation.
>>>>>>>>>>> + * See struct drm_i915_gem_vm_control flags.
>>>>>>>>>>> + *
>>>>>>>>>>> + * A VM in VM_BIND mode will not support the older execbuff mode of binding.
>>>>>>>>>>> + * In VM_BIND mode, execbuff ioctl will not accept any execlist (ie., the
>>>>>>>>>>> + * &drm_i915_gem_execbuffer2.buffer_count must be 0).
>>>>>>>>>>> + * Also, &drm_i915_gem_execbuffer2.batch_start_offset and
>>>>>>>>>>> + * &drm_i915_gem_execbuffer2.batch_len must be 0.
>>>>>>>>>>> + * DRM_I915_GEM_EXECBUFFER_EXT_BATCH_ADDRESSES extension must be provided
>>>>>>>>>>> + * to pass in the batch buffer addresses.
>>>>>>>>>>> + *
>>>>>>>>>>> + * Additionally, I915_EXEC_NO_RELOC, I915_EXEC_HANDLE_LUT and
>>>>>>>>>>> + * I915_EXEC_BATCH_FIRST of &drm_i915_gem_execbuffer2.flags must be 0
>>>>>>>>>>> + * (not used) in VM_BIND mode. I915_EXEC_USE_EXTENSIONS flag must always be
>>>>>>>>>>> + * set (See struct drm_i915_gem_execbuffer_ext_batch_addresses).
>>>>>>>>>>> + * The buffers_ptr, buffer_count, batch_start_offset and batch_len fields
>>>>>>>>>>> + * of struct drm_i915_gem_execbuffer2 are also not used and must be 0.
>>>>>>>>>>> + */
>>>>>>>>>>
>>>>>>>>>>From that description, it seems we have:
>>>>>>>>>>
>>>>>>>>>>struct drm_i915_gem_execbuffer2 {
>>>>>>>>>>        __u64 buffers_ptr;              -> must be 0 (new)
>>>>>>>>>>        __u32 buffer_count;             -> must be 0 (new)
>>>>>>>>>>        __u32 batch_start_offset;       -> must be 0 (new)
>>>>>>>>>>        __u32 batch_len;                -> must be 0 (new)
>>>>>>>>>>        __u32 DR1;                      -> must be 0 (old)
>>>>>>>>>>        __u32 DR4;                      -> must be 0 (old)
>>>>>>>>>>        __u32 num_cliprects; (fences)   -> must be 0 since using extensions
>>>>>>>>>>        __u64 cliprects_ptr; (fences, extensions) -> contains an actual pointer!
>>>>>>>>>>        __u64 flags;                    -> some flags must be 0 (new)
>>>>>>>>>>        __u64 rsvd1; (context info)     -> repurposed field (old)
>>>>>>>>>>        __u64 rsvd2;                    -> unused
>>>>>>>>>>};
>>>>>>>>>>
>>>>>>>>>>Based on that, why can't we just get drm_i915_gem_execbuffer3 instead
>>>>>>>>>>of adding even more complexity to an already abused interface? While
>>>>>>>>>>the Vulkan-like extension thing is really nice, I don't think what
>>>>>>>>>>we're doing here is extending the ioctl usage, we're completely
>>>>>>>>>>changing how the base struct should be interpreted based on how the VM
>>>>>>>>>>was created (which is an entirely different ioctl).
>>>>>>>>>>
>>>>>>>>>>From Rusty Russell's API Design grading, drm_i915_gem_execbuffer2 is
>>>>>>>>>>already at -6 without these changes. I think after vm_bind we'll need
>>>>>>>>>>to create a -11 entry just to deal with this ioctl.
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>The only change here is removing the execlist support for VM_BIND
>>>>>>>>>mode (other than natural extensions).
>>>>>>>>>Adding a new execbuffer3 was considered, but I think we need to be
>>>>>>>>>careful with that as that goes beyond the VM_BIND support, including
>>>>>>>>>any future requirements (as we don't want an execbuffer4 after VM_BIND).
>>>>>>>>
>>>>>>>>Why not? it's not like adding extensions here is really 
>>>>>>>>that different
>>>>>>>>than adding new ioctls.
>>>>>>>>
>>>>>>>>I definitely think this deserves an execbuffer3 without even
>>>>>>>>considering future requirements. Just  to burn down the old
>>>>>>>>requirements and pointless fields.
>>>>>>>>
>>>>>>>>Make execbuffer3 be vm bind only, no relocs, no legacy 
>>>>>>>>bits, leave the
>>>>>>>>older sw on execbuf2 for ever.
>>>>>>>
>>>>>>>I guess another point in favour of execbuf3 would be that it's less
>>>>>>>midlayer. If we share the entry point then there's quite a few vfuncs
>>>>>>>needed to cleanly split out the vm_bind paths from the legacy
>>>>>>>reloc/softping paths.
>>>>>>>
>>>>>>>If we invert this and do execbuf3, then there's the existing ioctl
>>>>>>>vfunc, and then we share code (where it even makes sense, probably
>>>>>>>request setup/submit need to be shared, anything else is probably
>>>>>>>cleaner to just copypaste) with the usual helper approach.
>>>>>>>
>>>>>>>Also that would guarantee that really none of the old concepts like
>>>>>>>i915_active on the vma or vma open counts and all that stuff leaks
>>>>>>>into the new vm_bind execbuf.
>>>>>>>
>>>>>>>Finally I also think that copypasting would make backporting easier,
>>>>>>>or at least more flexible, since it should make it easier to have the
>>>>>>>upstream vm_bind co-exist with all the other things we have. Without
>>>>>>>huge amounts of conflicts (or at least much less) that pushing a pile
>>>>>>>of vfuncs into the existing code would cause.
>>>>>>>
>>>>>>>So maybe we should do this?
>>>>>>
>>>>>>Thanks Dave, Daniel.
>>>>>>There are a few things that will be common between execbuf2 and
>>>>>>execbuf3, like request setup/submit (as you said), fence 
>>>>>>handling (timeline fences, fence array, composite fences), 
>>>>>>engine selection,
>>>>>>etc. Also, many of the 'flags' will be there in execbuf3 also (but
>>>>>>bit position will differ).
>>>>>>But I guess these should be fine as the suggestion here is to
>>>>>>copy-paste the execbuff code and having a shared code where possible.
>>>>>>Besides, we can stop supporting some older feature in execbuff3
>>>>>>(like fence array in favor of newer timeline fences), which will
>>>>>>further reduce common code.
>>>>>>
>>>>>>Ok, I will update this series by adding execbuf3 and send out soon.
>>>>>>
>>>>>
>>>>>Does this sound reasonable?
>>>>
>>>>
>>>>Thanks for proposing this. Some comments below.
>>>>
>>>>
>>>>>
>>>>>struct drm_i915_gem_execbuffer3 {
>>>>>       __u32 ctx_id;        /* previously execbuffer2.rsvd1 */
>>>>>
>>>>>       __u32 batch_count;
>>>>>       __u64 batch_addr_ptr;    /* Pointer to an array of batch gpu virtual addresses */
>>>>>
>>>>>       __u64 flags;
>>>>>#define I915_EXEC3_RING_MASK              (0x3f)
>>>>>#define I915_EXEC3_DEFAULT                (0<<0)
>>>>>#define I915_EXEC3_RENDER                 (1<<0)
>>>>>#define I915_EXEC3_BSD                    (2<<0)
>>>>>#define I915_EXEC3_BLT                    (3<<0)
>>>>>#define I915_EXEC3_VEBOX                  (4<<0)
>>>>
>>>>
>>>>Shouldn't we use the new engine selection uAPI instead?
>>>>
>>>>We can already create an engine map with 
>>>>I915_CONTEXT_PARAM_ENGINES in 
>>>>drm_i915_gem_context_create_ext_setparam.
>>>>
>>>>And you can also create virtual engines with the same extension.
>>>>
>>>>It feels like this could be a single u32 with the engine index 
>>>>(in the context engine map).
>>>
>>>Yes I said the same yesterday.
>>>
>>>Also note that as you can't any longer set engines on a default 
>>>context, question is whether userspace cares to use execbuf3 with 
>>>it (default context).
>>>
>>>If it does, it will need an alternative engine selection for that 
>>>case. I was proposing class:instance rather than legacy cumbersome 
>>>flags.
>>>
>>>If it does not, I  mean if the decision is to only allow execbuf3 
>>>with engine maps, then it leaves the default context a waste of 
>>>kernel memory in the execbuf3 future. :( Don't know what to do 
>>>there..
>>>
>>>Regards,
>>>
>>>Tvrtko
>>
>>
>>Thanks Tvrtko, I only saw your reply after responding.
>>
>>
>>Both Iris & Anv create a context with engines (if kernel supports 
>>it) : https://gitlab.freedesktop.org/mesa/mesa/-/blob/main/src/intel/common/intel_gem.c#L73
>>
>>
>>I think we should be fine with just a single engine id and we don't 
>>care about the default context.
>
>I wonder if in this case we could stop creating the default context 
>starting from a future "gen"? Otherwise, with engine map only execbuf3 
>and execbuf3 only userspace, it would serve no purpose apart from 
>wasting kernel memory.
>

Thanks Tvrtko, Lionel.

I will be glad to remove these flags, just define a uint32 engine_id and
mandate a context with a user engine map.

Regarding removing the default context, yah, it depends on from which gen
onwards we will only support execbuf3 and execbuf2 is fully deprecated.
Till then, we will have to keep it, I guess :(.
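
For reference, a context with a user engine map can already be created with
the existing uapi; a sketch with a single render engine follows (execbuf3's
engine_id would then be index 0 into this map; error handling omitted):

	I915_DEFINE_CONTEXT_PARAM_ENGINES(engines, 1) = {
		.engines = { { .engine_class = I915_ENGINE_CLASS_RENDER,
			       .engine_instance = 0 } },
	};
	struct drm_i915_gem_context_create_ext_setparam p_engines = {
		.base = { .name = I915_CONTEXT_CREATE_EXT_SETPARAM },
		.param = {
			.param = I915_CONTEXT_PARAM_ENGINES,
			.value = (uint64_t)(uintptr_t)&engines,
			.size = sizeof(engines),
		},
	};
	struct drm_i915_gem_context_create_ext create = {
		.flags = I915_CONTEXT_CREATE_FLAGS_USE_EXTENSIONS,
		.extensions = (uint64_t)(uintptr_t)&p_engines,
	};

	ioctl(drm_fd, DRM_IOCTL_I915_GEM_CONTEXT_CREATE_EXT, &create);
	/* create.ctx_id now names a context whose engine index 0 is RCS0. */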

>Regards,
>
>Tvrtko
>
>>
>>
>>-Lionel
>>
>>
>>>
>>>>
>>>>
>>>>>
>>>>>#define I915_EXEC3_SECURE               (1<<6)
>>>>>#define I915_EXEC3_IS_PINNED            (1<<7)
>>>>
>>>>
>>>>What's the meaning of PINNED?
>>>>

This turned out to be a legacy use case. Will remove it.
execbuf3 will anyway only be supported when HAS_VM_BIND is true.
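
As an aside, a sketch of how userspace would gate all of this at runtime;
I915_PARAM_HAS_VM_BIND and I915_VM_CREATE_FLAGS_USE_VM_BIND are the values
proposed in this RFC, the rest is existing uapi:

	int has_vm_bind = 0;
	drm_i915_getparam_t gp = {
		.param = I915_PARAM_HAS_VM_BIND,
		.value = &has_vm_bind,
	};

	ioctl(drm_fd, DRM_IOCTL_I915_GETPARAM, &gp);
	if (has_vm_bind) {
		struct drm_i915_gem_vm_control vm = {
			/* Opt in to VM_BIND mode at VM creation. */
			.flags = I915_VM_CREATE_FLAGS_USE_VM_BIND,
		};
		ioctl(drm_fd, DRM_IOCTL_I915_GEM_VM_CREATE, &vm);
		/* vm.vm_id is a VM_BIND-mode ppGTT for execbuf3 to target. */
	}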

>>>>
>>>>>
>>>>>#define I915_EXEC3_BSD_SHIFT     (8)
>>>>>#define I915_EXEC3_BSD_MASK      (3 << I915_EXEC3_BSD_SHIFT)
>>>>>#define I915_EXEC3_BSD_DEFAULT   (0 << I915_EXEC3_BSD_SHIFT)
>>>>>#define I915_EXEC3_BSD_RING1     (1 << I915_EXEC3_BSD_SHIFT)
>>>>>#define I915_EXEC3_BSD_RING2     (2 << I915_EXEC3_BSD_SHIFT)
>>>>>
>>>>>#define I915_EXEC3_FENCE_IN             (1<<10)
>>>>>#define I915_EXEC3_FENCE_OUT            (1<<11)
>>>>
>>>>
>>>>For Mesa, as soon as we have 
>>>>DRM_I915_GEM_EXECBUFFER_EXT_TIMELINE_FENCES support, we only use 
>>>>that.
>>>>
>>>>So there isn't much point for FENCE_IN/OUT.
>>>>
>>>>Maybe check with other UMDs?
>>>>

Thanks, will remove it if other UMDs do not ask for it.

>>>>
>>>>>#define I915_EXEC3_FENCE_SUBMIT (1<<12)
>>>>
>>>>
>>>>What's FENCE_SUBMIT?
>>>>

This seems to be a mechanism to align request submissions together.
As per Tvrtko, a generic mechanism to align submissions was rejected.
So, if UMDs don't need it, we can remove it.

So, execbuf3 would look like this (if all UMDs agree):

struct drm_i915_gem_execbuffer3 {
       __u32 ctx_id;       /* previously execbuffer2.rsvd1 */
       __u32 engine_id;    /* previously 'execbuffer2.flags & I915_EXEC_RING_MASK' */

       __u32 rsvd1;        /* Reserved */
       __u32 batch_count;
       /* batch VA if batch_count=1, otherwise a pointer to an array of batch VAs */
       __u64 batch_address;

       __u64 flags;
#define I915_EXEC3_SECURE   (1<<0)

       __u64 rsvd2;        /* Reserved */
       __u64 extensions;   /* currently only for DRM_I915_GEM_EXECBUFFER_EXT_TIMELINE_FENCES */
};
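
For illustration, filling this in from userspace might look as below, with
the timeline fences chained as an extension per the preference stated next;
all struct and field names follow the proposal above and are not final uapi
(n_fences, fence_array, value_array and batch_va are assumed inputs):

	struct drm_i915_gem_execbuffer_ext_timeline_fences fences = {
		.base = { .name = DRM_I915_GEM_EXECBUFFER_EXT_TIMELINE_FENCES },
		.fence_count = n_fences,
		.handles_ptr = (uint64_t)(uintptr_t)fence_array,
		.values_ptr = (uint64_t)(uintptr_t)value_array,
	};
	struct drm_i915_gem_execbuffer3 eb3 = {
		.ctx_id = ctx_id,          /* context with a user engine map */
		.engine_id = 0,            /* index into that engine map */
		.batch_count = 1,
		.batch_address = batch_va, /* GPU VA, bound via vm_bind */
		.extensions = (uint64_t)(uintptr_t)&fences,
	};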

Also, I wondered whether we need to put the timeline fences in an extension
or directly in the drm_i915_gem_execbuffer3 struct. I prefer putting them in
an extension if they are not specified for all execbuff calls.
Any thoughts?

Niranjana

>>>>
>>>>>
>>>>>       __u64 in_out_fence;        /* previously execbuffer2.rsvd2 */
>>>>>
>>>>>       __u64 extensions;        /* currently only for 
>>>>>DRM_I915_GEM_EXECBUFFER_EXT_TIMELINE_FENCES */
>>>>>};
>>>>>
>>>>>With this, user can pass in batch addresses and count directly,
>>>>>instead of as an extension (as this rfc series was proposing).
>>>>>
>>>>>I have removed many of the flags which were either legacy or not
>>>>>applicable to BM_BIND mode.
>>>>>I have also removed fence array support (execbuffer2.cliprects_ptr)
>>>>>as we have timeline fence array support. Is that fine?
>>>>>Do we still need FENCE_IN/FENCE_OUT/FENCE_SUBMIT support?
>>>>>
>>>>>Any thing else needs to be added or removed?
>>>>>
>>>>>Niranjana
>>>>>
>>>>>>Niranjana
>>>>>>
>>>>>>>-Daniel
>>>>>>>-- 
>>>>>>>Daniel Vetter
>>>>>>>Software Engineer, Intel Corporation
>>>>>>>http://blog.ffwll.ch
>>>>
>>>>
>>

^ permalink raw reply	[flat|nested] 121+ messages in thread

>>>>>>>>>drm_i915_gem_execbuffer3 instead
>>>>>>>>>>of adding even more complexity to an already abused 
>>>>>>>>>interface? While
>>>>>>>>>>the Vulkan-like extension thing is really nice, I don't think what
>>>>>>>>>>we're doing here is extending the ioctl usage, we're completely
>>>>>>>>>>changing how the base struct should be interpreted 
>>>>>>>>>based on how the VM
>>>>>>>>>>was created (which is an entirely different ioctl).
>>>>>>>>>>
>>>>>>>>>>From Rusty Russell's API Design grading, 
>>>>>>>>>drm_i915_gem_execbuffer2 is
>>>>>>>>>>already at -6 without these changes. I think after 
>>>>>>>>>vm_bind we'll need
>>>>>>>>>>to create a -11 entry just to deal with this ioctl.
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>The only change here is removing the execlist support for VM_BIND
>>>>>>>>>mode (other than natural extensions).
>>>>>>>>>Adding a new execbuffer3 was considered, but I think 
>>>>>>>>>we need to be careful
>>>>>>>>>with that as that goes beyond the VM_BIND support, 
>>>>>>>>>including any future
>>>>>>>>>requirements (as we don't want an execbuffer4 after VM_BIND).
>>>>>>>>
>>>>>>>>Why not? it's not like adding extensions here is really 
>>>>>>>>that different
>>>>>>>>than adding new ioctls.
>>>>>>>>
>>>>>>>>I definitely think this deserves an execbuffer3 without even
>>>>>>>>considering future requirements. Just to burn down the old
>>>>>>>>requirements and pointless fields.
>>>>>>>>
>>>>>>>>Make execbuffer3 be vm bind only, no relocs, no legacy 
>>>>>>>>bits, leave the
>>>>>>>>older sw on execbuf2 for ever.
>>>>>>>
>>>>>>>I guess another point in favour of execbuf3 would be that it's less
>>>>>>>midlayer. If we share the entry point then there's quite a few vfuncs
>>>>>>>needed to cleanly split out the vm_bind paths from the legacy
>>>>>>>reloc/softpin paths.
>>>>>>>
>>>>>>>If we invert this and do execbuf3, then there's the existing ioctl
>>>>>>>vfunc, and then we share code (where it even makes sense, probably
>>>>>>>request setup/submit need to be shared, anything else is probably
>>>>>>>cleaner to just copypaste) with the usual helper approach.
>>>>>>>
>>>>>>>Also that would guarantee that really none of the old concepts like
>>>>>>>i915_active on the vma or vma open counts and all that stuff leaks
>>>>>>>into the new vm_bind execbuf.
>>>>>>>
>>>>>>>Finally I also think that copypasting would make backporting easier,
>>>>>>>or at least more flexible, since it should make it easier to have the
>>>>>>>upstream vm_bind co-exist with all the other things we have. Without
>>>>>>>huge amounts of conflicts (or at least much less) that pushing a pile
>>>>>>>of vfuncs into the existing code would cause.
>>>>>>>
>>>>>>>So maybe we should do this?
>>>>>>
>>>>>>Thanks Dave, Daniel.
>>>>>>There are a few things that will be common between execbuf2 and
>>>>>>execbuf3, like request setup/submit (as you said), fence 
>>>>>>handling (timeline fences, fence array, composite fences), 
>>>>>>engine selection,
>>>>>>etc. Also, many of the 'flags' will be there in execbuf3 also (but
>>>>>>bit position will differ).
>>>>>>But I guess these should be fine as the suggestion here is to
>>>>>>copy-paste the execbuff code and have shared code where possible.
>>>>>>Besides, we can stop supporting some older features in execbuf3
>>>>>>(like fence array in favor of newer timeline fences), which will
>>>>>>further reduce common code.
>>>>>>
>>>>>>Ok, I will update this series by adding execbuf3 and send out soon.
>>>>>>
>>>>>
>>>>>Does this sound reasonable?
>>>>
>>>>
>>>>Thanks for proposing this. Some comments below.
>>>>
>>>>
>>>>>
>>>>>struct drm_i915_gem_execbuffer3 {
>>>>>       __u32 ctx_id;        /* previously execbuffer2.rsvd1 */
>>>>>
>>>>>       __u32 batch_count;
>>>>>       __u64 batch_addr_ptr;    /* Pointer to an array of 
>>>>>batch gpu virtual addresses */
>>>>>
>>>>>       __u64 flags;
>>>>>#define I915_EXEC3_RING_MASK              (0x3f)
>>>>>#define I915_EXEC3_DEFAULT                (0<<0)
>>>>>#define I915_EXEC3_RENDER                 (1<<0)
>>>>>#define I915_EXEC3_BSD                    (2<<0)
>>>>>#define I915_EXEC3_BLT                    (3<<0)
>>>>>#define I915_EXEC3_VEBOX                  (4<<0)
>>>>
>>>>
>>>>Shouldn't we use the new engine selection uAPI instead?
>>>>
>>>>We can already create an engine map with 
>>>>I915_CONTEXT_PARAM_ENGINES in 
>>>>drm_i915_gem_context_create_ext_setparam.
>>>>
>>>>And you can also create virtual engines with the same extension.
>>>>
>>>>It feels like this could be a single u32 with the engine index 
>>>>(in the context engine map).
>>>
>>>Yes I said the same yesterday.
>>>
>>>Also note that as you can no longer set engines on a default 
>>>context, the question is whether userspace cares to use execbuf3 with 
>>>it (the default context).
>>>
>>>If it does, it will need an alternative engine selection for that 
>>>case. I was proposing class:instance rather than legacy cumbersome 
>>>flags.
>>>
>>>If it does not, I mean if the decision is to only allow execbuf3 
>>>with engine maps, then it leaves the default context a waste of 
>>>kernel memory in the execbuf3 future. :( Don't know what to do 
>>>there...
>>>
>>>Regards,
>>>
>>>Tvrtko
>>
>>
>>Thanks Tvrtko, I only saw your reply after responding.
>>
>>
>>Both Iris & Anv create a context with engines (if kernel supports 
>>it) : https://gitlab.freedesktop.org/mesa/mesa/-/blob/main/src/intel/common/intel_gem.c#L73
>>
>>
>>I think we should be fine with just a single engine id and we don't 
>>care about the default context.
>
>I wonder if in this case we could stop creating the default context 
>starting from a future "gen"? Otherwise, with engine-map-only execbuf3 
>and execbuf3-only userspace, it would serve no purpose apart from 
>wasting kernel memory.
>

Thanks Tvrtko, Lionel.

I will be glad to remove these flags, just define a uint32 engine_id, and
mandate a context with a user engines map.

Regarding removing the default context: yeah, it depends on the gen from
which onwards we will only support execbuf3, with execbuf2 fully
deprecated. Till then, we will have to keep it, I guess :(.
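
For reference, a minimal sketch of creating a context with an explicit
engines map, which the proposed engine_id would then index into, could
look like the following (using the existing I915_CONTEXT_PARAM_ENGINES
uapi; error handling omitted):

#include <stdint.h>
#include <sys/ioctl.h>
#include <drm/i915_drm.h>

static uint32_t create_ctx_with_engines(int fd)
{
	/* Two-slot engine map: engine_id 0 -> RCS0, engine_id 1 -> BCS0 */
	I915_DEFINE_CONTEXT_PARAM_ENGINES(engine_map, 2) = {
		.engines = {
			{ .engine_class = I915_ENGINE_CLASS_RENDER,
			  .engine_instance = 0 },
			{ .engine_class = I915_ENGINE_CLASS_COPY,
			  .engine_instance = 0 },
		},
	};
	struct drm_i915_gem_context_create_ext_setparam p_engines = {
		.base = { .name = I915_CONTEXT_CREATE_EXT_SETPARAM },
		.param = {
			.param = I915_CONTEXT_PARAM_ENGINES,
			.value = (uintptr_t)&engine_map,
			.size = sizeof(engine_map),
		},
	};
	struct drm_i915_gem_context_create_ext create = {
		.flags = I915_CONTEXT_CREATE_FLAGS_USE_EXTENSIONS,
		.extensions = (uintptr_t)&p_engines,
	};

	ioctl(fd, DRM_IOCTL_I915_GEM_CONTEXT_CREATE_EXT, &create);
	return create.ctx_id;
}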

>Regards,
>
>Tvrtko
>
>>
>>
>>-Lionel
>>
>>
>>>
>>>>
>>>>
>>>>>
>>>>>#define I915_EXEC3_SECURE               (1<<6)
>>>>>#define I915_EXEC3_IS_PINNED            (1<<7)
>>>>
>>>>
>>>>What's the meaning of PINNED?
>>>>

This turned out to be a legacy use case. Will remove it.
execbuf3 will anyway only be supported when HAS_VM_BIND is true.
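
For illustration, the availability check on the UMD side would presumably
be the usual getparam dance (a sketch; I915_PARAM_HAS_VM_BIND is the value
proposed in this RFC, not merged uapi):

#include <stdbool.h>
#include <sys/ioctl.h>
#include <drm/i915_drm.h>

/* Sketch: probe for VM_BIND support before opting a VM into it. */
static bool has_vm_bind(int fd)
{
	int value = 0;
	struct drm_i915_getparam gp = {
		.param = I915_PARAM_HAS_VM_BIND, /* proposed value: 57 */
		.value = &value,
	};

	if (ioctl(fd, DRM_IOCTL_I915_GETPARAM, &gp))
		return false;
	return value != 0;
}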

>>>>
>>>>>
>>>>>#define I915_EXEC3_BSD_SHIFT     (8)
>>>>>#define I915_EXEC3_BSD_MASK      (3 << I915_EXEC3_BSD_SHIFT)
>>>>>#define I915_EXEC3_BSD_DEFAULT   (0 << I915_EXEC3_BSD_SHIFT)
>>>>>#define I915_EXEC3_BSD_RING1     (1 << I915_EXEC3_BSD_SHIFT)
>>>>>#define I915_EXEC3_BSD_RING2     (2 << I915_EXEC3_BSD_SHIFT)
>>>>>
>>>>>#define I915_EXEC3_FENCE_IN             (1<<10)
>>>>>#define I915_EXEC3_FENCE_OUT            (1<<11)
>>>>
>>>>
>>>>For Mesa, as soon as we have 
>>>>DRM_I915_GEM_EXECBUFFER_EXT_TIMELINE_FENCES support, we only use 
>>>>that.
>>>>
>>>>So there isn't much point for FENCE_IN/OUT.
>>>>
>>>>Maybe check with other UMDs?
>>>>

Thanks, will remove it if other UMDs do not ask for it.

>>>>
>>>>>#define I915_EXEC3_FENCE_SUBMIT (1<<12)
>>>>
>>>>
>>>>What's FENCE_SUBMIT?
>>>>

This seems to be a mechanism to align request submissions together.
As per Tvrtko, a generic mechanism to align submissions was rejected.
So, if UMDs don't need it, we can remove it.

So, execbuf3 would look like (if all UMDs agree),

struct drm_i915_gem_execbuffer3 {
       __u32 ctx_id;       /* previously execbuffer2.rsvd1 */
       __u32 engine_id;    /* previously 'execbuffer2.flags & I915_EXEC_RING_MASK' */

       __u32 rsvd1;        /* Reserved */
       __u32 batch_count;
       /* batch VA if batch_count=1, otherwise a pointer to an array of batch VAs */
       __u64 batch_address;

       __u64 flags;
#define I915_EXEC3_SECURE   (1<<0)

       __u64 rsvd2;        /* Reserved */
       __u64 extensions;   /* currently only for DRM_I915_GEM_EXECBUFFER_EXT_TIMELINE_FENCES */
};

Also, I wondered whether we should put the timeline fences in an extension
or directly in the drm_i915_gem_execbuffer3 struct.
I prefer putting them in an extension if they are not specified for all execbuf calls.
Any thoughts?
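
To make it concrete, a single-batch submission with the timeline fences
chained in as an extension might look like the sketch below. All execbuf3
names follow the proposal above and are not final uapi (the ioctl number in
particular is hypothetical); drm_i915_gem_execbuffer_ext_timeline_fences
reuses today's definition from i915_drm.h:

#include <stdint.h>
#include <sys/ioctl.h>
#include <drm/i915_drm.h>

/* Sketch only: struct drm_i915_gem_execbuffer3 and
 * DRM_IOCTL_I915_GEM_EXECBUFFER3 are assumed to be declared as per the
 * proposal above. 'syncobj' is a drm_syncobj handle, 'point' a timeline
 * point to signal on completion.
 */
static int submit_one_batch(int fd, uint32_t ctx_id, uint64_t batch_va,
			    uint32_t syncobj, uint64_t point)
{
	struct drm_i915_gem_exec_fence fence = {
		.handle = syncobj,
		.flags = I915_EXEC_FENCE_SIGNAL,
	};
	__u64 point_value = point;
	struct drm_i915_gem_execbuffer_ext_timeline_fences fences = {
		.base = { .name = DRM_I915_GEM_EXECBUFFER_EXT_TIMELINE_FENCES },
		.fence_count = 1,
		.handles_ptr = (uintptr_t)&fence,
		.values_ptr = (uintptr_t)&point_value,
	};
	struct drm_i915_gem_execbuffer3 eb3 = {
		.ctx_id = ctx_id,
		.engine_id = 0,            /* index into the context engine map */
		.batch_count = 1,
		.batch_address = batch_va, /* a VA directly, as batch_count == 1 */
		.extensions = (uintptr_t)&fences,
	};

	return ioctl(fd, DRM_IOCTL_I915_GEM_EXECBUFFER3, &eb3); /* hypothetical */
}

Keeping the fences in an extension would also keep the common path lean
for submissions that don't use them.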

Niranjana

>>>>
>>>>>
>>>>>       __u64 in_out_fence;        /* previously execbuffer2.rsvd2 */
>>>>>
>>>>>       __u64 extensions;        /* currently only for 
>>>>>DRM_I915_GEM_EXECBUFFER_EXT_TIMELINE_FENCES */
>>>>>};
>>>>>
>>>>>With this, user can pass in batch addresses and count directly,
>>>>>instead of as an extension (as this rfc series was proposing).
>>>>>
>>>>>I have removed many of the flags which were either legacy or not
>>>>>applicable to VM_BIND mode.
>>>>>I have also removed fence array support (execbuffer2.cliprects_ptr)
>>>>>as we have timeline fence array support. Is that fine?
>>>>>Do we still need FENCE_IN/FENCE_OUT/FENCE_SUBMIT support?
>>>>>
>>>>>Anything else that needs to be added or removed?
>>>>>
>>>>>Niranjana
>>>>>
>>>>>>Niranjana
>>>>>>
>>>>>>>-Daniel
>>>>>>>-- 
>>>>>>>Daniel Vetter
>>>>>>>Software Engineer, Intel Corporation
>>>>>>>http://blog.ffwll.ch
>>>>
>>>>
>>

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [Intel-gfx] [RFC v3 3/3] drm/doc/rfc: VM_BIND uapi definition
  2022-06-08  7:12               ` Lionel Landwerlin
@ 2022-06-08 21:24                   ` Matthew Brost
  0 siblings, 0 replies; 121+ messages in thread
From: Matthew Brost @ 2022-06-08 21:24 UTC (permalink / raw)
  To: Lionel Landwerlin
  Cc: Zanoni, Paulo R, intel-gfx, dri-devel, Hellstrom, Thomas, Wilson,
	Chris P, Vetter, Daniel, Niranjana Vishwanathapura,
	christian.koenig

On Wed, Jun 08, 2022 at 10:12:45AM +0300, Lionel Landwerlin wrote:
> On 03/06/2022 09:53, Niranjana Vishwanathapura wrote:
> > On Wed, Jun 01, 2022 at 10:08:35PM -0700, Niranjana Vishwanathapura
> > wrote:
> > > On Wed, Jun 01, 2022 at 11:27:17AM +0200, Daniel Vetter wrote:
> > > > On Wed, 1 Jun 2022 at 11:03, Dave Airlie <airlied@gmail.com> wrote:
> > > > > 
> > > > > On Tue, 24 May 2022 at 05:20, Niranjana Vishwanathapura
> > > > > <niranjana.vishwanathapura@intel.com> wrote:
> > > > > > 
> > > > > > On Thu, May 19, 2022 at 04:07:30PM -0700, Zanoni, Paulo R wrote:
> > > > > > >On Tue, 2022-05-17 at 11:32 -0700, Niranjana Vishwanathapura wrote:
> > > > > > >> VM_BIND and related uapi definitions
> > > > > > >>
> > > > > > >> v2: Ensure proper kernel-doc formatting with cross references.
> > > > > > >>     Also add new uapi and documentation as per review comments
> > > > > > >>     from Daniel.
> > > > > > >>
> > > > > > >> Signed-off-by: Niranjana Vishwanathapura
> > > > > > <niranjana.vishwanathapura@intel.com>
> > > > > > >> ---
> > > > > > >>  Documentation/gpu/rfc/i915_vm_bind.h | 399
> > > > > > +++++++++++++++++++++++++++
> > > > > > >>  1 file changed, 399 insertions(+)
> > > > > > >>  create mode 100644 Documentation/gpu/rfc/i915_vm_bind.h
> > > > > > >>
> > > > > > >> diff --git a/Documentation/gpu/rfc/i915_vm_bind.h
> > > > > > b/Documentation/gpu/rfc/i915_vm_bind.h
> > > > > > >> new file mode 100644
> > > > > > >> index 000000000000..589c0a009107
> > > > > > >> --- /dev/null
> > > > > > >> +++ b/Documentation/gpu/rfc/i915_vm_bind.h
> > > > > > >> @@ -0,0 +1,399 @@
> > > > > > >> +/* SPDX-License-Identifier: MIT */
> > > > > > >> +/*
> > > > > > >> + * Copyright © 2022 Intel Corporation
> > > > > > >> + */
> > > > > > >> +
> > > > > > >> +/**
> > > > > > >> + * DOC: I915_PARAM_HAS_VM_BIND
> > > > > > >> + *
> > > > > > >> + * VM_BIND feature availability.
> > > > > > >> + * See typedef drm_i915_getparam_t param.
> > > > > > >> + */
> > > > > > >> +#define I915_PARAM_HAS_VM_BIND               57
> > > > > > >> +
> > > > > > >> +/**
> > > > > > >> + * DOC: I915_VM_CREATE_FLAGS_USE_VM_BIND
> > > > > > >> + *
> > > > > > >> + * Flag to opt-in for VM_BIND mode of binding during VM creation.
> > > > > > >> + * See struct drm_i915_gem_vm_control flags.
> > > > > > >> + *
> > > > > > >> + * A VM in VM_BIND mode will not support the older
> > > > > > execbuff mode of binding.
> > > > > > >> + * In VM_BIND mode, execbuff ioctl will not accept
> > > > > > any execlist (ie., the
> > > > > > >> + * &drm_i915_gem_execbuffer2.buffer_count must be 0).
> > > > > > >> + * Also, &drm_i915_gem_execbuffer2.batch_start_offset and
> > > > > > >> + * &drm_i915_gem_execbuffer2.batch_len must be 0.
> > > > > > >> + * DRM_I915_GEM_EXECBUFFER_EXT_BATCH_ADDRESSES
> > > > > > extension must be provided
> > > > > > >> + * to pass in the batch buffer addresses.
> > > > > > >> + *
> > > > > > >> + * Additionally, I915_EXEC_NO_RELOC, I915_EXEC_HANDLE_LUT and
> > > > > > >> + * I915_EXEC_BATCH_FIRST of
> > > > > > &drm_i915_gem_execbuffer2.flags must be 0
> > > > > > >> + * (not used) in VM_BIND mode.
> > > > > > I915_EXEC_USE_EXTENSIONS flag must always be
> > > > > > >> + * set (See struct drm_i915_gem_execbuffer_ext_batch_addresses).
> > > > > > >> + * The buffers_ptr, buffer_count, batch_start_offset
> > > > > > and batch_len fields
> > > > > > >> + * of struct drm_i915_gem_execbuffer2 are also not
> > > > > > used and must be 0.
> > > > > > >> + */
> > > > > > >
> > > > > > >From that description, it seems we have:
> > > > > > >
> > > > > > >struct drm_i915_gem_execbuffer2 {
> > > > > > >        __u64 buffers_ptr;              -> must be 0 (new)
> > > > > > >        __u32 buffer_count;             -> must be 0 (new)
> > > > > > >        __u32 batch_start_offset;       -> must be 0 (new)
> > > > > > >        __u32 batch_len;                -> must be 0 (new)
> > > > > > >        __u32 DR1;                      -> must be 0 (old)
> > > > > > >        __u32 DR4;                      -> must be 0 (old)
> > > > > > >        __u32 num_cliprects; (fences)   -> must be 0
> > > > > > since using extensions
> > > > > > >        __u64 cliprects_ptr; (fences, extensions) ->
> > > > > > contains an actual pointer!
> > > > > > >        __u64 flags;                    -> some flags
> > > > > > must be 0 (new)
> > > > > > >        __u64 rsvd1; (context info)     -> repurposed field (old)
> > > > > > >        __u64 rsvd2;                    -> unused
> > > > > > >};
> > > > > > >
> > > > > > >Based on that, why can't we just get
> > > > > > drm_i915_gem_execbuffer3 instead
> > > > > > >of adding even more complexity to an already abused interface? While
> > > > > > >the Vulkan-like extension thing is really nice, I don't think what
> > > > > > >we're doing here is extending the ioctl usage, we're completely
> > > > > > >changing how the base struct should be interpreted
> > > > > > based on how the VM
> > > > > > >was created (which is an entirely different ioctl).
> > > > > > >
> > > > > > >From Rusty Russell's API Design grading, drm_i915_gem_execbuffer2 is
> > > > > > >already at -6 without these changes. I think after
> > > > > > vm_bind we'll need
> > > > > > >to create a -11 entry just to deal with this ioctl.
> > > > > > >
> > > > > > 
> > > > > > The only change here is removing the execlist support for VM_BIND
> > > > > > mode (other than natural extensions).
> > > > > > Adding a new execbuffer3 was considered, but I think we
> > > > > > need to be careful
> > > > > > with that as that goes beyond the VM_BIND support,
> > > > > > including any future
> > > > > > requirements (as we don't want an execbuffer4 after VM_BIND).
> > > > > 
> > > > > Why not? it's not like adding extensions here is really that different
> > > > > than adding new ioctls.
> > > > > 
> > > > > I definitely think this deserves an execbuffer3 without even
> > > > > considering future requirements. Just to burn down the old
> > > > > requirements and pointless fields.
> > > > > 
> > > > > Make execbuffer3 be vm bind only, no relocs, no legacy bits, leave the
> > > > > older sw on execbuf2 for ever.
> > > > 
> > > > I guess another point in favour of execbuf3 would be that it's less
> > > > midlayer. If we share the entry point then there's quite a few vfuncs
> > > > needed to cleanly split out the vm_bind paths from the legacy
> > > > reloc/softpin paths.
> > > > 
> > > > If we invert this and do execbuf3, then there's the existing ioctl
> > > > vfunc, and then we share code (where it even makes sense, probably
> > > > request setup/submit need to be shared, anything else is probably
> > > > cleaner to just copypaste) with the usual helper approach.
> > > > 
> > > > Also that would guarantee that really none of the old concepts like
> > > > i915_active on the vma or vma open counts and all that stuff leaks
> > > > into the new vm_bind execbuf.
> > > > 
> > > > Finally I also think that copypasting would make backporting easier,
> > > > or at least more flexible, since it should make it easier to have the
> > > > upstream vm_bind co-exist with all the other things we have. Without
> > > > huge amounts of conflicts (or at least much less) that pushing a pile
> > > > of vfuncs into the existing code would cause.
> > > > 
> > > > So maybe we should do this?
> > > 
> > > Thanks Dave, Daniel.
> > > There are a few things that will be common between execbuf2 and
> > > execbuf3, like request setup/submit (as you said), fence handling
> > > (timeline fences, fence array, composite fences), engine selection,
> > > etc. Also, many of the 'flags' will be there in execbuf3 also (but
> > > bit position will differ).
> > > But I guess these should be fine as the suggestion here is to
> > > copy-paste the execbuff code and have shared code where possible.
> > > Besides, we can stop supporting some older features in execbuf3
> > > (like fence array in favor of newer timeline fences), which will
> > > further reduce common code.
> > > 
> > > Ok, I will update this series by adding execbuf3 and send out soon.
> > > 
> > 
> > Does this sound reasonable?
> > 
> > struct drm_i915_gem_execbuffer3 {
> >        __u32 ctx_id;        /* previously execbuffer2.rsvd1 */
> > 
> >        __u32 batch_count;
> >        __u64 batch_addr_ptr;    /* Pointer to an array of batch gpu
> > virtual addresses */
> 
> 
> Quick question raised on IRC about the batches: Are multiple batches
> limited to virtual engines?
> 

Parallel engines, see i915_context_engines_parallel_submit in i915_drm.h.

Currently the media UMD uses this uAPI to do split-frame decoding (e.g. run
multiple batches in parallel on the video engines to decode an 8K frame).

Of course there could be future users of this uAPI too.
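
For reference, the engines-map side of that looks roughly like the sketch
below (based on i915_context_engines_parallel_submit in i915_drm.h; the
width/sibling values are illustrative only). The extension is chained into
i915_context_param_engines.extensions at context creation:

/* Sketch: a two-wide parallel engine, e.g. two VCS instances running
 * two batches of one 8K decode together.
 */
I915_DEFINE_CONTEXT_ENGINES_PARALLEL_SUBMIT(parallel, 2) = {
	.base = { .name = I915_CONTEXT_ENGINES_EXT_PARALLEL_SUBMIT },
	.engine_index = 0,  /* engine-map slot that becomes parallel */
	.width = 2,         /* batches submitted together */
	.num_siblings = 1,  /* physical engines per slot */
	.engines = {
		{ .engine_class = I915_ENGINE_CLASS_VIDEO,
		  .engine_instance = 0 },
		{ .engine_class = I915_ENGINE_CLASS_VIDEO,
		  .engine_instance = 1 },
	},
};

With execbuf3, batch_count would then presumably be required to match the
configured width, one batch VA per parallel slot.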

Matt

> 
> Thanks,
> 
> 
> -Lionel
> 
> 
> > 
> >        __u64 flags;
> > #define I915_EXEC3_RING_MASK              (0x3f)
> > #define I915_EXEC3_DEFAULT                (0<<0)
> > #define I915_EXEC3_RENDER                 (1<<0)
> > #define I915_EXEC3_BSD                    (2<<0)
> > #define I915_EXEC3_BLT                    (3<<0)
> > #define I915_EXEC3_VEBOX                  (4<<0)
> > 
> > #define I915_EXEC3_SECURE               (1<<6)
> > #define I915_EXEC3_IS_PINNED            (1<<7)
> > 
> > #define I915_EXEC3_BSD_SHIFT     (8)
> > #define I915_EXEC3_BSD_MASK      (3 << I915_EXEC3_BSD_SHIFT)
> > #define I915_EXEC3_BSD_DEFAULT   (0 << I915_EXEC3_BSD_SHIFT)
> > #define I915_EXEC3_BSD_RING1     (1 << I915_EXEC3_BSD_SHIFT)
> > #define I915_EXEC3_BSD_RING2     (2 << I915_EXEC3_BSD_SHIFT)
> > 
> > #define I915_EXEC3_FENCE_IN             (1<<10)
> > #define I915_EXEC3_FENCE_OUT            (1<<11)
> > #define I915_EXEC3_FENCE_SUBMIT         (1<<12)
> > 
> >        __u64 in_out_fence;        /* previously execbuffer2.rsvd2 */
> > 
> >        __u64 extensions;        /* currently only for
> > DRM_I915_GEM_EXECBUFFER_EXT_TIMELINE_FENCES */
> > };
> > 
> > With this, user can pass in batch addresses and count directly,
> > instead of as an extension (as this rfc series was proposing).
> > 
> > I have removed many of the flags which were either legacy or not
> > applicable to VM_BIND mode.
> > I have also removed fence array support (execbuffer2.cliprects_ptr)
> > as we have timeline fence array support. Is that fine?
> > Do we still need FENCE_IN/FENCE_OUT/FENCE_SUBMIT support?
> > 
> > Anything else that needs to be added or removed?
> > 
> > Niranjana
> > 
> > > Niranjana
> > > 
> > > > -Daniel
> > > > -- 
> > > > Daniel Vetter
> > > > Software Engineer, Intel Corporation
> > > > http://blog.ffwll.ch
> 
> 

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [Intel-gfx] [RFC v3 3/3] drm/doc/rfc: VM_BIND uapi definition
  2022-06-08  9:12         ` Matthew Auld
@ 2022-06-08 21:32             ` Niranjana Vishwanathapura
  0 siblings, 0 replies; 121+ messages in thread
From: Niranjana Vishwanathapura @ 2022-06-08 21:32 UTC (permalink / raw)
  To: Matthew Auld
  Cc: Tvrtko Ursulin, intel-gfx, chris.p.wilson, thomas.hellstrom,
	dri-devel, daniel.vetter, christian.koenig

On Wed, Jun 08, 2022 at 10:12:05AM +0100, Matthew Auld wrote:
>On 08/06/2022 08:17, Tvrtko Ursulin wrote:
>>
>>On 07/06/2022 20:37, Niranjana Vishwanathapura wrote:
>>>On Tue, Jun 07, 2022 at 11:27:14AM +0100, Tvrtko Ursulin wrote:
>>>>
>>>>On 17/05/2022 19:32, Niranjana Vishwanathapura wrote:
>>>>>VM_BIND and related uapi definitions
>>>>>
>>>>>v2: Ensure proper kernel-doc formatting with cross references.
>>>>>    Also add new uapi and documentation as per review comments
>>>>>    from Daniel.
>>>>>
>>>>>Signed-off-by: Niranjana Vishwanathapura 
>>>>><niranjana.vishwanathapura@intel.com>
>>>>>---
>>>>> Documentation/gpu/rfc/i915_vm_bind.h | 399 +++++++++++++++++++++++++++
>>>>> 1 file changed, 399 insertions(+)
>>>>> create mode 100644 Documentation/gpu/rfc/i915_vm_bind.h
>>>>>
>>>>>diff --git a/Documentation/gpu/rfc/i915_vm_bind.h 
>>>>>b/Documentation/gpu/rfc/i915_vm_bind.h
>>>>>new file mode 100644
>>>>>index 000000000000..589c0a009107
>>>>>--- /dev/null
>>>>>+++ b/Documentation/gpu/rfc/i915_vm_bind.h
>>>>>@@ -0,0 +1,399 @@
>>>>>+/* SPDX-License-Identifier: MIT */
>>>>>+/*
>>>>>+ * Copyright © 2022 Intel Corporation
>>>>>+ */
>>>>>+
>>>>>+/**
>>>>>+ * DOC: I915_PARAM_HAS_VM_BIND
>>>>>+ *
>>>>>+ * VM_BIND feature availability.
>>>>>+ * See typedef drm_i915_getparam_t param.
>>>>>+ */
>>>>>+#define I915_PARAM_HAS_VM_BIND        57
>>>>>+
>>>>>+/**
>>>>>+ * DOC: I915_VM_CREATE_FLAGS_USE_VM_BIND
>>>>>+ *
>>>>>+ * Flag to opt-in for VM_BIND mode of binding during VM creation.
>>>>>+ * See struct drm_i915_gem_vm_control flags.
>>>>>+ *
>>>>>+ * A VM in VM_BIND mode will not support the older execbuff 
>>>>>mode of binding.
>>>>>+ * In VM_BIND mode, execbuff ioctl will not accept any 
>>>>>execlist (ie., the
>>>>>+ * &drm_i915_gem_execbuffer2.buffer_count must be 0).
>>>>>+ * Also, &drm_i915_gem_execbuffer2.batch_start_offset and
>>>>>+ * &drm_i915_gem_execbuffer2.batch_len must be 0.
>>>>>+ * DRM_I915_GEM_EXECBUFFER_EXT_BATCH_ADDRESSES extension must 
>>>>>be provided
>>>>>+ * to pass in the batch buffer addresses.
>>>>>+ *
>>>>>+ * Additionally, I915_EXEC_NO_RELOC, I915_EXEC_HANDLE_LUT and
>>>>>+ * I915_EXEC_BATCH_FIRST of &drm_i915_gem_execbuffer2.flags must be 0
>>>>>+ * (not used) in VM_BIND mode. I915_EXEC_USE_EXTENSIONS flag 
>>>>>must always be
>>>>>+ * set (See struct drm_i915_gem_execbuffer_ext_batch_addresses).
>>>>>+ * The buffers_ptr, buffer_count, batch_start_offset and 
>>>>>batch_len fields
>>>>>+ * of struct drm_i915_gem_execbuffer2 are also not used and must be 0.
>>>>>+ */
>>>>>+#define I915_VM_CREATE_FLAGS_USE_VM_BIND    (1 << 0)
>>>>>+
>>>>>+/**
>>>>>+ * DOC: I915_CONTEXT_CREATE_FLAGS_LONG_RUNNING
>>>>>+ *
>>>>>+ * Flag to declare context as long running.
>>>>>+ * See struct drm_i915_gem_context_create_ext flags.
>>>>>+ *
>>>>>+ * Usage of dma-fence expects that they complete in 
>>>>>reasonable amount of time.
>>>>>+ * Compute on the other hand can be long running. Hence it is 
>>>>>not appropriate
>>>>>+ * for compute contexts to export request completion 
>>>>>dma-fence to user.
>>>>>+ * The dma-fence usage will be limited to in-kernel consumption only.
>>>>>+ * Compute contexts need to use user/memory fence.
>>>>>+ *
>>>>>+ * So, long running contexts do not support output fences. Hence,
>>>>>+ * I915_EXEC_FENCE_OUT (See &drm_i915_gem_execbuffer2.flags and
>>>>>+ * I915_EXEC_FENCE_SIGNAL (See 
>>>>>&drm_i915_gem_exec_fence.flags) are expected
>>>>>+ * to be not used.
>>>>>+ *
>>>>>+ * DRM_I915_GEM_WAIT ioctl call is also not supported for 
>>>>>objects mapped
>>>>>+ * to long running contexts.
>>>>>+ */
>>>>>+#define I915_CONTEXT_CREATE_FLAGS_LONG_RUNNING   (1u << 2)
>>>>>+
>>>>>+/* VM_BIND related ioctls */
>>>>>+#define DRM_I915_GEM_VM_BIND        0x3d
>>>>>+#define DRM_I915_GEM_VM_UNBIND        0x3e
>>>>>+#define DRM_I915_GEM_WAIT_USER_FENCE    0x3f
>>>>>+
>>>>>+#define DRM_IOCTL_I915_GEM_VM_BIND        
>>>>>DRM_IOWR(DRM_COMMAND_BASE + DRM_I915_GEM_VM_BIND, struct 
>>>>>drm_i915_gem_vm_bind)
>>>>>+#define DRM_IOCTL_I915_GEM_VM_UNBIND 
>>>>>DRM_IOWR(DRM_COMMAND_BASE + DRM_I915_GEM_VM_UNBIND, struct 
>>>>>drm_i915_gem_vm_bind)
>>>>>+#define DRM_IOCTL_I915_GEM_WAIT_USER_FENCE 
>>>>>DRM_IOWR(DRM_COMMAND_BASE + DRM_I915_GEM_WAIT_USER_FENCE, 
>>>>>struct drm_i915_gem_wait_user_fence)
>>>>>+
>>>>>+/**
>>>>>+ * struct drm_i915_gem_vm_bind - VA to object mapping to bind.
>>>>>+ *
>>>>>+ * This structure is passed to VM_BIND ioctl and specifies 
>>>>>the mapping of GPU
>>>>>+ * virtual address (VA) range to the section of an object 
>>>>>that should be bound
>>>>>+ * in the device page table of the specified address space (VM).
>>>>>+ * The VA range specified must be unique (ie., not currently 
>>>>>bound) and can
>>>>>+ * be mapped to whole object or a section of the object 
>>>>>(partial binding).
>>>>>+ * Multiple VA mappings can be created to the same section of 
>>>>>the object
>>>>>+ * (aliasing).
>>>>>+ */
>>>>>+struct drm_i915_gem_vm_bind {
>>>>>+    /** @vm_id: VM (address space) id to bind */
>>>>>+    __u32 vm_id;
>>>>>+
>>>>>+    /** @handle: Object handle */
>>>>>+    __u32 handle;
>>>>>+
>>>>>+    /** @start: Virtual Address start to bind */
>>>>>+    __u64 start;
>>>>>+
>>>>>+    /** @offset: Offset in object to bind */
>>>>>+    __u64 offset;
>>>>>+
>>>>>+    /** @length: Length of mapping to bind */
>>>>>+    __u64 length;
>>>>
>>>>Does it support, or should it, the equivalent of 
>>>>EXEC_OBJECT_PAD_TO_SIZE? Or if not, is userspace expected to map 
>>>>the remainder of the space to a dummy object? In which case 
>>>>would there be any alignment/padding issues preventing the two 
>>>>binds from being placed next to each other?
>>>>
>>>>I ask because someone from the compute side asked me about a 
>>>>problem with their strategy of dealing with overfetch and I 
>>>>suggested pad to size.
>>>>
>>>
>>>Thanks Tvrtko,
>>>I don't think we should need it. As VM_BIND pushes VA assignment
>>>completely to userspace, no padding should be necessary
>>>once the 'start' and 'size' alignment conditions are met.
>>>
>>>I will add some documentation on the alignment requirements here.
>>>Generally, 'start' and 'size' should be 4K aligned. But, I think
>>>when we have 64K lmem page sizes (dg2 and xehpsdv), they need to
>>>be 64K aligned.
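
A minimal sketch of that validation rule (assuming page_size is 4K for
smem and 64K for lmem on dg2/xehpsdv; the exact rule is still to be
documented):

#include <stdbool.h>
#include <stdint.h>

/* Sketch: 'start' and 'length' of a VM_BIND must both be aligned to
 * the page size backing the mapping, and the range must be non-empty.
 */
static bool vm_bind_range_aligned(uint64_t start, uint64_t length,
				  uint64_t page_size)
{
	return length != 0 &&
	       (start & (page_size - 1)) == 0 &&
	       (length & (page_size - 1)) == 0;
}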
>>
>>+ Matt
>>
>>Is aligning to 64k enough for all overfetch issues?
>>
>>Apparently compute has a situation where a buffer is received by one 
>>component and another has to apply more alignment to it, to deal 
>>with overfetch. Since they cannot grow the actual BO, would they want 
>>to VM_BIND a scratch area on top? Or perhaps none of this is a 
>>problem on discrete, and the original BO should be correctly allocated to 
>>start with.
>>
>>Side question - what about the 2MiB alignment mentioned in 
>>i915_vma_insert to avoid mixing 4k and 64k PTEs? Does that not apply 
>>to discrete?
>
>Not sure about the overfetch thing, but yeah dg2 & xehpsdv both 
>require a minimum of 64K pages underneath for local memory, and the BO 
>size will also be rounded up accordingly. And yeah the complication 
>arises due to not being able to mix 4K + 64K GTT pages within the same 
>page-table (existed since even gen8). Note that 4K here is what we 
>typically get for system memory.
>
>Originally we had a memory coloring scheme to track the "color" of 
>each page-table, which basically ensures that userspace can't do 
>something nasty like mixing page sizes. The advantage of that scheme 
>is that we would only require 64K GTT alignment and no extra padding, 
>but is perhaps a little complex.
>
>The merged solution is just to align and pad (i.e vma->node.size and 
>not vma->size) out of the vma to 2M, which is dead simple 
>implementation wise, but does potentially waste some GTT space and 
>some of the local memory used for the actual page-table. For the 
>alignment the kernel just validates that the GTT address is aligned to 
>2M in vma_insert(), and then for the padding it just inflates it to 
>2M, if userspace hasn't already.
>
>See the kernel-doc for @size: https://dri.freedesktop.org/docs/drm/gpu/driver-uapi.html?#c.drm_i915_gem_create_ext
>

Ok, those requirements (2M VA alignment) will apply to VM_BIND also.
This is unfortunate, but it is not something new enforced by VM_BIND.
The other option is to go with 64K alignment, and in the VM_BIND case the
user must ensure there is no mixing of 64K (lmem) and 4k (smem)
mappings in the same 2M range. But this is not VM_BIND specific
(it will apply to soft-pinning in execbuf2 also).

I don't think we need any VA padding here, as with VM_BIND the VA is
managed fully by the user. If we enforce the VA to be 2M aligned, it
will leave holes (if BOs are smaller than 2M), but nobody is going
to allocate anything from there.
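
To put numbers on it: with a 2M alignment rule, binding a 64K BO still
consumes a 2M-aligned VA slot, so up to 2M - 64K = 1984K of VA between
consecutive bindings goes unused; with a plain 64K rule there is no such
hole, at the cost of userspace itself policing the no-mixing rule within
each 2M range.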

Niranjana

>>
>>Regards,
>>
>>Tvrtko
>>
>>>
>>>Niranjana
>>>
>>>>Regards,
>>>>
>>>>Tvrtko
>>>>
>>>>>+
>>>>>+    /**
>>>>>+     * @flags: Supported flags are,
>>>>>+     *
>>>>>+     * I915_GEM_VM_BIND_READONLY:
>>>>>+     * Mapping is read-only.
>>>>>+     *
>>>>>+     * I915_GEM_VM_BIND_CAPTURE:
>>>>>+     * Capture this mapping in the dump upon GPU error.
>>>>>+     */
>>>>>+    __u64 flags;
>>>>>+#define I915_GEM_VM_BIND_READONLY    (1 << 0)
>>>>>+#define I915_GEM_VM_BIND_CAPTURE     (1 << 1)
>>>>>+
>>>>>+    /** @extensions: 0-terminated chain of extensions for 
>>>>>this mapping. */
>>>>>+    __u64 extensions;
>>>>>+};
>>>>>+
>>>>>+/**
>>>>>+ * struct drm_i915_gem_vm_unbind - VA to object mapping to unbind.
>>>>>+ *
>>>>>+ * This structure is passed to VM_UNBIND ioctl and specifies 
>>>>>the GPU virtual
>>>>>+ * address (VA) range that should be unbound from the device 
>>>>>page table of the
>>>>>+ * specified address space (VM). The specified VA range must 
>>>>>match one of the
>>>>>+ * mappings created with the VM_BIND ioctl. TLB is flushed upon unbind
>>>>>+ * completion.
>>>>>+ */
>>>>>+struct drm_i915_gem_vm_unbind {
>>>>>+    /** @vm_id: VM (address space) id to bind */
>>>>>+    __u32 vm_id;
>>>>>+
>>>>>+    /** @rsvd: Reserved for future use; must be zero. */
>>>>>+    __u32 rsvd;
>>>>>+
>>>>>+    /** @start: Virtual Address start to unbind */
>>>>>+    __u64 start;
>>>>>+
>>>>>+    /** @length: Length of mapping to unbind */
>>>>>+    __u64 length;
>>>>>+
>>>>>+    /** @flags: reserved for future usage, currently MBZ */
>>>>>+    __u64 flags;
>>>>>+
>>>>>+    /** @extensions: 0-terminated chain of extensions for 
>>>>>this mapping. */
>>>>>+    __u64 extensions;
>>>>>+};
>>>>>+
>>>>>+/**
>>>>>+ * struct drm_i915_vm_bind_fence - An input or output fence 
>>>>>for the vm_bind
>>>>>+ * or the vm_unbind work.
>>>>>+ *
>>>>>+ * The vm_bind or vm_unbind async worker will wait for input 
>>>>>fence to signal
>>>>>+ * before starting the binding or unbinding.
>>>>>+ *
>>>>>+ * The vm_bind or vm_unbind async worker will signal the 
>>>>>returned output fence
>>>>>+ * after the completion of binding or unbinding.
>>>>>+ */
>>>>>+struct drm_i915_vm_bind_fence {
>>>>>+    /** @handle: User's handle for a drm_syncobj to wait on 
>>>>>or signal. */
>>>>>+    __u32 handle;
>>>>>+
>>>>>+    /**
>>>>>+     * @flags: Supported flags are,
>>>>>+     *
>>>>>+     * I915_VM_BIND_FENCE_WAIT:
>>>>>+     * Wait for the input fence before binding/unbinding
>>>>>+     *
>>>>>+     * I915_VM_BIND_FENCE_SIGNAL:
>>>>>+     * Return bind/unbind completion fence as output
>>>>>+     */
>>>>>+    __u32 flags;
>>>>>+#define I915_VM_BIND_FENCE_WAIT            (1<<0)
>>>>>+#define I915_VM_BIND_FENCE_SIGNAL          (1<<1)
>>>>>+#define __I915_VM_BIND_FENCE_UNKNOWN_FLAGS 
>>>>>(-(I915_VM_BIND_FENCE_SIGNAL << 1))
>>>>>+};
>>>>>+
>>>>>+/**
>>>>>+ * struct drm_i915_vm_bind_ext_timeline_fences - Timeline 
>>>>>fences for vm_bind
>>>>>+ * and vm_unbind.
>>>>>+ *
>>>>>+ * This structure describes an array of timeline drm_syncobj 
>>>>>and associated
>>>>>+ * points for timeline variants of drm_syncobj. These 
>>>>>timeline 'drm_syncobj's
>>>>>+ * can be input or output fences (See struct drm_i915_vm_bind_fence).
>>>>>+ */
>>>>>+struct drm_i915_vm_bind_ext_timeline_fences {
>>>>>+#define I915_VM_BIND_EXT_timeline_FENCES    0
>>>>>+    /** @base: Extension link. See struct i915_user_extension. */
>>>>>+    struct i915_user_extension base;
>>>>>+
>>>>>+    /**
>>>>>+     * @fence_count: Number of elements in the @handles_ptr & 
>>>>>@value_ptr
>>>>>+     * arrays.
>>>>>+     */
>>>>>+    __u64 fence_count;
>>>>>+
>>>>>+    /**
>>>>>+     * @handles_ptr: Pointer to an array of struct 
>>>>>drm_i915_vm_bind_fence
>>>>>+     * of length @fence_count.
>>>>>+     */
>>>>>+    __u64 handles_ptr;
>>>>>+
>>>>>+    /**
>>>>>+     * @values_ptr: Pointer to an array of u64 values of length
>>>>>+     * @fence_count.
>>>>>+     * Values must be 0 for a binary drm_syncobj. A Value of 0 for a
>>>>>+     * timeline drm_syncobj is invalid as it turns a 
>>>>>drm_syncobj into a
>>>>>+     * binary one.
>>>>>+     */
>>>>>+    __u64 values_ptr;
>>>>>+};
>>>>>+
>>>>>+/**
>>>>>+ * struct drm_i915_vm_bind_user_fence - An input or output 
>>>>>user fence for the
>>>>>+ * vm_bind or the vm_unbind work.
>>>>>+ *
>>>>>+ * The vm_bind or vm_unbind async worker will wait for the 
>>>>>input fence (value at
>>>>>+ * @addr to become equal to @val) before starting the binding 
>>>>>or unbinding.
>>>>>+ *
>>>>>+ * The vm_bind or vm_unbind async worker will signal the 
>>>>>output fence after
>>>>>+ * the completion of binding or unbinding by writing @val to 
>>>>>memory location at
>>>>>+ * @addr
>>>>>+ */
>>>>>+struct drm_i915_vm_bind_user_fence {
>>>>>+    /** @addr: User/Memory fence qword aligned process 
>>>>>virtual address */
>>>>>+    __u64 addr;
>>>>>+
>>>>>+    /** @val: User/Memory fence value to be written after 
>>>>>bind completion */
>>>>>+    __u64 val;
>>>>>+
>>>>>+    /**
>>>>>+     * @flags: Supported flags are,
>>>>>+     *
>>>>>+     * I915_VM_BIND_USER_FENCE_WAIT:
>>>>>+     * Wait for the input fence before binding/unbinding
>>>>>+     *
>>>>>+     * I915_VM_BIND_USER_FENCE_SIGNAL:
>>>>>+     * Return bind/unbind completion fence as output
>>>>>+     */
>>>>>+    __u32 flags;
>>>>>+#define I915_VM_BIND_USER_FENCE_WAIT            (1<<0)
>>>>>+#define I915_VM_BIND_USER_FENCE_SIGNAL          (1<<1)
>>>>>+#define __I915_VM_BIND_USER_FENCE_UNKNOWN_FLAGS \
>>>>>+    (-(I915_VM_BIND_USER_FENCE_SIGNAL << 1))
>>>>>+};
>>>>>+
>>>>>+/**
>>>>>+ * struct drm_i915_vm_bind_ext_user_fence - User/memory 
>>>>>fences for vm_bind
>>>>>+ * and vm_unbind.
>>>>>+ *
>>>>>+ * These user fences can be input or output fences
>>>>>+ * (See struct drm_i915_vm_bind_user_fence).
>>>>>+ */
>>>>>+struct drm_i915_vm_bind_ext_user_fence {
>>>>>+#define I915_VM_BIND_EXT_USER_FENCES    1
>>>>>+    /** @base: Extension link. See struct i915_user_extension. */
>>>>>+    struct i915_user_extension base;
>>>>>+
>>>>>+    /** @fence_count: Number of elements in the 
>>>>>@user_fence_ptr array. */
>>>>>+    __u64 fence_count;
>>>>>+
>>>>>+    /**
>>>>>+     * @user_fence_ptr: Pointer to an array of
>>>>>+     * struct drm_i915_vm_bind_user_fence of length @fence_count.
>>>>>+     */
>>>>>+    __u64 user_fence_ptr;
>>>>>+};
>>>>>+
>>>>>+/**
>>>>>+ * struct drm_i915_gem_execbuffer_ext_batch_addresses - Array 
>>>>>of batch buffer
>>>>>+ * gpu virtual addresses.
>>>>>+ *
>>>>>+ * In the execbuff ioctl (See struct 
>>>>>drm_i915_gem_execbuffer2), this extension
>>>>>+ * must always be appended in the VM_BIND mode and it will be 
>>>>>an error to
>>>>>+ * append this extension in older non-VM_BIND mode.
>>>>>+ */
>>>>>+struct drm_i915_gem_execbuffer_ext_batch_addresses {
>>>>>+#define DRM_I915_GEM_EXECBUFFER_EXT_BATCH_ADDRESSES    1
>>>>>+    /** @base: Extension link. See struct i915_user_extension. */
>>>>>+    struct i915_user_extension base;
>>>>>+
>>>>>+    /** @count: Number of addresses in the addr array. */
>>>>>+    __u32 count;
>>>>>+
>>>>>+    /** @addr: An array of batch gpu virtual addresses. */
>>>>>+    __u64 addr[0];
>>>>>+};
>>>>>+
>>>>>+/**
>>>>>+ * struct drm_i915_gem_execbuffer_ext_user_fence - First 
>>>>>level batch completion
>>>>>+ * signaling extension.
>>>>>+ *
>>>>>+ * This extension allows user to attach a user fence (@addr, 
>>>>>@value pair) to an
>>>>>+ * execbuf to be signaled by the command streamer after the 
>>>>>completion of first
>>>>>+ * level batch, by writing the @value at specified @addr and 
>>>>>triggering an
>>>>>+ * interrupt.
>>>>>+ * User can either poll for this user fence to signal or can 
>>>>>also wait on it
>>>>>+ * with i915_gem_wait_user_fence ioctl.
>>>>>+ * This is very much useful for long running contexts where 
>>>>>waiting on dma-fence
>>>>>+ * by user (like i915_gem_wait ioctl) is not supported.
>>>>>+ */
>>>>>+struct drm_i915_gem_execbuffer_ext_user_fence {
>>>>>+#define DRM_I915_GEM_EXECBUFFER_EXT_USER_FENCE        2
>>>>>+    /** @base: Extension link. See struct i915_user_extension. */
>>>>>+    struct i915_user_extension base;
>>>>>+
>>>>>+    /**
>>>>>+     * @addr: User/Memory fence qword aligned GPU virtual address.
>>>>>+     *
>>>>>+     * Address has to be a valid GPU virtual address at the time of
>>>>>+     * first level batch completion.
>>>>>+     */
>>>>>+    __u64 addr;
>>>>>+
>>>>>+    /**
>>>>>+     * @value: User/Memory fence Value to be written to above address
>>>>>+     * after first level batch completes.
>>>>>+     */
>>>>>+    __u64 value;
>>>>>+
>>>>>+    /** @rsvd: Reserved for future extensions, MBZ */
>>>>>+    __u64 rsvd;
>>>>>+};
>>>>>+
>>>>>+/**
>>>>>+ * struct drm_i915_gem_create_ext_vm_private - Extension to 
>>>>>make the object
>>>>>+ * private to the specified VM.
>>>>>+ *
>>>>>+ * See struct drm_i915_gem_create_ext.
>>>>>+ */
>>>>>+struct drm_i915_gem_create_ext_vm_private {
>>>>>+#define I915_GEM_CREATE_EXT_VM_PRIVATE        2
>>>>>+    /** @base: Extension link. See struct i915_user_extension. */
>>>>>+    struct i915_user_extension base;
>>>>>+
>>>>>+    /** @vm_id: Id of the VM to which the object is private */
>>>>>+    __u32 vm_id;
>>>>>+};
>>>>>+
>>>>>+/**
>>>>>+ * struct drm_i915_gem_wait_user_fence - Wait on user/memory fence.
>>>>>+ *
>>>>>+ * User/Memory fence can be woken up either by:
>>>>>+ *
>>>>>+ * 1. GPU context indicated by @ctx_id, or,
>>>>>+ * 2. Kernel driver async worker upon I915_UFENCE_WAIT_SOFT.
>>>>>+ *    @ctx_id is ignored when this flag is set.
>>>>>+ *
>>>>>+ * Wakeup condition is,
>>>>>+ * ``((*addr & mask) op (value & mask))``
>>>>>+ *
>>>>>+ * See :ref:`Documentation/driver-api/dma-buf.rst 
>>>>><indefinite_dma_fences>`
>>>>>+ */
>>>>>+struct drm_i915_gem_wait_user_fence {
>>>>>+    /** @extensions: Zero-terminated chain of extensions. */
>>>>>+    __u64 extensions;
>>>>>+
>>>>>+    /** @addr: User/Memory fence address */
>>>>>+    __u64 addr;
>>>>>+
>>>>>+    /** @ctx_id: Id of the Context which will signal the fence. */
>>>>>+    __u32 ctx_id;
>>>>>+
>>>>>+    /** @op: Wakeup condition operator */
>>>>>+    __u16 op;
>>>>>+#define I915_UFENCE_WAIT_EQ      0
>>>>>+#define I915_UFENCE_WAIT_NEQ     1
>>>>>+#define I915_UFENCE_WAIT_GT      2
>>>>>+#define I915_UFENCE_WAIT_GTE     3
>>>>>+#define I915_UFENCE_WAIT_LT      4
>>>>>+#define I915_UFENCE_WAIT_LTE     5
>>>>>+#define I915_UFENCE_WAIT_BEFORE  6
>>>>>+#define I915_UFENCE_WAIT_AFTER   7
>>>>>+
>>>>>+    /**
>>>>>+     * @flags: Supported flags are,
>>>>>+     *
>>>>>+     * I915_UFENCE_WAIT_SOFT:
>>>>>+     *
>>>>>+     * To be woken up by i915 driver async worker (not by GPU).
>>>>>+     *
>>>>>+     * I915_UFENCE_WAIT_ABSTIME:
>>>>>+     *
>>>>>+     * Wait timeout specified as absolute time.
>>>>>+     */
>>>>>+    __u16 flags;
>>>>>+#define I915_UFENCE_WAIT_SOFT    0x1
>>>>>+#define I915_UFENCE_WAIT_ABSTIME 0x2
>>>>>+
>>>>>+    /** @value: Wakeup value */
>>>>>+    __u64 value;
>>>>>+
>>>>>+    /** @mask: Wakeup mask */
>>>>>+    __u64 mask;
>>>>>+#define I915_UFENCE_WAIT_U8     0xffu
>>>>>+#define I915_UFENCE_WAIT_U16    0xffffu
>>>>>+#define I915_UFENCE_WAIT_U32    0xfffffffful
>>>>>+#define I915_UFENCE_WAIT_U64    0xffffffffffffffffull
>>>>>+
>>>>>+    /**
>>>>>+     * @timeout: Wait timeout in nanoseconds.
>>>>>+     *
>>>>>+     * If I915_UFENCE_WAIT_ABSTIME flag is set, then the 
>>>>>timeout is the
>>>>>+     * absolute time in nsec.
>>>>>+     */
>>>>>+    __s64 timeout;
>>>>>+};
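
As a usage illustration of the wait ioctl quoted above (all names come
from this RFC header and are not merged uapi), waiting for a memory fence
to reach a value via the driver's async worker might look like:

#include <stdint.h>
#include <sys/ioctl.h>

/* Sketch: wait up to 1 ms (relative timeout) for *fence_addr == expected.
 * I915_UFENCE_WAIT_SOFT means the i915 async worker does the wakeup, so
 * ctx_id is ignored. The struct and ioctl are from the proposed RFC
 * header only.
 */
static int wait_user_fence_eq(int fd, uint64_t *fence_addr,
			      uint64_t expected)
{
	struct drm_i915_gem_wait_user_fence wait = {
		.addr = (uintptr_t)fence_addr,  /* qword aligned */
		.op = I915_UFENCE_WAIT_EQ,
		.flags = I915_UFENCE_WAIT_SOFT,
		.value = expected,
		.mask = I915_UFENCE_WAIT_U64,
		.timeout = 1000000,             /* ns */
	};

	return ioctl(fd, DRM_IOCTL_I915_GEM_WAIT_USER_FENCE, &wait);
}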

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [Intel-gfx] [RFC v3 3/3] drm/doc/rfc: VM_BIND uapi definition
@ 2022-06-08 21:32             ` Niranjana Vishwanathapura
  0 siblings, 0 replies; 121+ messages in thread
From: Niranjana Vishwanathapura @ 2022-06-08 21:32 UTC (permalink / raw)
  To: Matthew Auld
  Cc: intel-gfx, chris.p.wilson, thomas.hellstrom, dri-devel,
	daniel.vetter, christian.koenig

On Wed, Jun 08, 2022 at 10:12:05AM +0100, Matthew Auld wrote:
>On 08/06/2022 08:17, Tvrtko Ursulin wrote:
>>
>>On 07/06/2022 20:37, Niranjana Vishwanathapura wrote:
>>>On Tue, Jun 07, 2022 at 11:27:14AM +0100, Tvrtko Ursulin wrote:
>>>>
>>>>On 17/05/2022 19:32, Niranjana Vishwanathapura wrote:
>>>>>VM_BIND and related uapi definitions
>>>>>
>>>>>v2: Ensure proper kernel-doc formatting with cross references.
>>>>>    Also add new uapi and documentation as per review comments
>>>>>    from Daniel.
>>>>>
>>>>>Signed-off-by: Niranjana Vishwanathapura 
>>>>><niranjana.vishwanathapura@intel.com>
>>>>>---
>>>>> Documentation/gpu/rfc/i915_vm_bind.h | 399 +++++++++++++++++++++++++++
>>>>> 1 file changed, 399 insertions(+)
>>>>> create mode 100644 Documentation/gpu/rfc/i915_vm_bind.h
>>>>>
>>>>>diff --git a/Documentation/gpu/rfc/i915_vm_bind.h 
>>>>>b/Documentation/gpu/rfc/i915_vm_bind.h
>>>>>new file mode 100644
>>>>>index 000000000000..589c0a009107
>>>>>--- /dev/null
>>>>>+++ b/Documentation/gpu/rfc/i915_vm_bind.h
>>>>>@@ -0,0 +1,399 @@
>>>>>+/* SPDX-License-Identifier: MIT */
>>>>>+/*
>>>>>+ * Copyright © 2022 Intel Corporation
>>>>>+ */
>>>>>+
>>>>>+/**
>>>>>+ * DOC: I915_PARAM_HAS_VM_BIND
>>>>>+ *
>>>>>+ * VM_BIND feature availability.
>>>>>+ * See typedef drm_i915_getparam_t param.
>>>>>+ */
>>>>>+#define I915_PARAM_HAS_VM_BIND        57
>>>>>+
>>>>>+/**
>>>>>+ * DOC: I915_VM_CREATE_FLAGS_USE_VM_BIND
>>>>>+ *
>>>>>+ * Flag to opt-in for VM_BIND mode of binding during VM creation.
>>>>>+ * See struct drm_i915_gem_vm_control flags.
>>>>>+ *
>>>>>+ * A VM in VM_BIND mode will not support the older execbuff 
>>>>>mode of binding.
>>>>>+ * In VM_BIND mode, execbuff ioctl will not accept any 
>>>>>execlist (i.e., the
>>>>>+ * &drm_i915_gem_execbuffer2.buffer_count must be 0).
>>>>>+ * Also, &drm_i915_gem_execbuffer2.batch_start_offset and
>>>>>+ * &drm_i915_gem_execbuffer2.batch_len must be 0.
>>>>>+ * DRM_I915_GEM_EXECBUFFER_EXT_BATCH_ADDRESSES extension must 
>>>>>be provided
>>>>>+ * to pass in the batch buffer addresses.
>>>>>+ *
>>>>>+ * Additionally, I915_EXEC_NO_RELOC, I915_EXEC_HANDLE_LUT and
>>>>>+ * I915_EXEC_BATCH_FIRST of &drm_i915_gem_execbuffer2.flags must be 0
>>>>>+ * (not used) in VM_BIND mode. I915_EXEC_USE_EXTENSIONS flag 
>>>>>must always be
>>>>>+ * set (See struct drm_i915_gem_execbuffer_ext_batch_addresses).
>>>>>+ * The buffers_ptr, buffer_count, batch_start_offset and 
>>>>>batch_len fields
>>>>>+ * of struct drm_i915_gem_execbuffer2 are also not used and must be 0.
>>>>>+ */
>>>>>+#define I915_VM_CREATE_FLAGS_USE_VM_BIND    (1 << 0)
>>>>>+
>>>>>+/**
>>>>>+ * DOC: I915_CONTEXT_CREATE_FLAGS_LONG_RUNNING
>>>>>+ *
>>>>>+ * Flag to declare context as long running.
>>>>>+ * See struct drm_i915_gem_context_create_ext flags.
>>>>>+ *
>>>>>+ * Usage of dma-fence expects that they complete in a reasonable amount of time.
>>>>>+ * Compute on the other hand can be long running. Hence it is 
>>>>>not appropriate
>>>>>+ * for compute contexts to export request completion 
>>>>>dma-fence to user.
>>>>>+ * The dma-fence usage will be limited to in-kernel consumption only.
>>>>>+ * Compute contexts need to use user/memory fence.
>>>>>+ *
>>>>>+ * So, long running contexts do not support output fences. Hence,
>>>>>+ * I915_EXEC_FENCE_OUT (See &drm_i915_gem_execbuffer2.flags) and
>>>>>+ * I915_EXEC_FENCE_SIGNAL (See &drm_i915_gem_exec_fence.flags) are
>>>>>+ * expected not to be used.
>>>>>+ *
>>>>>+ * DRM_I915_GEM_WAIT ioctl call is also not supported for 
>>>>>objects mapped
>>>>>+ * to long running contexts.
>>>>>+ */
>>>>>+#define I915_CONTEXT_CREATE_FLAGS_LONG_RUNNING   (1u << 2)
>>>>>+
>>>>>+/* VM_BIND related ioctls */
>>>>>+#define DRM_I915_GEM_VM_BIND        0x3d
>>>>>+#define DRM_I915_GEM_VM_UNBIND        0x3e
>>>>>+#define DRM_I915_GEM_WAIT_USER_FENCE    0x3f
>>>>>+
>>>>>+#define DRM_IOCTL_I915_GEM_VM_BIND        
>>>>>DRM_IOWR(DRM_COMMAND_BASE + DRM_I915_GEM_VM_BIND, struct 
>>>>>drm_i915_gem_vm_bind)
>>>>>+#define DRM_IOCTL_I915_GEM_VM_UNBIND 
>>>>>DRM_IOWR(DRM_COMMAND_BASE + DRM_I915_GEM_VM_UNBIND, struct 
>>>>>drm_i915_gem_vm_bind)
>>>>>+#define DRM_IOCTL_I915_GEM_WAIT_USER_FENCE 
>>>>>DRM_IOWR(DRM_COMMAND_BASE + DRM_I915_GEM_WAIT_USER_FENCE, 
>>>>>struct drm_i915_gem_wait_user_fence)
>>>>>+
>>>>>+/**
>>>>>+ * struct drm_i915_gem_vm_bind - VA to object mapping to bind.
>>>>>+ *
>>>>>+ * This structure is passed to VM_BIND ioctl and specifies 
>>>>>the mapping of GPU
>>>>>+ * virtual address (VA) range to the section of an object 
>>>>>that should be bound
>>>>>+ * in the device page table of the specified address space (VM).
>>>>>+ * The VA range specified must be unique (i.e., not currently
>>>>>bound) and can
>>>>>+ * be mapped to whole object or a section of the object 
>>>>>(partial binding).
>>>>>+ * Multiple VA mappings can be created to the same section of 
>>>>>the object
>>>>>+ * (aliasing).
>>>>>+ */
>>>>>+struct drm_i915_gem_vm_bind {
>>>>>+    /** @vm_id: VM (address space) id to bind */
>>>>>+    __u32 vm_id;
>>>>>+
>>>>>+    /** @handle: Object handle */
>>>>>+    __u32 handle;
>>>>>+
>>>>>+    /** @start: Virtual Address start to bind */
>>>>>+    __u64 start;
>>>>>+
>>>>>+    /** @offset: Offset in object to bind */
>>>>>+    __u64 offset;
>>>>>+
>>>>>+    /** @length: Length of mapping to bind */
>>>>>+    __u64 length;
>>>>
>>>>Does it support, or should it, the equivalent of
>>>>EXEC_OBJECT_PAD_TO_SIZE? Or if not, is userspace expected to map
>>>>the remainder of the space to a dummy object? In which case,
>>>>would there be any alignment/padding issues preventing the two
>>>>binds from being placed next to each other?
>>>>
>>>>I ask because someone from the compute side asked me about a 
>>>>problem with their strategy of dealing with overfetch and I 
>>>>suggested pad to size.
>>>>
>>>
>>>Thanks Tvrtko,
>>>I think we shouldn't be needing it. As with VM_BIND VA assignment
>>>is completely pushed to userspace, no padding should be necessary
>>>once the 'start' and 'size' alignment conditions are met.
>>>
>>>I will add some documentation on alignment requirement here.
>>>Generally, 'start' and 'size' should be 4K aligned. But, I think
>>>when we have 64K lmem page sizes (dg2 and xehpsdv), they need to
>>>be 64K aligned.
>>
>>+ Matt
>>
>>Align to 64k is enough for all overfetch issues?
>>
>>Apparently compute has a situation where a buffer is received by one 
>>component and another has to apply more alignment to it, to deal 
>>with overfetch. Since they cannot grow the actual BO if they wanted 
>>to VM_BIND a scratch area on top? Or perhaps none of this is a 
>>problem on discrete and original BO should be correctly allocated to 
>>start with.
>>
>>Side question - what about the align to 2MiB mentioned in 
>>i915_vma_insert to avoid mixing 4k and 64k PTEs? That does not apply 
>>to discrete?
>
>Not sure about the overfetch thing, but yeah dg2 & xehpsdv both 
>require a minimum of 64K pages underneath for local memory, and the BO 
>size will also be rounded up accordingly. And yeah the complication 
>arises due to not being able to mix 4K + 64K GTT pages within the same 
>page-table (existed since even gen8). Note that 4K here is what we 
>typically get for system memory.
>
>Originally we had a memory coloring scheme to track the "color" of 
>each page-table, which basically ensures that userspace can't do 
>something nasty like mixing page sizes. The advantage of that scheme 
>is that we would only require 64K GTT alignment and no extra padding, 
>but is perhaps a little complex.
>
>The merged solution is just to align and pad (i.e. vma->node.size and
>not vma->size) out of the vma to 2M, which is dead simple 
>implementation wise, but does potentially waste some GTT space and 
>some of the local memory used for the actual page-table. For the 
>alignment the kernel just validates that the GTT address is aligned to 
>2M in vma_insert(), and then for the padding it just inflates it to 
>2M, if userspace hasn't already.
>
>See the kernel-doc for @size: https://dri.freedesktop.org/docs/drm/gpu/driver-uapi.html?#c.drm_i915_gem_create_ext
>

Ok, those requirements (2M VA alignment) will apply to VM_BIND also.
This is unfortunate, but it is not something new enforced by VM_BIND.
The other option is to go with 64K alignment; in the VM_BIND case, the
user must ensure there is no mixing of 64K (lmem) and 4k (smem)
mappings in the same 2M range. But this is not VM_BIND specific
(it will apply to soft-pinning in execbuf2 also).

I don't think we need any VA padding here, as with VM_BIND the VA is
managed fully by the user. If we enforce the VA to be 2M aligned, it
will leave holes (if BOs are smaller than 2M), but nobody is going
to allocate anything from there.
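To make the arithmetic concrete, a userspace VA allocator on such platforms would simply round both the start and the node size up to 2M; a minimal sketch (the helper names and bump-allocator scheme are hypothetical, nothing here is i915 uapi):

#include <stdint.h>

#define SZ_2M (2ull * 1024 * 1024)

/* Round x up to the next multiple of a power-of-two alignment. */
static inline uint64_t align_up(uint64_t x, uint64_t align)
{
	return (x + align - 1) & ~(align - 1);
}

/* Place a BO in a simple bump allocator: 2M-aligned start, 2M-padded size.
 * A 1.5M lmem BO thus occupies a full 2M slot, leaving a hole that nothing
 * else will be allocated from, as noted above. */
static uint64_t place_bo(uint64_t *next_free_va, uint64_t bo_size)
{
	uint64_t start = align_up(*next_free_va, SZ_2M);

	*next_free_va = start + align_up(bo_size, SZ_2M);
	return start;
}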

Niranjana

>>
>>Regards,
>>
>>Tvrtko
>>
>>>
>>>Niranjana
>>>
>>>>Regards,
>>>>
>>>>Tvrtko
>>>>
>>>>>+
>>>>>+    /**
>>>>>+     * @flags: Supported flags are,
>>>>>+     *
>>>>>+     * I915_GEM_VM_BIND_READONLY:
>>>>>+     * Mapping is read-only.
>>>>>+     *
>>>>>+     * I915_GEM_VM_BIND_CAPTURE:
>>>>>+     * Capture this mapping in the dump upon GPU error.
>>>>>+     */
>>>>>+    __u64 flags;
>>>>>+#define I915_GEM_VM_BIND_READONLY    (1 << 0)
>>>>>+#define I915_GEM_VM_BIND_CAPTURE     (1 << 1)
>>>>>+
>>>>>+    /** @extensions: 0-terminated chain of extensions for 
>>>>>this mapping. */
>>>>>+    __u64 extensions;
>>>>>+};
>>>>>+
>>>>>+/**
>>>>>+ * struct drm_i915_gem_vm_unbind - VA to object mapping to unbind.
>>>>>+ *
>>>>>+ * This structure is passed to VM_UNBIND ioctl and specifies 
>>>>>the GPU virtual
>>>>>+ * address (VA) range that should be unbound from the device 
>>>>>page table of the
>>>>>+ * specified address space (VM). The specified VA range must 
>>>>>match one of the
>>>>>+ * mappings created with the VM_BIND ioctl. TLB is flushed upon unbind
>>>>>+ * completion.
>>>>>+ */
>>>>>+struct drm_i915_gem_vm_unbind {
>>>>>+    /** @vm_id: VM (address space) id to bind */
>>>>>+    __u32 vm_id;
>>>>>+
>>>>>+    /** @rsvd: Reserved for future use; must be zero. */
>>>>>+    __u32 rsvd;
>>>>>+
>>>>>+    /** @start: Virtual Address start to unbind */
>>>>>+    __u64 start;
>>>>>+
>>>>>+    /** @length: Length of mapping to unbind */
>>>>>+    __u64 length;
>>>>>+
>>>>>+    /** @flags: reserved for future usage, currently MBZ */
>>>>>+    __u64 flags;
>>>>>+
>>>>>+    /** @extensions: 0-terminated chain of extensions for 
>>>>>this mapping. */
>>>>>+    __u64 extensions;
>>>>>+};
>>>>>+
>>>>>+/**
>>>>>+ * struct drm_i915_vm_bind_fence - An input or output fence 
>>>>>for the vm_bind
>>>>>+ * or the vm_unbind work.
>>>>>+ *
>>>>>+ * The vm_bind or vm_unbind async worker will wait for the input fence to signal
>>>>>+ * before starting the binding or unbinding.
>>>>>+ *
>>>>>+ * The vm_bind or vm_unbind async worker will signal the 
>>>>>returned output fence
>>>>>+ * after the completion of binding or unbinding.
>>>>>+ */
>>>>>+struct drm_i915_vm_bind_fence {
>>>>>+    /** @handle: User's handle for a drm_syncobj to wait on 
>>>>>or signal. */
>>>>>+    __u32 handle;
>>>>>+
>>>>>+    /**
>>>>>+     * @flags: Supported flags are,
>>>>>+     *
>>>>>+     * I915_VM_BIND_FENCE_WAIT:
>>>>>+     * Wait for the input fence before binding/unbinding
>>>>>+     *
>>>>>+     * I915_VM_BIND_FENCE_SIGNAL:
>>>>>+     * Return bind/unbind completion fence as output
>>>>>+     */
>>>>>+    __u32 flags;
>>>>>+#define I915_VM_BIND_FENCE_WAIT            (1<<0)
>>>>>+#define I915_VM_BIND_FENCE_SIGNAL          (1<<1)
>>>>>+#define __I915_VM_BIND_FENCE_UNKNOWN_FLAGS 
>>>>>(-(I915_VM_BIND_FENCE_SIGNAL << 1))
>>>>>+};
>>>>>+
>>>>>+/**
>>>>>+ * struct drm_i915_vm_bind_ext_timeline_fences - Timeline 
>>>>>fences for vm_bind
>>>>>+ * and vm_unbind.
>>>>>+ *
>>>>>+ * This structure describes an array of timeline drm_syncobj 
>>>>>and associated
>>>>>+ * points for timeline variants of drm_syncobj. These 
>>>>>timeline 'drm_syncobj's
>>>>>+ * can be input or output fences (See struct drm_i915_vm_bind_fence).
>>>>>+ */
>>>>>+struct drm_i915_vm_bind_ext_timeline_fences {
>>>>>+#define I915_VM_BIND_EXT_TIMELINE_FENCES    0
>>>>>+    /** @base: Extension link. See struct i915_user_extension. */
>>>>>+    struct i915_user_extension base;
>>>>>+
>>>>>+    /**
>>>>>+     * @fence_count: Number of elements in the @handles_ptr and
>>>>>+     * @values_ptr arrays.
>>>>>+     */
>>>>>+    __u64 fence_count;
>>>>>+
>>>>>+    /**
>>>>>+     * @handles_ptr: Pointer to an array of struct 
>>>>>drm_i915_vm_bind_fence
>>>>>+     * of length @fence_count.
>>>>>+     */
>>>>>+    __u64 handles_ptr;
>>>>>+
>>>>>+    /**
>>>>>+     * @values_ptr: Pointer to an array of u64 values of length
>>>>>+     * @fence_count.
>>>>>+     * Values must be 0 for a binary drm_syncobj. A value of 0 for a
>>>>>+     * timeline drm_syncobj is invalid as it turns a 
>>>>>drm_syncobj into a
>>>>>+     * binary one.
>>>>>+     */
>>>>>+    __u64 values_ptr;
>>>>>+};
>>>>>+
>>>>>+/**
>>>>>+ * struct drm_i915_vm_bind_user_fence - An input or output 
>>>>>user fence for the
>>>>>+ * vm_bind or the vm_unbind work.
>>>>>+ *
>>>>>+ * The vm_bind or vm_unbind async worker will wait for the
>>>>>input fence (value at
>>>>>+ * @addr to become equal to @val) before starting the binding 
>>>>>or unbinding.
>>>>>+ *
>>>>>+ * The vm_bind or vm_unbind async worker will signal the 
>>>>>output fence after
>>>>>+ * the completion of binding or unbinding by writing @val to the
>>>>>+ * memory location at @addr.
>>>>>+ */
>>>>>+struct drm_i915_vm_bind_user_fence {
>>>>>+    /** @addr: User/Memory fence qword aligned process 
>>>>>virtual address */
>>>>>+    __u64 addr;
>>>>>+
>>>>>+    /** @val: User/Memory fence value to be written after 
>>>>>bind completion */
>>>>>+    __u64 val;
>>>>>+
>>>>>+    /**
>>>>>+     * @flags: Supported flags are,
>>>>>+     *
>>>>>+     * I915_VM_BIND_USER_FENCE_WAIT:
>>>>>+     * Wait for the input fence before binding/unbinding
>>>>>+     *
>>>>>+     * I915_VM_BIND_USER_FENCE_SIGNAL:
>>>>>+     * Return bind/unbind completion fence as output
>>>>>+     */
>>>>>+    __u32 flags;
>>>>>+#define I915_VM_BIND_USER_FENCE_WAIT            (1<<0)
>>>>>+#define I915_VM_BIND_USER_FENCE_SIGNAL          (1<<1)
>>>>>+#define __I915_VM_BIND_USER_FENCE_UNKNOWN_FLAGS \
>>>>>+    (-(I915_VM_BIND_USER_FENCE_SIGNAL << 1))
>>>>>+};
>>>>>+
>>>>>+/**
>>>>>+ * struct drm_i915_vm_bind_ext_user_fence - User/memory 
>>>>>fences for vm_bind
>>>>>+ * and vm_unbind.
>>>>>+ *
>>>>>+ * These user fences can be input or output fences
>>>>>+ * (See struct drm_i915_vm_bind_user_fence).
>>>>>+ */
>>>>>+struct drm_i915_vm_bind_ext_user_fence {
>>>>>+#define I915_VM_BIND_EXT_USER_FENCES    1
>>>>>+    /** @base: Extension link. See struct i915_user_extension. */
>>>>>+    struct i915_user_extension base;
>>>>>+
>>>>>+    /** @fence_count: Number of elements in the 
>>>>>@user_fence_ptr array. */
>>>>>+    __u64 fence_count;
>>>>>+
>>>>>+    /**
>>>>>+     * @user_fence_ptr: Pointer to an array of
>>>>>+     * struct drm_i915_vm_bind_user_fence of length @fence_count.
>>>>>+     */
>>>>>+    __u64 user_fence_ptr;
>>>>>+};
>>>>>+
>>>>>+/**
>>>>>+ * struct drm_i915_gem_execbuffer_ext_batch_addresses - Array 
>>>>>of batch buffer
>>>>>+ * gpu virtual addresses.
>>>>>+ *
>>>>>+ * In the execbuff ioctl (See struct 
>>>>>drm_i915_gem_execbuffer2), this extension
>>>>>+ * must always be appended in the VM_BIND mode and it will be 
>>>>>an error to
>>>>>+ * append this extension in older non-VM_BIND mode.
>>>>>+ */
>>>>>+struct drm_i915_gem_execbuffer_ext_batch_addresses {
>>>>>+#define DRM_I915_GEM_EXECBUFFER_EXT_BATCH_ADDRESSES    1
>>>>>+    /** @base: Extension link. See struct i915_user_extension. */
>>>>>+    struct i915_user_extension base;
>>>>>+
>>>>>+    /** @count: Number of addresses in the addr array. */
>>>>>+    __u32 count;
>>>>>+
>>>>>+    /** @addr: An array of batch gpu virtual addresses. */
>>>>>+    __u64 addr[0];
>>>>>+};
>>>>>+
>>>>>+/**
>>>>>+ * struct drm_i915_gem_execbuffer_ext_user_fence - First 
>>>>>level batch completion
>>>>>+ * signaling extension.
>>>>>+ *
>>>>>+ * This extension allows user to attach a user fence (@addr, 
>>>>>@value pair) to an
>>>>>+ * execbuf to be signaled by the command streamer after the 
>>>>>completion of first
>>>>>+ * level batch, by writing the @value at specified @addr and 
>>>>>triggering an
>>>>>+ * interrupt.
>>>>>+ * User can either poll for this user fence to signal or can 
>>>>>also wait on it
>>>>>+ * with i915_gem_wait_user_fence ioctl.
>>>>>+ * This is very useful for long running contexts where
>>>>>waiting on dma-fence
>>>>>+ * by user (like i915_gem_wait ioctl) is not supported.
>>>>>+ */
>>>>>+struct drm_i915_gem_execbuffer_ext_user_fence {
>>>>>+#define DRM_I915_GEM_EXECBUFFER_EXT_USER_FENCE        2
>>>>>+    /** @base: Extension link. See struct i915_user_extension. */
>>>>>+    struct i915_user_extension base;
>>>>>+
>>>>>+    /**
>>>>>+     * @addr: User/Memory fence qword aligned GPU virtual address.
>>>>>+     *
>>>>>+     * Address has to be a valid GPU virtual address at the time of
>>>>>+     * first level batch completion.
>>>>>+     */
>>>>>+    __u64 addr;
>>>>>+
>>>>>+    /**
>>>>>+     * @value: User/Memory fence Value to be written to above address
>>>>>+     * after first level batch completes.
>>>>>+     */
>>>>>+    __u64 value;
>>>>>+
>>>>>+    /** @rsvd: Reserved for future extensions, MBZ */
>>>>>+    __u64 rsvd;
>>>>>+};
>>>>>+
>>>>>+/**
>>>>>+ * struct drm_i915_gem_create_ext_vm_private - Extension to 
>>>>>make the object
>>>>>+ * private to the specified VM.
>>>>>+ *
>>>>>+ * See struct drm_i915_gem_create_ext.
>>>>>+ */
>>>>>+struct drm_i915_gem_create_ext_vm_private {
>>>>>+#define I915_GEM_CREATE_EXT_VM_PRIVATE        2
>>>>>+    /** @base: Extension link. See struct i915_user_extension. */
>>>>>+    struct i915_user_extension base;
>>>>>+
>>>>>+    /** @vm_id: Id of the VM to which the object is private */
>>>>>+    __u32 vm_id;
>>>>>+};
>>>>>+
>>>>>+/**
>>>>>+ * struct drm_i915_gem_wait_user_fence - Wait on user/memory fence.
>>>>>+ *
>>>>>+ * User/Memory fence can be woken up either by:
>>>>>+ *
>>>>>+ * 1. GPU context indicated by @ctx_id, or,
>>>>>+ * 2. Kernel driver async worker upon I915_UFENCE_WAIT_SOFT.
>>>>>+ *    @ctx_id is ignored when this flag is set.
>>>>>+ *
>>>>>+ * Wakeup condition is,
>>>>>+ * ``((*addr & mask) op (value & mask))``
>>>>>+ *
>>>>>+ * See :ref:`Documentation/driver-api/dma-buf.rst <indefinite_dma_fences>`
>>>>>+ */
>>>>>+struct drm_i915_gem_wait_user_fence {
>>>>>+    /** @extensions: Zero-terminated chain of extensions. */
>>>>>+    __u64 extensions;
>>>>>+
>>>>>+    /** @addr: User/Memory fence address */
>>>>>+    __u64 addr;
>>>>>+
>>>>>+    /** @ctx_id: Id of the Context which will signal the fence. */
>>>>>+    __u32 ctx_id;
>>>>>+
>>>>>+    /** @op: Wakeup condition operator */
>>>>>+    __u16 op;
>>>>>+#define I915_UFENCE_WAIT_EQ      0
>>>>>+#define I915_UFENCE_WAIT_NEQ     1
>>>>>+#define I915_UFENCE_WAIT_GT      2
>>>>>+#define I915_UFENCE_WAIT_GTE     3
>>>>>+#define I915_UFENCE_WAIT_LT      4
>>>>>+#define I915_UFENCE_WAIT_LTE     5
>>>>>+#define I915_UFENCE_WAIT_BEFORE  6
>>>>>+#define I915_UFENCE_WAIT_AFTER   7
>>>>>+
>>>>>+    /**
>>>>>+     * @flags: Supported flags are,
>>>>>+     *
>>>>>+     * I915_UFENCE_WAIT_SOFT:
>>>>>+     *
>>>>>+     * To be woken up by i915 driver async worker (not by GPU).
>>>>>+     *
>>>>>+     * I915_UFENCE_WAIT_ABSTIME:
>>>>>+     *
>>>>>+     * Wait timeout specified as absolute time.
>>>>>+     */
>>>>>+    __u16 flags;
>>>>>+#define I915_UFENCE_WAIT_SOFT    0x1
>>>>>+#define I915_UFENCE_WAIT_ABSTIME 0x2
>>>>>+
>>>>>+    /** @value: Wakeup value */
>>>>>+    __u64 value;
>>>>>+
>>>>>+    /** @mask: Wakeup mask */
>>>>>+    __u64 mask;
>>>>>+#define I915_UFENCE_WAIT_U8     0xffu
>>>>>+#define I915_UFENCE_WAIT_U16    0xffffu
>>>>>+#define I915_UFENCE_WAIT_U32    0xfffffffful
>>>>>+#define I915_UFENCE_WAIT_U64    0xffffffffffffffffull
>>>>>+
>>>>>+    /**
>>>>>+     * @timeout: Wait timeout in nanoseconds.
>>>>>+     *
>>>>>+     * If I915_UFENCE_WAIT_ABSTIME flag is set, then the timeout is the
>>>>>+     * absolute time in nsec.
>>>>>+     */
>>>>>+    __s64 timeout;
>>>>>+};
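For illustration, a sketch of how userspace might issue a VM_BIND with a syncobj out-fence chained in through the timeline-fences extension. All structs and macros are from the proposed header quoted above (not merged uapi); the fd, handles, sizes and the chosen VA are placeholder assumptions:

#include <stdint.h>
#include <sys/ioctl.h>
/* Assumes the proposed definitions above (struct drm_i915_gem_vm_bind,
 * drm_i915_vm_bind_fence, drm_i915_vm_bind_ext_timeline_fences, etc.). */

static int bind_bo(int drm_fd, uint32_t vm_id, uint32_t bo_handle,
                   uint64_t bo_size, uint32_t bind_syncobj)
{
	struct drm_i915_vm_bind_fence out_fence = {
		.handle = bind_syncobj,              /* drm_syncobj to signal */
		.flags  = I915_VM_BIND_FENCE_SIGNAL, /* completion as output fence */
	};
	uint64_t point = 1; /* timeline point; must be 0 for a binary syncobj */
	struct drm_i915_vm_bind_ext_timeline_fences fences = {
		.base        = { .name = I915_VM_BIND_EXT_TIMELINE_FENCES },
		.fence_count = 1,
		.handles_ptr = (uintptr_t)&out_fence,
		.values_ptr  = (uintptr_t)&point,
	};
	struct drm_i915_gem_vm_bind bind = {
		.vm_id      = vm_id,
		.handle     = bo_handle,
		.start      = 0x100000000ull, /* VA fully managed by userspace */
		.offset     = 0,              /* bind from the start of the object */
		.length     = bo_size,        /* whole object, no partial binding */
		.extensions = (uintptr_t)&fences,
	};

	return ioctl(drm_fd, DRM_IOCTL_I915_GEM_VM_BIND, &bind);
}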

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [Intel-gfx] [RFC v3 1/3] drm/doc/rfc: VM_BIND feature design document
  2022-06-08  7:33                     ` Tvrtko Ursulin
@ 2022-06-08 21:44                         ` Niranjana Vishwanathapura
  0 siblings, 0 replies; 121+ messages in thread
From: Niranjana Vishwanathapura @ 2022-06-08 21:44 UTC (permalink / raw)
  To: Tvrtko Ursulin
  Cc: Intel GFX, Maling list - DRI developers, Thomas Hellstrom,
	Chris Wilson, Jason Ekstrand, Daniel Vetter,
	Christian König

On Wed, Jun 08, 2022 at 08:33:25AM +0100, Tvrtko Ursulin wrote:
>
>
>On 07/06/2022 22:32, Niranjana Vishwanathapura wrote:
>>On Tue, Jun 07, 2022 at 11:18:11AM -0700, Niranjana Vishwanathapura wrote:
>>>On Tue, Jun 07, 2022 at 12:12:03PM -0500, Jason Ekstrand wrote:
>>>> On Fri, Jun 3, 2022 at 6:52 PM Niranjana Vishwanathapura
>>>> <niranjana.vishwanathapura@intel.com> wrote:
>>>>
>>>>   On Fri, Jun 03, 2022 at 10:20:25AM +0300, Lionel Landwerlin wrote:
>>>>   >   On 02/06/2022 23:35, Jason Ekstrand wrote:
>>>>   >
>>>>   >     On Thu, Jun 2, 2022 at 3:11 PM Niranjana Vishwanathapura
>>>>   >     <niranjana.vishwanathapura@intel.com> wrote:
>>>>   >
>>>>   >       On Wed, Jun 01, 2022 at 01:28:36PM -0700, Matthew 
>>>>Brost wrote:
>>>>   >       >On Wed, Jun 01, 2022 at 05:25:49PM +0300, Lionel Landwerlin
>>>>   wrote:
>>>>   >       >> On 17/05/2022 21:32, Niranjana Vishwanathapura wrote:
>>>>   >       >> > +VM_BIND/UNBIND ioctl will immediately start
>>>>   binding/unbinding
>>>>   >       the mapping in an
>>>>   >       >> > +async worker. The binding and unbinding will 
>>>>work like a
>>>>   special
>>>>   >       GPU engine.
>>>>   >       >> > +The binding and unbinding operations are serialized and
>>>>   will
>>>>   >       wait on specified
>>>>   >       >> > +input fences before the operation and will signal the
>>>>   output
>>>>   >       fences upon the
>>>>   >       >> > +completion of the operation. Due to serialization,
>>>>   completion of
>>>>   >       an operation
>>>>   >       >> > +will also indicate that all previous operations 
>>>>are also
>>>>   >       complete.
>>>>   >       >>
>>>>   >       >> I guess we should avoid saying "will immediately start
>>>>   >       binding/unbinding" if
>>>>   >       >> there are fences involved.
>>>>   >       >>
>>>>   >       >> And the fact that it's happening in an async 
>>>>worker seem to
>>>>   imply
>>>>   >       it's not
>>>>   >       >> immediate.
>>>>   >       >>
>>>>   >
>>>>   >       Ok, will fix.
>>>>   >       This was added because in earlier design binding was deferred
>>>>   until
>>>>   >       next execbuff.
>>>>   >       But now it is non-deferred (immediate in that sense). 
>>>>But yah,
>>>>   this is
>>>>   >       confusing
>>>>   >       and will fix it.
>>>>   >
>>>>   >       >>
>>>>   >       >> I have a question on the behavior of the bind 
>>>>operation when
>>>>   no
>>>>   >       input fence
>>>>   >       >> is provided. Let's say I do:
>>>>   >       >>
>>>>   >       >> VM_BIND (out_fence=fence1)
>>>>   >       >>
>>>>   >       >> VM_BIND (out_fence=fence2)
>>>>   >       >>
>>>>   >       >> VM_BIND (out_fence=fence3)
>>>>   >       >>
>>>>   >       >>
>>>>   >       >> In what order are the fences going to be signaled?
>>>>   >       >>
>>>>   >       >> In the order of VM_BIND ioctls? Or out of order?
>>>>   >       >>
>>>>   >       >> Because you wrote "serialized" I assume it's: in order
>>>>   >       >>
>>>>   >
>>>>   >       Yes, in the order of VM_BIND/UNBIND ioctls. Note that 
>>>>bind and
>>>>   unbind
>>>>   >       will use
>>>>   >       the same queue and hence are ordered.
>>>>   >
>>>>   >       >>
>>>>   >       >> One thing I didn't realize is that because we only get one
>>>>   >       "VM_BIND" engine,
>>>>   >       >> there is a disconnect from the Vulkan specification.
>>>>   >       >>
>>>>   >       >> In Vulkan VM_BIND operations are serialized but 
>>>>per engine.
>>>>   >       >>
>>>>   >       >> So you could have something like this :
>>>>   >       >>
>>>>   >       >> VM_BIND (engine=rcs0, in_fence=fence1, out_fence=fence2)
>>>>   >       >>
>>>>   >       >> VM_BIND (engine=ccs0, in_fence=fence3, out_fence=fence4)
>>>>   >       >>
>>>>   >       >>
>>>>   >       >> fence1 is not signaled
>>>>   >       >>
>>>>   >       >> fence3 is signaled
>>>>   >       >>
>>>>   >       >> So the second VM_BIND will proceed before the 
>>>>first VM_BIND.
>>>>   >       >>
>>>>   >       >>
>>>>   >       >> I guess we can deal with that scenario in 
>>>>userspace by doing
>>>>   the
>>>>   >       wait
>>>>   >       >> ourselves in one thread per engines.
>>>>   >       >>
>>>>   >       >> But then it makes the VM_BIND input fences useless.
>>>>   >       >>
>>>>   >       >>
>>>>   >       >> Daniel: what do you think? Should we rework this or just
>>>>   deal with
>>>>   >       wait
>>>>   >       >> fences in userspace?
>>>>   >       >>
>>>>   >       >
>>>>   >       >My opinion is rework this but make the ordering via 
>>>>an engine
>>>>   param
>>>>   >       optional.
>>>>   >       >
>>>>   >       >e.g. A VM can be configured so all binds are ordered 
>>>>within the
>>>>   VM
>>>>   >       >
>>>>   >       >e.g. A VM can be configured so all binds accept an engine
>>>>   argument
>>>>   >       (in
>>>>   >       >the case of the i915 likely this is a gem context 
>>>>handle) and
>>>>   binds
>>>>   >       >ordered with respect to that engine.
>>>>   >       >
>>>>   >       >This gives UMDs options as the latter likely consumes
>>>>more KMD
>>>>   >       resources
>>>>   >       >so if a different UMD can live with binds being 
>>>>ordered within
>>>>   the VM
>>>>   >       >they can use a mode consuming less resources.
>>>>   >       >
>>>>   >
>>>>   >       I think we need to be careful here if we are looking for some
>>>>   out of
>>>>   >       (submission) order completion of vm_bind/unbind.
>>>>   >       In-order completion means, in a batch of binds and 
>>>>unbinds to be
>>>>   >       completed in-order, user only needs to specify 
>>>>in-fence for the
>>>>   >       first bind/unbind call and the out-fence for the last
>>>>   bind/unbind
>>>>   >       call. Also, the VA released by an unbind call can be 
>>>>re-used by
>>>>   >       any subsequent bind call in that in-order batch.
>>>>   >
>>>>   >       These things will break if binding/unbinding were to 
>>>>be allowed
>>>>   to
>>>>   >       go out of order (of submission) and user need to be extra
>>>>   careful
>>>>   >       not to run into premature triggering of out-fence and bind
>>>>   failing
>>>>   >       as VA is still in use etc.
>>>>   >
>>>>   >       Also, VM_BIND binds the provided mapping on the specified
>>>>   address
>>>>   >       space
>>>>   >       (VM). So, the uapi is not engine/context specific.
>>>>   >
>>>>   >       We can however add a 'queue' to the uapi which can be 
>>>>one from
>>>>   the
>>>>   >       pre-defined queues,
>>>>   >       I915_VM_BIND_QUEUE_0
>>>>   >       I915_VM_BIND_QUEUE_1
>>>>   >       ...
>>>>   >       I915_VM_BIND_QUEUE_(N-1)
>>>>   >
>>>>   >       KMD will spawn an async work queue for each queue which will
>>>>   only
>>>>   >       bind the mappings on that queue in the order of submission.
>>>>   >       User can assign a queue per engine or anything
>>>>like that.
>>>>   >
>>>>   >       But again here, the user needs to be careful not to
>>>>deadlock these
>>>>   >       queues with circular dependency of fences.
>>>>   >
>>>>   >       I prefer adding this later as an extension based on
>>>>whether it
>>>>   >       is really helping with the implementation.
>>>>   >
>>>>   >     I can tell you right now that having everything on a single
>>>>   in-order
>>>>   >     queue will not get us the perf we want.  What vulkan 
>>>>really wants
>>>>   is one
>>>>   >     of two things:
>>>>   >      1. No implicit ordering of VM_BIND ops.  They just happen in
>>>>   whatever
>>>>   >     their dependencies are resolved and we ensure ordering 
>>>>ourselves
>>>>   by
>>>>   >     having a syncobj in the VkQueue.
>>>>   >      2. The ability to create multiple VM_BIND queues.  We need at
>>>>   least 2
>>>>   >     but I don't see why there needs to be a limit besides 
>>>>the limits
>>>>   the
>>>>   >     i915 API already has on the number of engines.  Vulkan could
>>>>   expose
>>>>   >     multiple sparse binding queues to the client if it's not
>>>>   arbitrarily
>>>>   >     limited.
>>>>
>>>>   Thanks Jason, Lionel.
>>>>
>>>>   Jason, what are you referring to when you say "limits the i915 API
>>>>   already
>>>>   has on the number of engines"? I am not sure if there is such an uapi
>>>>   today.
>>>>
>>>> There's a limit of something like 64 total engines today based on the
>>>> number of bits we can cram into the exec flags in execbuffer2.  I think
>>>> someone had an extended version that allowed more but I ripped it out
>>>> because no one was using it.  Of course, execbuffer3 might not 
>>>>have that
>>>> problem at all.
>>>>
>>>
>>>Thanks Jason.
>>>Ok, I am not sure which exec flag that is, but yah, execbuffer3 probably
>>>will not have this limitation. So, we need to define a VM_BIND_MAX_QUEUE
>>>and somehow export it to user (I am thinking of embedding it in
>>>I915_PARAM_HAS_VM_BIND. bits[0]->HAS_VM_BIND, bits[1-3]->'n' meaning 2^n
>>>queues).
>>
>>Ah, I think you are talking about I915_EXEC_RING_MASK (0x3f) which execbuf3
>>will also have. So, we can simply define in vm_bind/unbind structures,
>>
>>#define I915_VM_BIND_MAX_QUEUE   64
>>        __u32 queue;
>>
>>I think that will keep things simple.
>
>Hmmm? What does the execbuf2 limit have to do with how many engines the
>hardware can have? I suggest not doing that.
>
>The change which added this:
>
>	if (set.num_engines > I915_EXEC_RING_MASK + 1)
>		return -EINVAL;
>
>to context creation needs to be undone, so that users can create engine
>maps with all hardware engines, and execbuf3 can access them all.
>

The earlier plan was to carry I915_EXEC_RING_MASK (0x3f) over to execbuf3 also.
Hence, I was using the same limit for VM_BIND queues (64, or 65 if we
make it N+1).
But, as discussed in another thread of this RFC series, we are planning
to drop I915_EXEC_RING_MASK in execbuf3. So, there won't be
any uapi that limits the number of engines (and hence the number of
vm_bind queues that need to be supported).

If we leave the number of vm_bind queues arbitrarily large
(__u32 queue_idx), then we need a hashmap for queue lookup (a wq,
work_item and a linked list per queue) keyed on the user specified
queue index. The other option is to put some hard limit (say 64 or 65)
and use an array of queues in the VM (each created upon first use). I
prefer the latter.
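For illustration, a rough kernel-side sketch of that fixed-array option (all names here are hypothetical, just to show the lookup staying O(1) with a hard limit instead of a hashmap):

#include <linux/err.h>
#include <linux/list.h>
#include <linux/types.h>
#include <linux/workqueue.h>

#define I915_VM_BIND_MAX_QUEUE	64

/* Hypothetical per-VM bind queue; created lazily on first use. */
struct i915_vm_bind_queue {
	struct workqueue_struct *wq;	/* async bind/unbind worker */
	struct list_head pending;	/* in-order bind/unbind work items */
	bool created;
};

struct i915_vm_bind_queues {
	struct i915_vm_bind_queue q[I915_VM_BIND_MAX_QUEUE];
};

static struct i915_vm_bind_queue *
vm_bind_get_queue(struct i915_vm_bind_queues *queues, u32 queue_idx)
{
	if (queue_idx >= I915_VM_BIND_MAX_QUEUE)
		return ERR_PTR(-EINVAL);	/* hard limit, no hashmap needed */

	return &queues->q[queue_idx];	/* caller creates the wq on first use */
}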

Niranjana

>Regards,
>
>Tvrtko
>
>>
>>Niranjana
>>
>>>
>>>>   I am trying to see how many queues we need and don't want it to be
>>>>   arbitrarily
>>>>   large and unduly blow up memory usage and complexity in the i915 driver.
>>>>
>>>> I expect a Vulkan driver to use at most 2 in the vast majority 
>>>>of cases. I
>>>> could imagine a client wanting to create more than 1 sparse 
>>>>queue in which
>>>> case, it'll be N+1 but that's unlikely.  As far as complexity 
>>>>goes, once
>>>> you allow two, I don't think the complexity is going up by 
>>>>allowing N.  As
>>>> for memory usage, creating more queues means more memory.  That's a
>>>> trade-off that userspace can make.  Again, the expected number 
>>>>here is 1
>>>> or 2 in the vast majority of cases so I don't think you need to worry.
>>>
>>>Ok, will start with n=3 meaning 8 queues.
>>>That would require us to create 8 workqueues.
>>>We can change 'n' later if required.
>>>
>>>Niranjana
>>>
>>>>
>>>>   >     Why?  Because Vulkan has two basic kind of bind 
>>>>operations and we
>>>>   don't
>>>>   >     want any dependencies between them:
>>>>   >      1. Immediate.  These happen right after BO creation or 
>>>>maybe as
>>>>   part of
>>>>   >     vkBindImageMemory() or VkBindBufferMemory().  These 
>>>>don't happen
>>>>   on a
>>>>   >     queue and we don't want them serialized with anything.  To
>>>>   synchronize
>>>>   >     with submit, we'll have a syncobj in the VkDevice which is
>>>>   signaled by
>>>>   >     all immediate bind operations and make submits wait on it.
>>>>   >      2. Queued (sparse): These happen on a VkQueue which may be the
>>>>   same as
>>>>   >     a render/compute queue or may be its own queue.  It's up to us
>>>>   what we
>>>>   >     want to advertise.  From the Vulkan API PoV, this is like any
>>>>   other
>>>>   >     queue.  Operations on it wait on and signal semaphores.  If we
>>>>   have a
>>>>   >     VM_BIND engine, we'd provide syncobjs to wait and 
>>>>signal just like
>>>>   we do
>>>>   >     in execbuf().
>>>>   >     The important thing is that we don't want one type of 
>>>>operation to
>>>>   block
>>>>   >     on the other.  If immediate binds are blocking on sparse binds,
>>>>   it's
>>>>   >     going to cause over-synchronization issues.
>>>>   >     In terms of the internal implementation, I know that 
>>>>there's going
>>>>   to be
>>>>   >     a lock on the VM and that we can't actually do these things in
>>>>   >     parallel.  That's fine.  Once the dma_fences have signaled and
>>>>   we're
>>>>
>>>>   That's correct. It is like a single VM_BIND engine with
>>>>multiple queues
>>>>   feeding to it.
>>>>
>>>> Right.  As long as the queues themselves are independent and 
>>>>can block on
>>>> dma_fences without holding up other queues, I think we're fine.
>>>>
>>>>   >     unblocked to do the bind operation, I don't care if 
>>>>there's a bit
>>>>   of
>>>>   >     synchronization due to locking.  That's expected.  What 
>>>>we can't
>>>>   afford
>>>>   >     to have is an immediate bind operation suddenly blocking on a
>>>>   sparse
>>>>   >     operation which is blocked on a compute job that's going to run
>>>>   for
>>>>   >     another 5ms.
>>>>
>>>>   As the VM_BIND queue is per VM, VM_BIND on one VM doesn't block the
>>>>   VM_BIND
>>>>   on other VMs. I am not sure about usecases here, but just wanted to
>>>>   clarify.
>>>>
>>>> Yes, that's what I would expect.
>>>> --Jason
>>>>
>>>>   Niranjana
>>>>
>>>>   >     For reference, Windows solves this by allowing arbitrarily many
>>>>   paging
>>>>   >     queues (what they call a VM_BIND engine/queue).  That 
>>>>design works
>>>>   >     pretty well and solves the problems in question.  
>>>>Again, we could
>>>>   just
>>>>   >     make everything out-of-order and require using syncobjs 
>>>>to order
>>>>   things
>>>>   >     as userspace wants. That'd be fine too.
>>>>   >     One more note while I'm here: danvet said something on 
>>>>IRC about
>>>>   VM_BIND
>>>>   >     queues waiting for syncobjs to materialize.  We don't really
>>>>   want/need
>>>>   >     this.  We already have all the machinery in userspace to handle
>>>>   >     wait-before-signal and waiting for syncobj fences to 
>>>>materialize
>>>>   and
>>>>   >     that machinery is on by default.  It would actually 
>>>>take MORE work
>>>>   in
>>>>   >     Mesa to turn it off and take advantage of the kernel 
>>>>being able to
>>>>   wait
>>>>   >     for syncobjs to materialize.  Also, getting that right is
>>>>   ridiculously
>>>>   >     hard and I really don't want to get it wrong in kernel 
>>>>space. When we
>>>>   >     do memory fences, wait-before-signal will be a thing.  We don't
>>>>   need to
>>>>   >     try and make it a thing for syncobj.
>>>>   >     --Jason
>>>>   >
>>>>   >   Thanks Jason,
>>>>   >
>>>>   >   I missed the bit in the Vulkan spec that we're allowed to have a
>>>>   sparse
>>>>   >   queue that does not implement either graphics or compute 
>>>>operations
>>>>   :
>>>>   >
>>>>   >     "While some implementations may include
>>>>   VK_QUEUE_SPARSE_BINDING_BIT
>>>>   >     support in queue families that also include
>>>>   >
>>>>   >      graphics and compute support, other implementations may only
>>>>   expose a
>>>>   >     VK_QUEUE_SPARSE_BINDING_BIT-only queue
>>>>   >
>>>>   >      family."
>>>>   >
>>>>   >   So it can all be a vm_bind engine that just does bind/unbind
>>>>   >   operations.
>>>>   >
>>>>   >   But yes we need another engine for the immediate/non-sparse
>>>>   operations.
>>>>   >
>>>>   >   -Lionel
>>>>   >
>>>>   >         >
>>>>   >       Daniel, any thoughts?
>>>>   >
>>>>   >       Niranjana
>>>>   >
>>>>   >       >Matt
>>>>   >       >
>>>>   >       >>
>>>>   >       >> Sorry I noticed this late.
>>>>   >       >>
>>>>   >       >>
>>>>   >       >> -Lionel
>>>>   >       >>
>>>>   >       >>

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [Intel-gfx] [RFC v3 1/3] drm/doc/rfc: VM_BIND feature design document
  2022-06-08 21:44                         ` Niranjana Vishwanathapura
@ 2022-06-08 21:55                           ` Jason Ekstrand
  -1 siblings, 0 replies; 121+ messages in thread
From: Jason Ekstrand @ 2022-06-08 21:55 UTC (permalink / raw)
  To: Niranjana Vishwanathapura
  Cc: Tvrtko Ursulin, Intel GFX, Chris Wilson, Thomas Hellstrom,
	Maling list - DRI developers, Daniel Vetter,
	Christian König

[-- Attachment #1: Type: text/plain, Size: 17462 bytes --]

On Wed, Jun 8, 2022 at 4:44 PM Niranjana Vishwanathapura <
niranjana.vishwanathapura@intel.com> wrote:

> On Wed, Jun 08, 2022 at 08:33:25AM +0100, Tvrtko Ursulin wrote:
> >
> >
> >On 07/06/2022 22:32, Niranjana Vishwanathapura wrote:
> >>On Tue, Jun 07, 2022 at 11:18:11AM -0700, Niranjana Vishwanathapura
> wrote:
> >>>On Tue, Jun 07, 2022 at 12:12:03PM -0500, Jason Ekstrand wrote:
> >>>> On Fri, Jun 3, 2022 at 6:52 PM Niranjana Vishwanathapura
> >>>> <niranjana.vishwanathapura@intel.com> wrote:
> >>>>
> >>>>   On Fri, Jun 03, 2022 at 10:20:25AM +0300, Lionel Landwerlin wrote:
> >>>>   >   On 02/06/2022 23:35, Jason Ekstrand wrote:
> >>>>   >
> >>>>   >     On Thu, Jun 2, 2022 at 3:11 PM Niranjana Vishwanathapura
> >>>>   >     <niranjana.vishwanathapura@intel.com> wrote:
> >>>>   >
> >>>>   >       On Wed, Jun 01, 2022 at 01:28:36PM -0700, Matthew
> >>>>Brost wrote:
> >>>>   >       >On Wed, Jun 01, 2022 at 05:25:49PM +0300, Lionel Landwerlin
> >>>>   wrote:
> >>>>   >       >> On 17/05/2022 21:32, Niranjana Vishwanathapura wrote:
> >>>>   >       >> > +VM_BIND/UNBIND ioctl will immediately start
> >>>>   binding/unbinding
> >>>>   >       the mapping in an
> >>>>   >       >> > +async worker. The binding and unbinding will
> >>>>work like a
> >>>>   special
> >>>>   >       GPU engine.
> >>>>   >       >> > +The binding and unbinding operations are serialized
> and
> >>>>   will
> >>>>   >       wait on specified
> >>>>   >       >> > +input fences before the operation and will signal the
> >>>>   output
> >>>>   >       fences upon the
> >>>>   >       >> > +completion of the operation. Due to serialization,
> >>>>   completion of
> >>>>   >       an operation
> >>>>   >       >> > +will also indicate that all previous operations
> >>>>are also
> >>>>   >       complete.
> >>>>   >       >>
> >>>>   >       >> I guess we should avoid saying "will immediately start
> >>>>   >       binding/unbinding" if
> >>>>   >       >> there are fences involved.
> >>>>   >       >>
> >>>>   >       >> And the fact that it's happening in an async
> >>>>worker seem to
> >>>>   imply
> >>>>   >       it's not
> >>>>   >       >> immediate.
> >>>>   >       >>
> >>>>   >
> >>>>   >       Ok, will fix.
> >>>>   >       This was added because in earlier design binding was
> deferred
> >>>>   until
> >>>>   >       next execbuff.
> >>>>   >       But now it is non-deferred (immediate in that sense).
> >>>>But yah,
> >>>>   this is
> >>>>   >       confusing
> >>>>   >       and will fix it.
> >>>>   >
> >>>>   >       >>
> >>>>   >       >> I have a question on the behavior of the bind
> >>>>operation when
> >>>>   no
> >>>>   >       input fence
> >>>>   >       >> is provided. Let's say I do:
> >>>>   >       >>
> >>>>   >       >> VM_BIND (out_fence=fence1)
> >>>>   >       >>
> >>>>   >       >> VM_BIND (out_fence=fence2)
> >>>>   >       >>
> >>>>   >       >> VM_BIND (out_fence=fence3)
> >>>>   >       >>
> >>>>   >       >>
> >>>>   >       >> In what order are the fences going to be signaled?
> >>>>   >       >>
> >>>>   >       >> In the order of VM_BIND ioctls? Or out of order?
> >>>>   >       >>
> >>>>   >       >> Because you wrote "serialized" I assume it's: in order
> >>>>   >       >>
> >>>>   >
> >>>>   >       Yes, in the order of VM_BIND/UNBIND ioctls. Note that
> >>>>bind and
> >>>>   unbind
> >>>>   >       will use
> >>>>   >       the same queue and hence are ordered.
> >>>>   >
> >>>>   >       >>
> >>>>   >       >> One thing I didn't realize is that because we only get
> one
> >>>>   >       "VM_BIND" engine,
> >>>>   >       >> there is a disconnect from the Vulkan specification.
> >>>>   >       >>
> >>>>   >       >> In Vulkan VM_BIND operations are serialized but
> >>>>per engine.
> >>>>   >       >>
> >>>>   >       >> So you could have something like this :
> >>>>   >       >>
> >>>>   >       >> VM_BIND (engine=rcs0, in_fence=fence1, out_fence=fence2)
> >>>>   >       >>
> >>>>   >       >> VM_BIND (engine=ccs0, in_fence=fence3, out_fence=fence4)
> >>>>   >       >>
> >>>>   >       >>
> >>>>   >       >> fence1 is not signaled
> >>>>   >       >>
> >>>>   >       >> fence3 is signaled
> >>>>   >       >>
> >>>>   >       >> So the second VM_BIND will proceed before the
> >>>>first VM_BIND.
> >>>>   >       >>
> >>>>   >       >>
> >>>>   >       >> I guess we can deal with that scenario in
> >>>>userspace by doing
> >>>>   the
> >>>>   >       wait
> >>>>   >       >> ourselves in one thread per engine.
> >>>>   >       >>
> >>>>   >       >> But then it makes the VM_BIND input fences useless.
> >>>>   >       >>
> >>>>   >       >>
> >>>>   >       >> Daniel: what do you think? Should we rework this or just
> >>>>   deal with
> >>>>   >       wait
> >>>>   >       >> fences in userspace?
> >>>>   >       >>
> >>>>   >       >
> >>>>   >       >My opinion is rework this but make the ordering via
> >>>>an engine
> >>>>   param
> >>>>   >       optional.
> >>>>   >       >
> >>>>   >       >e.g. A VM can be configured so all binds are ordered
> >>>>within the
> >>>>   VM
> >>>>   >       >
> >>>>   >       >e.g. A VM can be configured so all binds accept an engine
> >>>>   argument
> >>>>   >       (in
> >>>>   >       >the case of the i915 likely this is a gem context
> >>>>handle) and
> >>>>   binds
> >>>>   >       >ordered with respect to that engine.
> >>>>   >       >
> >>>>   >       >This gives UMDs options as the latter likely consumes
> >>>>more KMD
> >>>>   >       resources
> >>>>   >       >so if a different UMD can live with binds being
> >>>>ordered within
> >>>>   the VM
> >>>>   >       >they can use a mode consuming less resources.
> >>>>   >       >
> >>>>   >
> >>>>   >       I think we need to be careful here if we are looking for
> some
> >>>>   out of
> >>>>   >       (submission) order completion of vm_bind/unbind.
> >>>>   >       In-order completion means, in a batch of binds and
> >>>>unbinds to be
> >>>>   >       completed in-order, user only needs to specify
> >>>>in-fence for the
> >>>>   >       first bind/unbind call and the out-fence for the last
> >>>>   bind/unbind
> >>>>   >       call. Also, the VA released by an unbind call can be
> >>>>re-used by
> >>>>   >       any subsequent bind call in that in-order batch.
> >>>>   >
> >>>>   >       These things will break if binding/unbinding were to
> >>>>be allowed
> >>>>   to
> >>>>   >       go out of order (of submission) and users need to be extra
> >>>>   careful
> >>>>   >       not to run into premature triggering of the out-fence and binds
> >>>>   failing
> >>>>   >       as the VA is still in use, etc.
> >>>>   >
> >>>>   >       Also, VM_BIND binds the provided mapping on the specified
> >>>>   address
> >>>>   >       space
> >>>>   >       (VM). So, the uapi is not engine/context specific.
> >>>>   >
> >>>>   >       We can however add a 'queue' to the uapi which can be
> >>>>one from
> >>>>   the
> >>>>   >       pre-defined queues,
> >>>>   >       I915_VM_BIND_QUEUE_0
> >>>>   >       I915_VM_BIND_QUEUE_1
> >>>>   >       ...
> >>>>   >       I915_VM_BIND_QUEUE_(N-1)
> >>>>   >
> >>>>   >       KMD will spawn an async work queue for each queue which will
> >>>>   only
> >>>>   >       bind the mappings on that queue in the order of submission.
> >>>>   >       User can assign the queue to per engine or anything
> >>>>like that.
> >>>>   >
> >>>>   >       But again here, users need to be careful not to
> >>>>deadlock these
> >>>>   >       queues with a circular dependency of fences.
> >>>>   >
> >>>>   >       I prefer adding this later as an extension based on
> >>>>whether it
> >>>>   >       is really helping with the implementation.
> >>>>   >
> >>>>   >     I can tell you right now that having everything on a single
> >>>>   in-order
> >>>>   >     queue will not get us the perf we want.  What vulkan
> >>>>really wants
> >>>>   is one
> >>>>   >     of two things:
> >>>>   >      1. No implicit ordering of VM_BIND ops.  They just happen in
> >>>>   whatever order
> >>>>   >     their dependencies are resolved and we ensure ordering
> >>>>ourselves
> >>>>   by
> >>>>   >     having a syncobj in the VkQueue.
> >>>>   >      2. The ability to create multiple VM_BIND queues.  We need at
> >>>>   least 2
> >>>>   >     but I don't see why there needs to be a limit besides
> >>>>the limits
> >>>>   the
> >>>>   >     i915 API already has on the number of engines.  Vulkan could
> >>>>   expose
> >>>>   >     multiple sparse binding queues to the client if it's not
> >>>>   arbitrarily
> >>>>   >     limited.
> >>>>
> >>>>   Thanks Jason, Lionel.
> >>>>
> >>>>   Jason, what are you referring to when you say "limits the i915 API
> >>>>   already
> >>>>   has on the number of engines"? I am not sure if there is such an
> uapi
> >>>>   today.
> >>>>
> >>>> There's a limit of something like 64 total engines today based on the
> >>>> number of bits we can cram into the exec flags in execbuffer2.  I
> think
> >>>> someone had an extended version that allowed more but I ripped it out
> >>>> because no one was using it.  Of course, execbuffer3 might not
> >>>>have that
> >>>> problem at all.
> >>>>
> >>>
> >>>Thanks Jason.
> >>>Ok, I am not sure which exec flag is that, but yah, execbuffer3 probably
> >>>will not have this limitation. So, we need to define a VM_BIND_MAX_QUEUE
> >>>and somehow export it to user (I am thinking of embedding it in
> >>>I915_PARAM_HAS_VM_BIND. bits[0]->HAS_VM_BIND, bits[1-3]->'n' meaning 2^n
> >>>queues).
> >>
> >>Ah, I think you are talking about I915_EXEC_RING_MASK (0x3f) which
> execbuf3
>

Yup!  That's exactly the limit I was talking about.
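
For reference, a rough sketch of that limit and of the I915_PARAM_HAS_VM_BIND
encoding proposed above. I915_EXEC_RING_MASK is the existing execbuffer2 mask;
the helper below is illustrative only, not uapi:

#include <stdint.h>

/* The low 6 bits of the execbuffer2 flags select the engine, so at
 * most I915_EXEC_RING_MASK + 1 == 64 engines are addressable. */
#define I915_EXEC_RING_MASK 0x3f

/* bits[0] -> HAS_VM_BIND, bits[1-3] -> 'n', meaning 2^n bind queues */
static unsigned int num_vm_bind_queues(uint64_t param)
{
        if (!(param & 0x1))
                return 0;       /* VM_BIND not supported */

        return 1u << ((param >> 1) & 0x7);
}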


> >>will also have. So, we can simply define in vm_bind/unbind structures,
> >>
> >>#define I915_VM_BIND_MAX_QUEUE   64
> >>        __u32 queue;
> >>
> >>I think that will keep things simple.
> >
> >Hmmm? What does the execbuf2 limit have to do with how many engines
> >hardware can have? I suggest not to do that.
> >
> >The change which added this:
> >
> >       if (set.num_engines > I915_EXEC_RING_MASK + 1)
> >               return -EINVAL;
> >
> >to context creation needs to be undone, so let users create engine
> >maps with all hardware engines, and let execbuf3 access them all.
> >
>
> The earlier plan was to carry I915_EXEC_RING_MASK (0x3f) over to execbuf3
> also. Hence, I was using the same limit for VM_BIND queues (64, or 65 if we
> make it N+1).
> But, as discussed in another thread of this RFC series, we are planning
> to drop I915_EXEC_RING_MASK in execbuf3. So, there won't be
> any uapi that limits the number of engines (and hence the number of
> vm_bind queues that need to be supported).
>
> If we leave the number of vm_bind queues arbitrarily large
> (__u32 queue_idx), then we need to have a hashmap for queue (a wq,
> work_item and a linked list) lookup from the user-specified queue index.
> The other option is to just put some hard limit (say 64 or 65) and use
> an array of queues in the VM (each created upon first use). I prefer this.
>

I don't get why a VM_BIND queue is any different from any other queue or
userspace-visible kernel object.  But I'll leave those details up to danvet
or whoever else might be reviewing the implementation.

--Jason



>
> Niranjana
>
> >Regards,
> >
> >Tvrtko
> >
> >>
> >>Niranjana
> >>
> >>>
> >>>>   I am trying to see how many queues we need and don't want it to be
> >>>>   arbitrarily
> >>>>   large and unduly blow up memory usage and complexity in i915
> driver.
> >>>>
> >>>> I expect a Vulkan driver to use at most 2 in the vast majority
> >>>>of cases. I
> >>>> could imagine a client wanting to create more than 1 sparse
> >>>>queue in which
> >>>> case, it'll be N+1 but that's unlikely.  As far as complexity
> >>>>goes, once
> >>>> you allow two, I don't think the complexity is going up by
> >>>>allowing N.  As
> >>>> for memory usage, creating more queues means more memory.  That's a
> >>>> trade-off that userspace can make.  Again, the expected number
> >>>>here is 1
> >>>> or 2 in the vast majority of cases so I don't think you need to worry.
> >>>
> >>>Ok, will start with n=3 meaning 8 queues.
> >>>That would require us to create 8 workqueues.
> >>>We can change 'n' later if required.
> >>>
> >>>Niranjana
> >>>
> >>>>
> >>>>   >     Why?  Because Vulkan has two basic kind of bind
> >>>>operations and we
> >>>>   don't
> >>>>   >     want any dependencies between them:
> >>>>   >      1. Immediate.  These happen right after BO creation or
> >>>>maybe as
> >>>>   part of
> >>>>   >     vkBindImageMemory() or VkBindBufferMemory().  These
> >>>>don't happen
> >>>>   on a
> >>>>   >     queue and we don't want them serialized with anything.  To
> >>>>   synchronize
> >>>>   >     with submit, we'll have a syncobj in the VkDevice which is
> >>>>   signaled by
> >>>>   >     all immediate bind operations and make submits wait on it.
> >>>>   >      2. Queued (sparse): These happen on a VkQueue which may be
> the
> >>>>   same as
> >>>>   >     a render/compute queue or may be its own queue.  It's up to us
> >>>>   what we
> >>>>   >     want to advertise.  From the Vulkan API PoV, this is like any
> >>>>   other
> >>>>   >     queue.  Operations on it wait on and signal semaphores.  If we
> >>>>   have a
> >>>>   >     VM_BIND engine, we'd provide syncobjs to wait and
> >>>>signal just like
> >>>>   we do
> >>>>   >     in execbuf().
> >>>>   >     The important thing is that we don't want one type of
> >>>>operation to
> >>>>   block
> >>>>   >     on the other.  If immediate binds are blocking on sparse
> binds,
> >>>>   it's
> >>>>   >     going to cause over-synchronization issues.
> >>>>   >     In terms of the internal implementation, I know that
> >>>>there's going
> >>>>   to be
> >>>>   >     a lock on the VM and that we can't actually do these things in
> >>>>   >     parallel.  That's fine.  Once the dma_fences have signaled and
> >>>>   we're
> >>>>
> >>>>   That's correct. It is like a single VM_BIND engine with
> >>>>multiple queues
> >>>>   feeding to it.
> >>>>
> >>>> Right.  As long as the queues themselves are independent and
> >>>>can block on
> >>>> dma_fences without holding up other queues, I think we're fine.
> >>>>
> >>>>   >     unblocked to do the bind operation, I don't care if
> >>>>there's a bit
> >>>>   of
> >>>>   >     synchronization due to locking.  That's expected.  What
> >>>>we can't
> >>>>   afford
> >>>>   >     to have is an immediate bind operation suddenly blocking on a
> >>>>   sparse
> >>>>   >     operation which is blocked on a compute job that's going to
> run
> >>>>   for
> >>>>   >     another 5ms.
> >>>>
> >>>>   As the VM_BIND queue is per VM, VM_BIND on one VM doesn't block the
> >>>>   VM_BIND
> >>>>   on other VMs. I am not sure about usecases here, but just wanted to
> >>>>   clarify.
> >>>>
> >>>> Yes, that's what I would expect.
> >>>> --Jason
> >>>>
> >>>>   Niranjana
> >>>>
> >>>>   >     For reference, Windows solves this by allowing arbitrarily
> many
> >>>>   paging
> >>>>   >     queues (what they call a VM_BIND engine/queue).  That
> >>>>design works
> >>>>   >     pretty well and solves the problems in question.
> >>>>Again, we could
> >>>>   just
> >>>>   >     make everything out-of-order and require using syncobjs
> >>>>to order
> >>>>   things
> >>>>   >     as userspace wants. That'd be fine too.
> >>>>   >     One more note while I'm here: danvet said something on
> >>>>IRC about
> >>>>   VM_BIND
> >>>>   >     queues waiting for syncobjs to materialize.  We don't really
> >>>>   want/need
> >>>>   >     this.  We already have all the machinery in userspace to
> handle
> >>>>   >     wait-before-signal and waiting for syncobj fences to
> >>>>materialize
> >>>>   and
> >>>>   >     that machinery is on by default.  It would actually
> >>>>take MORE work
> >>>>   in
> >>>>   >     Mesa to turn it off and take advantage of the kernel
> >>>>being able to
> >>>>   wait
> >>>>   >     for syncobjs to materialize.  Also, getting that right is
> >>>>   ridiculously
> >>>>   >     hard and I really don't want to get it wrong in kernel
> >>>>space.     When we
> >>>>   >     do memory fences, wait-before-signal will be a thing.  We
> don't
> >>>>   need to
> >>>>   >     try and make it a thing for syncobj.
> >>>>   >     --Jason
> >>>>   >
> >>>>   >   Thanks Jason,
> >>>>   >
> >>>>   >   I missed the bit in the Vulkan spec that we're allowed to have a
> >>>>   sparse
> >>>>   >   queue that does not implement either graphics or compute
> >>>>operations
> >>>>   :
> >>>>   >
> >>>>   >     "While some implementations may include
> >>>>   VK_QUEUE_SPARSE_BINDING_BIT
> >>>>   >     support in queue families that also include
> >>>>   >
> >>>>   >      graphics and compute support, other implementations may only
> >>>>   expose a
> >>>>   >     VK_QUEUE_SPARSE_BINDING_BIT-only queue
> >>>>   >
> >>>>   >      family."
> >>>>   >
> >>>>   >   So it can all be a vm_bind engine that just does bind/unbind
> >>>>   >   operations.
> >>>>   >
> >>>>   >   But yes we need another engine for the immediate/non-sparse
> >>>>   operations.
> >>>>   >
> >>>>   >   -Lionel
> >>>>   >
> >>>>   >         >
> >>>>   >       Daniel, any thoughts?
> >>>>   >
> >>>>   >       Niranjana
> >>>>   >
> >>>>   >       >Matt
> >>>>   >       >
> >>>>   >       >>
> >>>>   >       >> Sorry I noticed this late.
> >>>>   >       >>
> >>>>   >       >>
> >>>>   >       >> -Lionel
> >>>>   >       >>
> >>>>   >       >>
>

[-- Attachment #2: Type: text/html, Size: 27898 bytes --]

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [Intel-gfx] [RFC v3 1/3] drm/doc/rfc: VM_BIND feature design document
  2022-06-08 21:55                           ` Jason Ekstrand
@ 2022-06-08 22:48                             ` Niranjana Vishwanathapura
  -1 siblings, 0 replies; 121+ messages in thread
From: Niranjana Vishwanathapura @ 2022-06-08 22:48 UTC (permalink / raw)
  To: Jason Ekstrand
  Cc: Tvrtko Ursulin, Intel GFX, Chris Wilson, Thomas Hellstrom,
	Maling list - DRI developers, Daniel Vetter,
	Christian König

On Wed, Jun 08, 2022 at 04:55:38PM -0500, Jason Ekstrand wrote:
>   On Wed, Jun 8, 2022 at 4:44 PM Niranjana Vishwanathapura
>   <niranjana.vishwanathapura@intel.com> wrote:
>
>     On Wed, Jun 08, 2022 at 08:33:25AM +0100, Tvrtko Ursulin wrote:
>     >
>     >
>     >On 07/06/2022 22:32, Niranjana Vishwanathapura wrote:
>     >>On Tue, Jun 07, 2022 at 11:18:11AM -0700, Niranjana Vishwanathapura
>     wrote:
>     >>>On Tue, Jun 07, 2022 at 12:12:03PM -0500, Jason Ekstrand wrote:
>     >>>> On Fri, Jun 3, 2022 at 6:52 PM Niranjana Vishwanathapura
>     >>>> <niranjana.vishwanathapura@intel.com> wrote:
>     >>>>
>     >>>>   On Fri, Jun 03, 2022 at 10:20:25AM +0300, Lionel Landwerlin
>     wrote:
>     >>>>   >   On 02/06/2022 23:35, Jason Ekstrand wrote:
>     >>>>   >
>     >>>>   >     On Thu, Jun 2, 2022 at 3:11 PM Niranjana Vishwanathapura
>     >>>>   >     <niranjana.vishwanathapura@intel.com> wrote:
>     >>>>   >
>     >>>>   >       On Wed, Jun 01, 2022 at 01:28:36PM -0700, Matthew
>     >>>>Brost wrote:
>     >>>>   >       >On Wed, Jun 01, 2022 at 05:25:49PM +0300, Lionel
>     Landwerlin
>     >>>>   wrote:
>     >>>>   >       >> On 17/05/2022 21:32, Niranjana Vishwanathapura wrote:
>     >>>>   >       >> > +VM_BIND/UNBIND ioctl will immediately start
>     >>>>   binding/unbinding
>     >>>>   >       the mapping in an
>     >>>>   >       >> > +async worker. The binding and unbinding will
>     >>>>work like a
>     >>>>   special
>     >>>>   >       GPU engine.
>     >>>>   >       >> > +The binding and unbinding operations are serialized
>     and
>     >>>>   will
>     >>>>   >       wait on specified
>     >>>>   >       >> > +input fences before the operation and will signal
>     the
>     >>>>   output
>     >>>>   >       fences upon the
>     >>>>   >       >> > +completion of the operation. Due to serialization,
>     >>>>   completion of
>     >>>>   >       an operation
>     >>>>   >       >> > +will also indicate that all previous operations
>     >>>>are also
>     >>>>   >       complete.
>     >>>>   >       >>
>     >>>>   >       >> I guess we should avoid saying "will immediately start
>     >>>>   >       binding/unbinding" if
>     >>>>   >       >> there are fences involved.
>     >>>>   >       >>
>     >>>>   >       >> And the fact that it's happening in an async
>     >>>>worker seems to
>     >>>>   imply
>     >>>>   >       it's not
>     >>>>   >       >> immediate.
>     >>>>   >       >>
>     >>>>   >
>     >>>>   >       Ok, will fix.
>     >>>>   >       This was added because in the earlier design binding was
>     deferred
>     >>>>   until
>     >>>>   >       next execbuf.
>     >>>>   >       But now it is non-deferred (immediate in that sense).
>     >>>>But yah,
>     >>>>   this is
>     >>>>   >       confusing
>     >>>>   >       and will fix it.
>     >>>>   >
>     >>>>   >       >>
>     >>>>   >       >> I have a question on the behavior of the bind
>     >>>>operation when
>     >>>>   no
>     >>>>   >       input fence
>     >>>>   >       >> is provided. Let's say I do:
>     >>>>   >       >>
>     >>>>   >       >> VM_BIND (out_fence=fence1)
>     >>>>   >       >>
>     >>>>   >       >> VM_BIND (out_fence=fence2)
>     >>>>   >       >>
>     >>>>   >       >> VM_BIND (out_fence=fence3)
>     >>>>   >       >>
>     >>>>   >       >>
>     >>>>   >       >> In what order are the fences going to be signaled?
>     >>>>   >       >>
>     >>>>   >       >> In the order of VM_BIND ioctls? Or out of order?
>     >>>>   >       >>
>     >>>>   >       >> Because you wrote "serialized" I assume it's: in order
>     >>>>   >       >>
>     >>>>   >
>     >>>>   >       Yes, in the order of VM_BIND/UNBIND ioctls. Note that
>     >>>>bind and
>     >>>>   unbind
>     >>>>   >       will use
>     >>>>   >       the same queue and hence are ordered.
>     >>>>   >
>     >>>>   >       >>
>     >>>>   >       >> One thing I didn't realize is that because we only get
>     one
>     >>>>   >       "VM_BIND" engine,
>     >>>>   >       >> there is a disconnect from the Vulkan specification.
>     >>>>   >       >>
>     >>>>   >       >> In Vulkan VM_BIND operations are serialized but
>     >>>>per engine.
>     >>>>   >       >>
>     >>>>   >       >> So you could have something like this :
>     >>>>   >       >>
>     >>>>   >       >> VM_BIND (engine=rcs0, in_fence=fence1,
>     out_fence=fence2)
>     >>>>   >       >>
>     >>>>   >       >> VM_BIND (engine=ccs0, in_fence=fence3,
>     out_fence=fence4)
>     >>>>   >       >>
>     >>>>   >       >>
>     >>>>   >       >> fence1 is not signaled
>     >>>>   >       >>
>     >>>>   >       >> fence3 is signaled
>     >>>>   >       >>
>     >>>>   >       >> So the second VM_BIND will proceed before the
>     >>>>first VM_BIND.
>     >>>>   >       >>
>     >>>>   >       >>
>     >>>>   >       >> I guess we can deal with that scenario in
>     >>>>userspace by doing
>     >>>>   the
>     >>>>   >       wait
>     >>>>   >       >> ourselves in one thread per engine.
>     >>>>   >       >>
>     >>>>   >       >> But then it makes the VM_BIND input fences useless.
>     >>>>   >       >>
>     >>>>   >       >>
>     >>>>   >       >> Daniel: what do you think? Should we rework this or
>     just
>     >>>>   deal with
>     >>>>   >       wait
>     >>>>   >       >> fences in userspace?
>     >>>>   >       >>
>     >>>>   >       >
>     >>>>   >       >My opinion is rework this but make the ordering via
>     >>>>an engine
>     >>>>   param
>     >>>>   >       optional.
>     >>>>   >       >
>     >>>>   >       >e.g. A VM can be configured so all binds are ordered
>     >>>>within the
>     >>>>   VM
>     >>>>   >       >
>     >>>>   >       >e.g. A VM can be configured so all binds accept an
>     engine
>     >>>>   argument
>     >>>>   >       (in
>     >>>>   >       >the case of the i915 likely this is a gem context
>     >>>>handle) and
>     >>>>   binds
>     >>>>   >       >ordered with respect to that engine.
>     >>>>   >       >
>     >>>>   >       >This gives UMDs options as the latter likely consumes
>     >>>>more KMD
>     >>>>   >       resources
>     >>>>   >       >so if a different UMD can live with binds being
>     >>>>ordered within
>     >>>>   the VM
>     >>>>   >       >they can use a mode consuming less resources.
>     >>>>   >       >
>     >>>>   >
>     >>>>   >       I think we need to be careful here if we are looking for
>     some
>     >>>>   out of
>     >>>>   >       (submission) order completion of vm_bind/unbind.
>     >>>>   >       In-order completion means, in a batch of binds and
>     >>>>unbinds to be
>     >>>>   >       completed in-order, user only needs to specify
>     >>>>in-fence for the
>     >>>>   >       first bind/unbind call and the out-fence for the last
>     >>>>   bind/unbind
>     >>>>   >       call. Also, the VA released by an unbind call can be
>     >>>>re-used by
>     >>>>   >       any subsequent bind call in that in-order batch.
>     >>>>   >
>     >>>>   >       These things will break if binding/unbinding were to
>     >>>>be allowed
>     >>>>   to
>     >>>>   >       go out of order (of submission) and users need to be extra
>     >>>>   careful
>     >>>>   >       not to run into premature triggering of the out-fence and
>     binds
>     >>>>   failing
>     >>>>   >       as the VA is still in use, etc.
>     >>>>   >
>     >>>>   >       Also, VM_BIND binds the provided mapping on the specified
>     >>>>   address
>     >>>>   >       space
>     >>>>   >       (VM). So, the uapi is not engine/context specific.
>     >>>>   >
>     >>>>   >       We can however add a 'queue' to the uapi which can be
>     >>>>one from
>     >>>>   the
>     >>>>   >       pre-defined queues,
>     >>>>   >       I915_VM_BIND_QUEUE_0
>     >>>>   >       I915_VM_BIND_QUEUE_1
>     >>>>   >       ...
>     >>>>   >       I915_VM_BIND_QUEUE_(N-1)
>     >>>>   >
>     >>>>   >       KMD will spawn an async work queue for each queue which
>     will
>     >>>>   only
>     >>>>   >       bind the mappings on that queue in the order of
>     submission.
>     >>>>   >       User can assign the queue to per engine or anything
>     >>>>like that.
>     >>>>   >
>     >>>>   >       But again here, users need to be careful not to
>     >>>>deadlock these
>     >>>>   >       queues with a circular dependency of fences.
>     >>>>   >
>     >>>>   >       I prefer adding this later as an extension based on
>     >>>>whether it
>     >>>>   >       is really helping with the implementation.
>     >>>>   >
>     >>>>   >     I can tell you right now that having everything on a single
>     >>>>   in-order
>     >>>>   >     queue will not get us the perf we want.  What vulkan
>     >>>>really wants
>     >>>>   is one
>     >>>>   >     of two things:
>     >>>>   >      1. No implicit ordering of VM_BIND ops.  They just happen
>     in
>     >>>>   whatever order
>     >>>>   >     their dependencies are resolved and we ensure ordering
>     >>>>ourselves
>     >>>>   by
>     >>>>   >     having a syncobj in the VkQueue.
>     >>>>   >      2. The ability to create multiple VM_BIND queues.  We need
>     at
>     >>>>   least 2
>     >>>>   >     but I don't see why there needs to be a limit besides
>     >>>>the limits
>     >>>>   the
>     >>>>   >     i915 API already has on the number of engines.  Vulkan
>     could
>     >>>>   expose
>     >>>>   >     multiple sparse binding queues to the client if it's not
>     >>>>   arbitrarily
>     >>>>   >     limited.
>     >>>>
>     >>>>   Thanks Jason, Lionel.
>     >>>>
>     >>>>   Jason, what are you referring to when you say "limits the i915
>     API
>     >>>>   already
>     >>>>   has on the number of engines"? I am not sure if there is such an
>     uapi
>     >>>>   today.
>     >>>>
>     >>>> There's a limit of something like 64 total engines today based on
>     the
>     >>>> number of bits we can cram into the exec flags in execbuffer2.  I
>     think
>     >>>> someone had an extended version that allowed more but I ripped it
>     out
>     >>>> because no one was using it.  Of course, execbuffer3 might not
>     >>>>have that
>     >>>> problem at all.
>     >>>>
>     >>>
>     >>>Thanks Jason.
>     >>>Ok, I am not sure which exec flag is that, but yah, execbuffer3
>     probably
>     >>>will not have this limitation. So, we need to define a
>     VM_BIND_MAX_QUEUE
>     >>>and somehow export it to user (I am thinking of embedding it in
>     >>>I915_PARAM_HAS_VM_BIND. bits[0]->HAS_VM_BIND, bits[1-3]->'n' meaning
>     2^n
>     >>>queues).
>     >>
>     >>Ah, I think you are talking about I915_EXEC_RING_MASK (0x3f) which
>     execbuf3
>
>   Yup!  That's exactly the limit I was talking about.
>    
>
>     >>will also have. So, we can simply define in vm_bind/unbind structures,
>     >>
>     >>#define I915_VM_BIND_MAX_QUEUE   64
>     >>        __u32 queue;
>     >>
>     >>I think that will keep things simple.
>     >
>     >Hmmm? What does the execbuf2 limit have to do with how many engines
>     >hardware can have? I suggest not to do that.
>     >
>     >The change which added this:
>     >
>     >       if (set.num_engines > I915_EXEC_RING_MASK + 1)
>     >               return -EINVAL;
>     >
>     >to context creation needs to be undone, so let users create engine
>     >maps with all hardware engines, and let execbuf3 access them all.
>     >
>
>     The earlier plan was to carry I915_EXEC_RING_MASK (0x3f) over to
>     execbuf3 also. Hence, I was using the same limit for VM_BIND queues
>     (64, or 65 if we make it N+1).
>     But, as discussed in another thread of this RFC series, we are planning
>     to drop I915_EXEC_RING_MASK in execbuf3. So, there won't be
>     any uapi that limits the number of engines (and hence the number of
>     vm_bind queues that need to be supported).
>
>     If we leave the number of vm_bind queues arbitrarily large
>     (__u32 queue_idx), then we need to have a hashmap for queue (a wq,
>     work_item and a linked list) lookup from the user-specified queue index.
>     The other option is to just put some hard limit (say 64 or 65) and use
>     an array of queues in the VM (each created upon first use). I prefer this.
>
>   I don't get why a VM_BIND queue is any different from any other queue or
>   userspace-visible kernel object.  But I'll leave those details up to
>   danvet or whoever else might be reviewing the implementation.

In execbuf3, if the user-specified execbuf3.engine_id is beyond the number of
available engines on the gem context, an error is returned to the user.
In the VM_BIND case, I am not sure how to do that bound check on the
user-specified queue_idx.
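
For illustration, that execbuf3-style check would be something like the
sketch below; the function and parameter names are placeholders, not the
final uapi:

#include <linux/errno.h>
#include <linux/types.h>

/* Reject an engine index beyond the gem context's engine map. */
static int validate_engine_id(u32 engine_id, u32 num_context_engines)
{
        if (engine_id >= num_context_engines)
                return -EINVAL;

        return 0;
}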

In any case, it is an implementation detail and we can use a hashmap for
the VM_BIND queues here (there might be a slight ioctl latency added due to
the hash lookup, but in the normal case it should be insignificant), which
should be OK.
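
A minimal sketch of that create-on-first-use lookup, using an xarray (the
usual index-to-object map in i915) in place of a literal hashmap; all names
here are illustrative, not the actual implementation:

#include <linux/err.h>
#include <linux/slab.h>
#include <linux/types.h>
#include <linux/workqueue.h>
#include <linux/xarray.h>

/* Illustrative only: one ordered wq per user-visible bind queue. */
struct i915_vm_bind_queue {
        struct workqueue_struct *wq;
};

static struct i915_vm_bind_queue *
vm_bind_queue_lookup(struct xarray *queues, u32 queue_idx)
{
        struct i915_vm_bind_queue *q;
        int err;

        q = xa_load(queues, queue_idx);
        if (q)
                return q;       /* fast path: queue already exists */

        q = kzalloc(sizeof(*q), GFP_KERNEL);
        if (!q)
                return ERR_PTR(-ENOMEM);

        /* An ordered wq keeps binds on this queue in submission order. */
        q->wq = alloc_ordered_workqueue("vm_bind_q%u", 0, queue_idx);
        if (!q->wq) {
                kfree(q);
                return ERR_PTR(-ENOMEM);
        }

        err = xa_insert(queues, queue_idx, q, GFP_KERNEL);
        if (err) {
                /* Lost a race (or failed); drop ours, reuse the winner. */
                destroy_workqueue(q->wq);
                kfree(q);
                return err == -EBUSY ? xa_load(queues, queue_idx) :
                                       ERR_PTR(err);
        }

        return q;
}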

Niranjana

>   --Jason
>    
>
>     Niranjana
>
>     >Regards,
>     >
>     >Tvrtko
>     >
>     >>
>     >>Niranjana
>     >>
>     >>>
>     >>>>   I am trying to see how many queues we need and don't want it to
>     be
>     >>>>   arbitrarily
>     >>>>   large and unduly blow up memory usage and complexity in i915
>     driver.
>     >>>>
>     >>>> I expect a Vulkan driver to use at most 2 in the vast majority
>     >>>>of cases. I
>     >>>> could imagine a client wanting to create more than 1 sparse
>     >>>>queue in which
>     >>>> case, it'll be N+1 but that's unlikely.  As far as complexity
>     >>>>goes, once
>     >>>> you allow two, I don't think the complexity is going up by
>     >>>>allowing N.  As
>     >>>> for memory usage, creating more queues means more memory.  That's a
>     >>>> trade-off that userspace can make.  Again, the expected number
>     >>>>here is 1
>     >>>> or 2 in the vast majority of cases so I don't think you need to
>     worry.
>     >>>
>     >>>Ok, will start with n=3 meaning 8 queues.
>     >>>That would require us to create 8 workqueues.
>     >>>We can change 'n' later if required.
>     >>>
>     >>>Niranjana
>     >>>
>     >>>>
>     >>>>   >     Why?  Because Vulkan has two basic kind of bind
>     >>>>operations and we
>     >>>>   don't
>     >>>>   >     want any dependencies between them:
>     >>>>   >      1. Immediate.  These happen right after BO creation or
>     >>>>maybe as
>     >>>>   part of
>     >>>>   >     vkBindImageMemory() or VkBindBufferMemory().  These
>     >>>>don't happen
>     >>>>   on a
>     >>>>   >     queue and we don't want them serialized with anything.  To
>     >>>>   synchronize
>     >>>>   >     with submit, we'll have a syncobj in the VkDevice which is
>     >>>>   signaled by
>     >>>>   >     all immediate bind operations and make submits wait on it.
>     >>>>   >      2. Queued (sparse): These happen on a VkQueue which may be
>     the
>     >>>>   same as
>     >>>>   >     a render/compute queue or may be its own queue.  It's up to
>     us
>     >>>>   what we
>     >>>>   >     want to advertise.  From the Vulkan API PoV, this is like
>     any
>     >>>>   other
>     >>>>   >     queue.  Operations on it wait on and signal semaphores.  If
>     we
>     >>>>   have a
>     >>>>   >     VM_BIND engine, we'd provide syncobjs to wait and
>     >>>>signal just like
>     >>>>   we do
>     >>>>   >     in execbuf().
>     >>>>   >     The important thing is that we don't want one type of
>     >>>>operation to
>     >>>>   block
>     >>>>   >     on the other.  If immediate binds are blocking on sparse
>     binds,
>     >>>>   it's
>     >>>>   >     going to cause over-synchronization issues.
>     >>>>   >     In terms of the internal implementation, I know that
>     >>>>there's going
>     >>>>   to be
>     >>>>   >     a lock on the VM and that we can't actually do these things
>     in
>     >>>>   >     parallel.  That's fine.  Once the dma_fences have signaled
>     and
>     >>>>   we're
>     >>>>
>     >>>>   That's correct. It is like a single VM_BIND engine with
>     >>>>multiple queues
>     >>>>   feeding to it.
>     >>>>
>     >>>> Right.  As long as the queues themselves are independent and
>     >>>>can block on
>     >>>> dma_fences without holding up other queues, I think we're fine.
>     >>>>
>     >>>>   >     unblocked to do the bind operation, I don't care if
>     >>>>there's a bit
>     >>>>   of
>     >>>>   >     synchronization due to locking.  That's expected.  What
>     >>>>we can't
>     >>>>   afford
>     >>>>   >     to have is an immediate bind operation suddenly blocking on
>     a
>     >>>>   sparse
>     >>>>   >     operation which is blocked on a compute job that's going to
>     run
>     >>>>   for
>     >>>>   >     another 5ms.
>     >>>>
>     >>>>   As the VM_BIND queue is per VM, VM_BIND on one VM doesn't block
>     the
>     >>>>   VM_BIND
>     >>>>   on other VMs. I am not sure about usecases here, but just wanted
>     to
>     >>>>   clarify.
>     >>>>
>     >>>> Yes, that's what I would expect.
>     >>>> --Jason
>     >>>>
>     >>>>   Niranjana
>     >>>>
>     >>>>   >     For reference, Windows solves this by allowing arbitrarily
>     many
>     >>>>   paging
>     >>>>   >     queues (what they call a VM_BIND engine/queue).  That
>     >>>>design works
>     >>>>   >     pretty well and solves the problems in question. 
>     >>>>Again, we could
>     >>>>   just
>     >>>>   >     make everything out-of-order and require using syncobjs
>     >>>>to order
>     >>>>   things
>     >>>>   >     as userspace wants. That'd be fine too.
>     >>>>   >     One more note while I'm here: danvet said something on
>     >>>>IRC about
>     >>>>   VM_BIND
>     >>>>   >     queues waiting for syncobjs to materialize.  We don't
>     really
>     >>>>   want/need
>     >>>>   >     this.  We already have all the machinery in userspace to
>     handle
>     >>>>   >     wait-before-signal and waiting for syncobj fences to
>     >>>>materialize
>     >>>>   and
>     >>>>   >     that machinery is on by default.  It would actually
>     >>>>take MORE work
>     >>>>   in
>     >>>>   >     Mesa to turn it off and take advantage of the kernel
>     >>>>being able to
>     >>>>   wait
>     >>>>   >     for syncobjs to materialize.  Also, getting that right is
>     >>>>   ridiculously
>     >>>>   >     hard and I really don't want to get it wrong in kernel
>     >>>>space.     When we
>     >>>>   >     do memory fences, wait-before-signal will be a thing.  We
>     don't
>     >>>>   need to
>     >>>>   >     try and make it a thing for syncobj.
>     >>>>   >     --Jason
>     >>>>   >
>     >>>>   >   Thanks Jason,
>     >>>>   >
>     >>>>   >   I missed the bit in the Vulkan spec that we're allowed to
>     have a
>     >>>>   sparse
>     >>>>   >   queue that does not implement either graphics or compute
>     >>>>operations
>     >>>>   :
>     >>>>   >
>     >>>>   >     "While some implementations may include
>     >>>>   VK_QUEUE_SPARSE_BINDING_BIT
>     >>>>   >     support in queue families that also include
>     >>>>   >
>     >>>>   >      graphics and compute support, other implementations may
>     only
>     >>>>   expose a
>     >>>>   >     VK_QUEUE_SPARSE_BINDING_BIT-only queue
>     >>>>   >
>     >>>>   >      family."
>     >>>>   >
>     >>>>   >   So it can all be a vm_bind engine that just does
>     bind/unbind
>     >>>>   >   operations.
>     >>>>   >
>     >>>>   >   But yes we need another engine for the immediate/non-sparse
>     >>>>   operations.
>     >>>>   >
>     >>>>   >   -Lionel
>     >>>>   >
>     >>>>   >         >
>     >>>>   >       Daniel, any thoughts?
>     >>>>   >
>     >>>>   >       Niranjana
>     >>>>   >
>     >>>>   >       >Matt
>     >>>>   >       >
>     >>>>   >       >>
>     >>>>   >       >> Sorry I noticed this late.
>     >>>>   >       >>
>     >>>>   >       >>
>     >>>>   >       >> -Lionel
>     >>>>   >       >>
>     >>>>   >       >>

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [Intel-gfx] [RFC v3 1/3] drm/doc/rfc: VM_BIND feature design document
@ 2022-06-08 22:48                             ` Niranjana Vishwanathapura
  0 siblings, 0 replies; 121+ messages in thread
From: Niranjana Vishwanathapura @ 2022-06-08 22:48 UTC (permalink / raw)
  To: Jason Ekstrand
  Cc: Intel GFX, Chris Wilson, Thomas Hellstrom,
	Maling list - DRI developers, Daniel Vetter,
	Christian König

On Wed, Jun 08, 2022 at 04:55:38PM -0500, Jason Ekstrand wrote:
>   On Wed, Jun 8, 2022 at 4:44 PM Niranjana Vishwanathapura
>   <niranjana.vishwanathapura@intel.com> wrote:
>
>     On Wed, Jun 08, 2022 at 08:33:25AM +0100, Tvrtko Ursulin wrote:
>     >
>     >
>     >On 07/06/2022 22:32, Niranjana Vishwanathapura wrote:
>     >>On Tue, Jun 07, 2022 at 11:18:11AM -0700, Niranjana Vishwanathapura
>     wrote:
>     >>>On Tue, Jun 07, 2022 at 12:12:03PM -0500, Jason Ekstrand wrote:
>     >>>> On Fri, Jun 3, 2022 at 6:52 PM Niranjana Vishwanathapura
>     >>>> <niranjana.vishwanathapura@intel.com> wrote:
>     >>>>
>     >>>>   On Fri, Jun 03, 2022 at 10:20:25AM +0300, Lionel Landwerlin
>     wrote:
>     >>>>   >   On 02/06/2022 23:35, Jason Ekstrand wrote:
>     >>>>   >
>     >>>>   >     On Thu, Jun 2, 2022 at 3:11 PM Niranjana Vishwanathapura
>     >>>>   >     <niranjana.vishwanathapura@intel.com> wrote:
>     >>>>   >
>     >>>>   >       On Wed, Jun 01, 2022 at 01:28:36PM -0700, Matthew
>     >>>>Brost wrote:
>     >>>>   >       >On Wed, Jun 01, 2022 at 05:25:49PM +0300, Lionel
>     Landwerlin
>     >>>>   wrote:
>     >>>>   >       >> On 17/05/2022 21:32, Niranjana Vishwanathapura wrote:
>     >>>>   >       >> > +VM_BIND/UNBIND ioctl will immediately start
>     >>>>   binding/unbinding
>     >>>>   >       the mapping in an
>     >>>>   >       >> > +async worker. The binding and unbinding will
>     >>>>work like a
>     >>>>   special
>     >>>>   >       GPU engine.
>     >>>>   >       >> > +The binding and unbinding operations are serialized
>     and
>     >>>>   will
>     >>>>   >       wait on specified
>     >>>>   >       >> > +input fences before the operation and will signal
>     the
>     >>>>   output
>     >>>>   >       fences upon the
>     >>>>   >       >> > +completion of the operation. Due to serialization,
>     >>>>   completion of
>     >>>>   >       an operation
>     >>>>   >       >> > +will also indicate that all previous operations
>     >>>>are also
>     >>>>   >       complete.
>     >>>>   >       >>
>     >>>>   >       >> I guess we should avoid saying "will immediately start
>     >>>>   >       binding/unbinding" if
>     >>>>   >       >> there are fences involved.
>     >>>>   >       >>
>     >>>>   >       >> And the fact that it's happening in an async
>     >>>>worker seem to
>     >>>>   imply
>     >>>>   >       it's not
>     >>>>   >       >> immediate.
>     >>>>   >       >>
>     >>>>   >
>     >>>>   >       Ok, will fix.
>     >>>>   >       This was added because in earlier design binding was
>     deferred
>     >>>>   until
>     >>>>   >       next execbuff.
>     >>>>   >       But now it is non-deferred (immediate in that sense).
>     >>>>But yah,
>     >>>>   this is
>     >>>>   >       confusing
>     >>>>   >       and will fix it.
>     >>>>   >
>     >>>>   >       >>
>     >>>>   >       >> I have a question on the behavior of the bind
>     >>>>operation when
>     >>>>   no
>     >>>>   >       input fence
>     >>>>   >       >> is provided. Let say I do :
>     >>>>   >       >>
>     >>>>   >       >> VM_BIND (out_fence=fence1)
>     >>>>   >       >>
>     >>>>   >       >> VM_BIND (out_fence=fence2)
>     >>>>   >       >>
>     >>>>   >       >> VM_BIND (out_fence=fence3)
>     >>>>   >       >>
>     >>>>   >       >>
>     >>>>   >       >> In what order are the fences going to be signaled?
>     >>>>   >       >>
>     >>>>   >       >> In the order of VM_BIND ioctls? Or out of order?
>     >>>>   >       >>
>     >>>>   >       >> Because you wrote "serialized I assume it's : in order
>     >>>>   >       >>
>     >>>>   >
>     >>>>   >       Yes, in the order of VM_BIND/UNBIND ioctls. Note that
>     >>>>bind and
>     >>>>   unbind
>     >>>>   >       will use
>     >>>>   >       the same queue and hence are ordered.
>     >>>>   >
>     >>>>   >       >>
>     >>>>   >       >> One thing I didn't realize is that because we only get
>     one
>     >>>>   >       "VM_BIND" engine,
>     >>>>   >       >> there is a disconnect from the Vulkan specification.
>     >>>>   >       >>
>     >>>>   >       >> In Vulkan VM_BIND operations are serialized but
>     >>>>per engine.
>     >>>>   >       >>
>     >>>>   >       >> So you could have something like this :
>     >>>>   >       >>
>     >>>>   >       >> VM_BIND (engine=rcs0, in_fence=fence1,
>     out_fence=fence2)
>     >>>>   >       >>
>     >>>>   >       >> VM_BIND (engine=ccs0, in_fence=fence3,
>     out_fence=fence4)
>     >>>>   >       >>
>     >>>>   >       >>
>     >>>>   >       >> fence1 is not signaled
>     >>>>   >       >>
>     >>>>   >       >> fence3 is signaled
>     >>>>   >       >>
>     >>>>   >       >> So the second VM_BIND will proceed before the
>     >>>>first VM_BIND.
>     >>>>   >       >>
>     >>>>   >       >>
>     >>>>   >       >> I guess we can deal with that scenario in
>     >>>>userspace by doing
>     >>>>   the
>     >>>>   >       wait
>     >>>>   >       >> ourselves in one thread per engines.
>     >>>>   >       >>
>     >>>>   >       >> But then it makes the VM_BIND input fences useless.
>     >>>>   >       >>
>     >>>>   >       >>
>     >>>>   >       >> Daniel: what do you think? Should we rework this or
>     just
>     >>>>   deal with
>     >>>>   >       wait
>     >>>>   >       >> fences in userspace?
>     >>>>   >       >>
>     >>>>   >       >
>     >>>>   >       >My opinion is rework this but make the ordering via
>     >>>>an engine
>     >>>>   param
>     >>>>   >       optional.
>     >>>>   >       >
>     >>>>   >       >e.g. A VM can be configured so all binds are ordered
>     >>>>within the
>     >>>>   VM
>     >>>>   >       >
>     >>>>   >       >e.g. A VM can be configured so all binds accept an
>     engine
>     >>>>   argument
>     >>>>   >       (in
>     >>>>   >       >the case of the i915 likely this is a gem context
>     >>>>handle) and
>     >>>>   binds
>     >>>>   >       >are ordered with respect to that engine.
>     >>>>   >       >
>     >>>>   >       >This gives UMDs options as the latter likely consumes
>     >>>>more KMD
>     >>>>   >       resources
>     >>>>   >       >so if a different UMD can live with binds being
>     >>>>ordered within
>     >>>>   the VM
>     >>>>   >       >they can use a mode consuming less resources.
>     >>>>   >       >
>     >>>>   >
>     >>>>   >       I think we need to be careful here if we are looking for
>     some
>     >>>>   out of
>     >>>>   >       (submission) order completion of vm_bind/unbind.
>     >>>>   >       In-order completion means, in a batch of binds and
>     >>>>unbinds to be
>     >>>>   >       completed in-order, user only needs to specify
>     >>>>in-fence for the
>     >>>>   >       first bind/unbind call and the out-fence for the last
>     >>>>   bind/unbind
>     >>>>   >       call. Also, the VA released by an unbind call can be
>     >>>>re-used by
>     >>>>   >       any subsequent bind call in that in-order batch.
>     >>>>   >
>     >>>>   >       These things will break if binding/unbinding were to
>     >>>>be allowed
>     >>>>   to
>     >>>>   >       go out of order (of submission) and the user needs to be
>     >>>>   >       extra careful
>     >>>>   >       not to run into premature triggering of out-fence and
>     bind
>     >>>>   failing
>     >>>>   >       as VA is still in use etc.
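
A minimal sketch of the in-order usage described above, assuming a
hypothetical vm_bind() wrapper around DRM_IOCTL_I915_GEM_VM_BIND that
attaches the in/out fences from this RFC (none of these helper names
are real uapi):

    /* Only the first bind carries an in-fence and only the last carries
     * an out-fence; the serialized queue guarantees the middle one. */
    vm_bind(fd, vm_id, bo_a, va_a, in_syncobj, 0);
    vm_bind(fd, vm_id, bo_b, va_b, 0, 0);
    vm_bind(fd, vm_id, bo_c, va_c, 0, out_syncobj);
    /* When out_syncobj signals, all three binds are complete, and any
     * VA released by an earlier unbind in the batch is reusable. */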
>     >>>>   >
>     >>>>   >       Also, VM_BIND binds the provided mapping on the specified
>     >>>>   address
>     >>>>   >       space
>     >>>>   >       (VM). So, the uapi is not engine/context specific.
>     >>>>   >
>     >>>>   >       We can however add a 'queue' to the uapi which can be
>     >>>>one from
>     >>>>   the
>     >>>>   >       pre-defined queues,
>     >>>>   >       I915_VM_BIND_QUEUE_0
>     >>>>   >       I915_VM_BIND_QUEUE_1
>     >>>>   >       ...
>     >>>>   >       I915_VM_BIND_QUEUE_(N-1)
>     >>>>   >
>     >>>>   >       KMD will spawn an async work queue for each queue which
>     will
>     >>>>   only
>     >>>>   >       bind the mappings on that queue in the order of
>     submission.
>     >>>>   >       User can assign a queue per engine or anything
>     >>>>like that.
>     >>>>   >
>     >>>>   >       But again here, the user needs to be careful and not
>     >>>>deadlock these
>     >>>>   >       queues with circular dependency of fences.
>     >>>>   >
>     >>>>   >       I prefer adding this later as an extension based on
>     >>>>whether it
>     >>>>   >       is really helping with the implementation.
>     >>>>   >
>     >>>>   >     I can tell you right now that having everything on a single
>     >>>>   in-order
>     >>>>   >     queue will not get us the perf we want.  What vulkan
>     >>>>really wants
>     >>>>   is one
>     >>>>   >     of two things:
>     >>>>   >      1. No implicit ordering of VM_BIND ops.  They just happen
>     >>>>   >     in whatever order their dependencies are resolved, and we
>     >>>>   >     ensure ordering ourselves by having a syncobj in the VkQueue.
>     >>>>   >      2. The ability to create multiple VM_BIND queues.  We need
>     at
>     >>>>   least 2
>     >>>>   >     but I don't see why there needs to be a limit besides
>     >>>>the limits
>     >>>>   the
>     >>>>   >     i915 API already has on the number of engines.  Vulkan
>     could
>     >>>>   expose
>     >>>>   >     multiple sparse binding queues to the client if it's not
>     >>>>   arbitrarily
>     >>>>   >     limited.
>     >>>>
>     >>>>   Thanks Jason, Lionel.
>     >>>>
>     >>>>   Jason, what are you referring to when you say "limits the i915
>     API
>     >>>>   already
>     >>>>   has on the number of engines"? I am not sure if there is such an
>     uapi
>     >>>>   today.
>     >>>>
>     >>>> There's a limit of something like 64 total engines today based on
>     the
>     >>>> number of bits we can cram into the exec flags in execbuffer2.  I
>     think
>     >>>> someone had an extended version that allowed more but I ripped it
>     out
>     >>>> because no one was using it.  Of course, execbuffer3 might not
>     >>>>have that
>     >>>> problem at all.
>     >>>>
>     >>>
>     >>>Thanks Jason.
>     >>>Ok, I am not sure which exec flag is that, but yah, execbuffer3
>     probably
>     >>>will not have this limitation. So, we need to define a
>     VM_BIND_MAX_QUEUE
>     >>>and somehow export it to user (I am thinking of embedding it in
>     >>>I915_PARAM_HAS_VM_BIND: bits[0]->HAS_VM_BIND, bits[1-3]->'n', meaning
>     2^n
>     >>>queues).
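
For illustration, a userspace fragment decoding that proposed encoding
(the bit layout is only the suggestion above, not merged uapi; assumes
the usual drm headers and an open DRM fd):

    int value = 0;
    struct drm_i915_getparam gp = {
            .param = I915_PARAM_HAS_VM_BIND,        /* 57 in this RFC */
            .value = &value,
    };

    if (ioctl(fd, DRM_IOCTL_I915_GETPARAM, &gp) == 0 && (value & 0x1)) {
            unsigned int n = (value >> 1) & 0x7;    /* bits[1-3] */
            unsigned int num_queues = 1u << n;      /* 2^n bind queues */
            /* ... */
    }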
>     >>
>     >>Ah, I think you are talking about I915_EXEC_RING_MASK (0x3f) which
>     execbuf3
>
>   Yup!  That's exactly the limit I was talking about.
>    
>
>     >>will also have. So, we can simply define in vm_bind/unbind structures,
>     >>
>     >>#define I915_VM_BIND_MAX_QUEUE   64
>     >>        __u32 queue;
>     >>
>     >>I think that will keep things simple.
>     >
>     >Hmmm? What does the execbuf2 limit have to do with how many engines
>     >the hardware can have? I suggest not to do that.
>     >
>     >The change which added this:
>     >
>     >       if (set.num_engines > I915_EXEC_RING_MASK + 1)
>     >               return -EINVAL;
>     >
>     >to context creation needs to be undone, so let users create engine
>     >maps with all hardware engines, and let execbuf3 access them all.
>     >
>
>     The earlier plan was to carry I915_EXEC_RING_MASK (0x3f) to execbuff3 also.
>     Hence, I was using the same limit for VM_BIND queues (64, or 65 if we
>     make it N+1).
>     But, as discussed in another thread of this RFC series, we are planning
>     to drop this I915_EXEC_RING_MASK in execbuff3. So, there won't be
>     any uapi that limits the number of engines (and hence the number of
>     vm_bind queues that need to be supported).
>
>     If we leave the number of vm_bind queues to be arbitrarily large
>     (__u32 queue_idx), then we need a hashmap for queue (a wq,
>     work_item and a linked list) lookup from the user-specified queue index.
>     The other option is to just put some hard limit (say 64 or 65) and use
>     an array of queues in the VM (each created upon first use). I prefer
>     the latter.
>
>   I don't get why a VM_BIND queue is any different from any other queue or
>   userspace-visible kernel object.  But I'll leave those details up to
>   danvet or whoever else might be reviewing the implementation.

In execbuf3, if the user-specified execbuf3.engine_id is beyond the number of
available engines on the gem context, an error is returned to the user.
In the VM_BIND case, I am not sure how to do that bounds check on the
user-specified queue_idx.

In any case, it is an implementation detail and we can use a hashmap for
the VM_BIND queues here (there might be a slight ioctl latency added due to
the hash lookup, but in the normal case it should be insignificant).
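
As a rough sketch of that hashmap option (illustrative only, not actual
i915 code; the bind_queues hashtable on the VM and the helper name are
made up):

	struct i915_vm_bind_queue {
		u32 queue_idx;
		struct workqueue_struct *wq;
		struct hlist_node node;
	};

	/* Look up, or lazily create, the ordered workqueue backing a
	 * user-specified queue index. Locking elided for brevity. */
	static struct i915_vm_bind_queue *
	vm_bind_queue_get(struct i915_address_space *vm, u32 queue_idx)
	{
		struct i915_vm_bind_queue *q;

		hash_for_each_possible(vm->bind_queues, q, node, queue_idx)
			if (q->queue_idx == queue_idx)
				return q;

		q = kzalloc(sizeof(*q), GFP_KERNEL);
		if (!q)
			return ERR_PTR(-ENOMEM);

		q->queue_idx = queue_idx;
		q->wq = alloc_ordered_workqueue("i915-vm-bind-%u", 0, queue_idx);
		if (!q->wq) {
			kfree(q);
			return ERR_PTR(-ENOMEM);
		}

		hash_add(vm->bind_queues, &q->node, queue_idx);
		return q;
	}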

Niranjana

>   --Jason
>    
>
>     Niranjana
>
>     >Regards,
>     >
>     >Tvrtko
>     >
>     >>
>     >>Niranjana
>     >>
>     >>>
>     >>>>   I am trying to see how many queues we need and don't want it to
>     be
>     >>>>   arbitrarily
>     >>>>   large and unduly blow up memory usage and complexity in the i915
>     driver.
>     >>>>
>     >>>> I expect a Vulkan driver to use at most 2 in the vast majority
>     >>>>of cases. I
>     >>>> could imagine a client wanting to create more than 1 sparse
>     >>>>queue in which
>     >>>> case, it'll be N+1 but that's unlikely.  As far as complexity
>     >>>>goes, once
>     >>>> you allow two, I don't think the complexity is going up by
>     >>>>allowing N.  As
>     >>>> for memory usage, creating more queues means more memory.  That's a
>     >>>> trade-off that userspace can make.  Again, the expected number
>     >>>>here is 1
>     >>>> or 2 in the vast majority of cases so I don't think you need to
>     worry.
>     >>>
>     >>>Ok, will start with n=3 meaning 8 queues.
>     >>>That would require us to create 8 workqueues.
>     >>>We can change 'n' later if required.
>     >>>
>     >>>Niranjana
>     >>>
>     >>>>
>     >>>>   >     Why?  Because Vulkan has two basic kinds of bind
>     >>>>operations and we
>     >>>>   don't
>     >>>>   >     want any dependencies between them:
>     >>>>   >      1. Immediate.  These happen right after BO creation or
>     >>>>maybe as
>     >>>>   part of
>     >>>>   >     vkBindImageMemory() or VkBindBufferMemory().  These
>     >>>>don't happen
>     >>>>   on a
>     >>>>   >     queue and we don't want them serialized with anything.  To
>     >>>>   synchronize
>     >>>>   >     with submit, we'll have a syncobj in the VkDevice which is
>     >>>>   signaled by
>     >>>>   >     all immediate bind operations and make submits wait on it.
>     >>>>   >      2. Queued (sparse): These happen on a VkQueue which may be
>     the
>     >>>>   same as
>     >>>>   >     a render/compute queue or may be its own queue.  It's up to
>     us
>     >>>>   what we
>     >>>>   >     want to advertise.  From the Vulkan API PoV, this is like
>     any
>     >>>>   other
>     >>>>   >     queue.  Operations on it wait on and signal semaphores.  If
>     we
>     >>>>   have a
>     >>>>   >     VM_BIND engine, we'd provide syncobjs to wait and
>     >>>>signal just like
>     >>>>   we do
>     >>>>   >     in execbuf().
>     >>>>   >     The important thing is that we don't want one type of
>     >>>>operation to
>     >>>>   block
>     >>>>   >     on the other.  If immediate binds are blocking on sparse
>     binds,
>     >>>>   it's
>     >>>>   >     going to cause over-synchronization issues.
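
A rough sketch of the pattern described in point 1, with hypothetical
helper names (vm_bind_async() standing in for a VM_BIND ioctl that
signals a timeline syncobj, execbuf_wait_syncobj() for a submit that
waits on it):

    /* Every immediate bind signals the next point on one device-wide
     * timeline syncobj... */
    uint64_t point = ++device->bind_timeline_point;
    vm_bind_async(fd, vm_id, bo, va, device->bind_syncobj, point);

    /* ...and every submit waits on the latest point, so execution never
     * races a pending immediate bind, while sparse-queue binds stay on
     * their own fences and cannot stall this path. */
    execbuf_wait_syncobj(fd, device->bind_syncobj, point);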
>     >>>>   >     In terms of the internal implementation, I know that
>     >>>>there's going
>     >>>>   to be
>     >>>>   >     a lock on the VM and that we can't actually do these things
>     in
>     >>>>   >     parallel.  That's fine.  Once the dma_fences have signaled
>     and
>     >>>>   we're
>     >>>>
>     >>>>   That's correct. It is like a single VM_BIND engine with
>     >>>>multiple queues
>     >>>>   feeding to it.
>     >>>>
>     >>>> Right.  As long as the queues themselves are independent and
>     >>>>can block on
>     >>>> dma_fences without holding up other queues, I think we're fine.
>     >>>>
>     >>>>   >     unblocked to do the bind operation, I don't care if
>     >>>>there's a bit
>     >>>>   of
>     >>>>   >     synchronization due to locking.  That's expected.  What
>     >>>>we can't
>     >>>>   afford
>     >>>>   >     to have is an immediate bind operation suddenly blocking on
>     a
>     >>>>   sparse
>     >>>>   >     operation which is blocked on a compute job that's going to
>     run
>     >>>>   for
>     >>>>   >     another 5ms.
>     >>>>
>     >>>>   As the VM_BIND queue is per VM, VM_BIND on one VM doesn't block
>     the
>     >>>>   VM_BIND
>     >>>>   on other VMs. I am not sure about use cases here, but just wanted
>     to
>     >>>>   clarify.
>     >>>>
>     >>>> Yes, that's what I would expect.
>     >>>> --Jason
>     >>>>
>     >>>>   Niranjana
>     >>>>
>     >>>>   >     For reference, Windows solves this by allowing arbitrarily
>     many
>     >>>>   paging
>     >>>>   >     queues (what they call a VM_BIND engine/queue).  That
>     >>>>design works
>     >>>>   >     pretty well and solves the problems in question. 
>     >>>>Again, we could
>     >>>>   just
>     >>>>   >     make everything out-of-order and require using syncobjs
>     >>>>to order
>     >>>>   things
>     >>>>   >     as userspace wants. That'd be fine too.
>     >>>>   >     One more note while I'm here: danvet said something on
>     >>>>IRC about
>     >>>>   VM_BIND
>     >>>>   >     queues waiting for syncobjs to materialize.  We don't
>     really
>     >>>>   want/need
>     >>>>   >     this.  We already have all the machinery in userspace to
>     handle
>     >>>>   >     wait-before-signal and waiting for syncobj fences to
>     >>>>materialize
>     >>>>   and
>     >>>>   >     that machinery is on by default.  It would actually
>     >>>>take MORE work
>     >>>>   in
>     >>>>   >     Mesa to turn it off and take advantage of the kernel
>     >>>>being able to
>     >>>>   wait
>     >>>>   >     for syncobjs to materialize.  Also, getting that right is
>     >>>>   ridiculously
>     >>>>   >     hard and I really don't want to get it wrong in kernel
>     >>>>space. When we
>     >>>>   >     do memory fences, wait-before-signal will be a thing.  We
>     don't
>     >>>>   need to
>     >>>>   >     try and make it a thing for syncobj.
>     >>>>   >     --Jason
>     >>>>   >
>     >>>>   >   Thanks Jason,
>     >>>>   >
>     >>>>   >   I missed the bit in the Vulkan spec that we're allowed to
>     have a
>     >>>>   sparse
>     >>>>   >   queue that does not implement either graphics or compute
>     >>>>operations:
>     >>>>   >
>     >>>>   >     "While some implementations may include
>     >>>>   VK_QUEUE_SPARSE_BINDING_BIT
>     >>>>   >     support in queue families that also include
>     >>>>   >
>     >>>>   >      graphics and compute support, other implementations may
>     only
>     >>>>   expose a
>     >>>>   >     VK_QUEUE_SPARSE_BINDING_BIT-only queue
>     >>>>   >
>     >>>>   >      family."
>     >>>>   >
>     >>>>   >   So it can all be a vm_bind engine that just does
>     bind/unbind
>     >>>>   >   operations.
>     >>>>   >
>     >>>>   >   But yes we need another engine for the immediate/non-sparse
>     >>>>   operations.
>     >>>>   >
>     >>>>   >   -Lionel
>     >>>>   >
>     >>>>   >         >
>     >>>>   >       Daniel, any thoughts?
>     >>>>   >
>     >>>>   >       Niranjana
>     >>>>   >
>     >>>>   >       >Matt
>     >>>>   >       >
>     >>>>   >       >>
>     >>>>   >       >> Sorry I noticed this late.
>     >>>>   >       >>
>     >>>>   >       >>
>     >>>>   >       >> -Lionel
>     >>>>   >       >>
>     >>>>   >       >>

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [Intel-gfx] [RFC v3 3/3] drm/doc/rfc: VM_BIND uapi definition
  2022-06-08 21:32             ` Niranjana Vishwanathapura
@ 2022-06-09  8:36               ` Matthew Auld
  -1 siblings, 0 replies; 121+ messages in thread
From: Matthew Auld @ 2022-06-09  8:36 UTC (permalink / raw)
  To: Niranjana Vishwanathapura
  Cc: Tvrtko Ursulin, intel-gfx, chris.p.wilson, thomas.hellstrom,
	dri-devel, daniel.vetter, christian.koenig

On 08/06/2022 22:32, Niranjana Vishwanathapura wrote:
> On Wed, Jun 08, 2022 at 10:12:05AM +0100, Matthew Auld wrote:
>> On 08/06/2022 08:17, Tvrtko Ursulin wrote:
>>>
>>> On 07/06/2022 20:37, Niranjana Vishwanathapura wrote:
>>>> On Tue, Jun 07, 2022 at 11:27:14AM +0100, Tvrtko Ursulin wrote:
>>>>>
>>>>> On 17/05/2022 19:32, Niranjana Vishwanathapura wrote:
>>>>>> VM_BIND and related uapi definitions
>>>>>>
>>>>>> v2: Ensure proper kernel-doc formatting with cross references.
>>>>>>     Also add new uapi and documentation as per review comments
>>>>>>     from Daniel.
>>>>>>
>>>>>> Signed-off-by: Niranjana Vishwanathapura 
>>>>>> <niranjana.vishwanathapura@intel.com>
>>>>>> ---
>>>>>>  Documentation/gpu/rfc/i915_vm_bind.h | 399 
>>>>>> +++++++++++++++++++++++++++
>>>>>>  1 file changed, 399 insertions(+)
>>>>>>  create mode 100644 Documentation/gpu/rfc/i915_vm_bind.h
>>>>>>
>>>>>> diff --git a/Documentation/gpu/rfc/i915_vm_bind.h 
>>>>>> b/Documentation/gpu/rfc/i915_vm_bind.h
>>>>>> new file mode 100644
>>>>>> index 000000000000..589c0a009107
>>>>>> --- /dev/null
>>>>>> +++ b/Documentation/gpu/rfc/i915_vm_bind.h
>>>>>> @@ -0,0 +1,399 @@
>>>>>> +/* SPDX-License-Identifier: MIT */
>>>>>> +/*
>>>>>> + * Copyright © 2022 Intel Corporation
>>>>>> + */
>>>>>> +
>>>>>> +/**
>>>>>> + * DOC: I915_PARAM_HAS_VM_BIND
>>>>>> + *
>>>>>> + * VM_BIND feature availability.
>>>>>> + * See typedef drm_i915_getparam_t param.
>>>>>> + */
>>>>>> +#define I915_PARAM_HAS_VM_BIND        57
>>>>>> +
>>>>>> +/**
>>>>>> + * DOC: I915_VM_CREATE_FLAGS_USE_VM_BIND
>>>>>> + *
>>>>>> + * Flag to opt-in for VM_BIND mode of binding during VM creation.
>>>>>> + * See struct drm_i915_gem_vm_control flags.
>>>>>> + *
>>>>>> + * A VM in VM_BIND mode will not support the older execbuff mode 
>>>>>> of binding.
>>>>>> + * In VM_BIND mode, execbuff ioctl will not accept any execlist 
>>>>>> (ie., the
>>>>>> + * &drm_i915_gem_execbuffer2.buffer_count must be 0).
>>>>>> + * Also, &drm_i915_gem_execbuffer2.batch_start_offset and
>>>>>> + * &drm_i915_gem_execbuffer2.batch_len must be 0.
>>>>>> + * DRM_I915_GEM_EXECBUFFER_EXT_BATCH_ADDRESSES extension must be 
>>>>>> provided
>>>>>> + * to pass in the batch buffer addresses.
>>>>>> + *
>>>>>> + * Additionally, I915_EXEC_NO_RELOC, I915_EXEC_HANDLE_LUT and
>>>>>> + * I915_EXEC_BATCH_FIRST of &drm_i915_gem_execbuffer2.flags must 
>>>>>> be 0
>>>>>> + * (not used) in VM_BIND mode. I915_EXEC_USE_EXTENSIONS flag must 
>>>>>> always be
>>>>>> + * set (See struct drm_i915_gem_execbuffer_ext_batch_addresses).
>>>>>> + * The buffers_ptr, buffer_count, batch_start_offset and 
>>>>>> batch_len fields
>>>>>> + * of struct drm_i915_gem_execbuffer2 are also not used and must 
>>>>>> be 0.
>>>>>> + */
>>>>>> +#define I915_VM_CREATE_FLAGS_USE_VM_BIND    (1 << 0)
>>>>>> +
>>>>>> +/**
>>>>>> + * DOC: I915_CONTEXT_CREATE_FLAGS_LONG_RUNNING
>>>>>> + *
>>>>>> + * Flag to declare context as long running.
>>>>>> + * See struct drm_i915_gem_context_create_ext flags.
>>>>>> + *
>>>>>> + * Usage of dma-fences expects that they complete in a reasonable 
>>>>>> amount of time.
>>>>>> + * Compute on the other hand can be long running. Hence it is not 
>>>>>> appropriate
>>>>>> + * for compute contexts to export request completion dma-fence to 
>>>>>> user.
>>>>>> + * The dma-fence usage will be limited to in-kernel consumption 
>>>>>> only.
>>>>>> + * Compute contexts need to use user/memory fence.
>>>>>> + *
>>>>>> + * So, long running contexts do not support output fences. Hence,
>>>>>> + * I915_EXEC_FENCE_OUT (See &drm_i915_gem_execbuffer2.flags) and
>>>>>> + * I915_EXEC_FENCE_SIGNAL (See &drm_i915_gem_exec_fence.flags) are
>>>>>> + * expected not to be used.
>>>>>> + *
>>>>>> + * DRM_I915_GEM_WAIT ioctl call is also not supported for objects 
>>>>>> mapped
>>>>>> + * to long running contexts.
>>>>>> + */
>>>>>> +#define I915_CONTEXT_CREATE_FLAGS_LONG_RUNNING   (1u << 2)
>>>>>> +
>>>>>> +/* VM_BIND related ioctls */
>>>>>> +#define DRM_I915_GEM_VM_BIND        0x3d
>>>>>> +#define DRM_I915_GEM_VM_UNBIND        0x3e
>>>>>> +#define DRM_I915_GEM_WAIT_USER_FENCE    0x3f
>>>>>> +
>>>>>> +#define DRM_IOCTL_I915_GEM_VM_BIND DRM_IOWR(DRM_COMMAND_BASE + 
>>>>>> DRM_I915_GEM_VM_BIND, struct drm_i915_gem_vm_bind)
>>>>>> +#define DRM_IOCTL_I915_GEM_VM_UNBIND DRM_IOWR(DRM_COMMAND_BASE + 
>>>>>> DRM_I915_GEM_VM_UNBIND, struct drm_i915_gem_vm_unbind)
>>>>>> +#define DRM_IOCTL_I915_GEM_WAIT_USER_FENCE 
>>>>>> DRM_IOWR(DRM_COMMAND_BASE + DRM_I915_GEM_WAIT_USER_FENCE, struct 
>>>>>> drm_i915_gem_wait_user_fence)
>>>>>> +
>>>>>> +/**
>>>>>> + * struct drm_i915_gem_vm_bind - VA to object mapping to bind.
>>>>>> + *
>>>>>> + * This structure is passed to VM_BIND ioctl and specifies the 
>>>>>> mapping of GPU
>>>>>> + * virtual address (VA) range to the section of an object that 
>>>>>> should be bound
>>>>>> + * in the device page table of the specified address space (VM).
>>>>>> + * The VA range specified must be unique (ie., not currently 
>>>>>> bound) and can
>>>>>> + * be mapped to whole object or a section of the object (partial 
>>>>>> binding).
>>>>>> + * Multiple VA mappings can be created to the same section of the 
>>>>>> object
>>>>>> + * (aliasing).
>>>>>> + */
>>>>>> +struct drm_i915_gem_vm_bind {
>>>>>> +    /** @vm_id: VM (address space) id to bind */
>>>>>> +    __u32 vm_id;
>>>>>> +
>>>>>> +    /** @handle: Object handle */
>>>>>> +    __u32 handle;
>>>>>> +
>>>>>> +    /** @start: Virtual Address start to bind */
>>>>>> +    __u64 start;
>>>>>> +
>>>>>> +    /** @offset: Offset in object to bind */
>>>>>> +    __u64 offset;
>>>>>> +
>>>>>> +    /** @length: Length of mapping to bind */
>>>>>> +    __u64 length;
>>>>>
>>>>> Does it support, or should it, an equivalent of 
>>>>> EXEC_OBJECT_PAD_TO_SIZE? Or, if not, is userspace expected to map the 
>>>>> remainder of the space to a dummy object? In which case, would there 
>>>>> be any alignment/padding issues preventing the two binds from being 
>>>>> placed next to each other?
>>>>>
>>>>> I ask because someone from the compute side asked me about a 
>>>>> problem with their strategy of dealing with overfetch and I 
>>>>> suggested pad to size.
>>>>>
>>>>
>>>> Thanks Tvrtko,
>>>> I think we shouldn't be needing it. As with VM_BIND VA assignment
>>>> is completely pushed to userspace, no padding should be necessary
>>>> once the 'start' and 'size' alignment conditions are met.
>>>>
>>>> I will add some documentation on alignment requirement here.
>>>> Generally, 'start' and 'size' should be 4K aligned. But, I think
>>>> when we have 64K lmem page sizes (dg2 and xehpsdv), they need to
>>>> be 64K aligned.
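
For illustration, a fragment showing how userspace might honour those
rules when filling this RFC's struct drm_i915_gem_vm_bind (the lmem
check and the va_hint/bo_size inputs are placeholders):

    /* 64K granularity for lmem-backed objects on dg2/xehpsdv, 4K
     * otherwise; both start and length must honour it. */
    __u64 align = obj_in_lmem ? 0x10000 : 0x1000;

    struct drm_i915_gem_vm_bind bind = {
            .vm_id  = vm_id,
            .handle = bo_handle,
            .start  = (va_hint + align - 1) & ~(align - 1),
            .offset = 0,
            .length = (bo_size + align - 1) & ~(align - 1),
    };

    ioctl(fd, DRM_IOCTL_I915_GEM_VM_BIND, &bind);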
>>>
>>> + Matt
>>>
>>> Is aligning to 64k enough for all overfetch issues?
>>>
>>> Apparently compute has a situation where a buffer is received by one 
>>> component and another has to apply more alignment to it, to deal with 
>>> overfetch. Since they cannot grow the actual BO if they wanted to 
>>> VM_BIND a scratch area on top? Or perhaps none of this is a problem 
>>> on discrete and the original BO should be correctly allocated to start with.
>>>
>>> Side question - what about the align to 2MiB mentioned in 
>>> i915_vma_insert to avoid mixing 4k and 64k PTEs? That does not apply 
>>> to discrete?
>>
>> Not sure about the overfetch thing, but yeah dg2 & xehpsdv both 
>> require a minimum of 64K pages underneath for local memory, and the BO 
>> size will also be rounded up accordingly. And yeah the complication 
>> arises due to not being able to mix 4K + 64K GTT pages within the same 
>> page-table (existed since even gen8). Note that 4K here is what we 
>> typically get for system memory.
>>
>> Originally we had a memory coloring scheme to track the "color" of 
>> each page-table, which basically ensures that userspace can't do 
>> something nasty like mixing page sizes. The advantage of that scheme 
>> is that we would only require 64K GTT alignment and no extra padding, 
>> but is perhaps a little complex.
>>
>> The merged solution is just to align and pad out (i.e. vma->node.size and 
>> not vma->size) the vma to 2M, which is dead simple 
>> implementation wise, but does potentially waste some GTT space and 
>> some of the local memory used for the actual page-table. For the 
>> alignment the kernel just validates that the GTT address is aligned to 
>> 2M in vma_insert(), and then for the padding it just inflates it to 
>> 2M, if userspace hasn't already.
>>
>> See the kernel-doc for @size: 
>> https://dri.freedesktop.org/docs/drm/gpu/driver-uapi.html?#c.drm_i915_gem_create_ext 
>>
>>
> 
> Ok, those requirements (2M VA alignment) will apply to VM_BIND also.
> This is unfortunate, but it is not something new enforced by VM_BIND.
> The other option is to go with 64K alignment and, in the VM_BIND case, the
> user must ensure there is no mixing of 64K (lmem) and 4k (smem)
> mappings in the same 2M range. But this is not VM_BIND specific
> (will apply to soft-pinning in execbuf2 also).
> 
> I don't think we need any VA padding here as with VM_BIND VA is
> managed fully by the user. If we enforce VA to be 2M aligned, it
> will leave holes (if BOs are smaller than 2M), but nobody is going
> to allocate anything from there.

Note that we only apply the 2M alignment + padding for local memory 
pages, for system memory we don't have/need such restrictions. The VA 
padding then importantly prevents userspace from incorrectly (or 
maliciously) inserting a 4K system memory object into some page-table 
operating in 64K GTT mode.
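
Read from the userspace side, that constraint amounts to a 2M-granule
rule; a sketch (all names invented, granule bookkeeping left to the VA
allocator):

    #define VA_GRANULE (2ull << 20)     /* matches the 2M padding above */

    struct va_granule {
            uint32_t bindings;  /* live mappings in this 2M VA range */
            bool is_lmem;       /* page-size class already present   */
    };

    /* A new mapping may land in a granule only if the granule is empty
     * or already holds the same memory class, so 4K (smem) and 64K
     * (lmem) PTEs never end up sharing one page-table. */
    static bool can_place(const struct va_granule *g, bool is_lmem)
    {
            return g->bindings == 0 || g->is_lmem == is_lmem;
    }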

> 
> Niranjana
> 
>>>
>>> Regards,
>>>
>>> Tvrtko
>>>
>>>>
>>>> Niranjana
>>>>
>>>>> Regards,
>>>>>
>>>>> Tvrtko
>>>>>
>>>>>> +
>>>>>> +    /**
>>>>>> +     * @flags: Supported flags are,
>>>>>> +     *
>>>>>> +     * I915_GEM_VM_BIND_READONLY:
>>>>>> +     * Mapping is read-only.
>>>>>> +     *
>>>>>> +     * I915_GEM_VM_BIND_CAPTURE:
>>>>>> +     * Capture this mapping in the dump upon GPU error.
>>>>>> +     */
>>>>>> +    __u64 flags;
>>>>>> +#define I915_GEM_VM_BIND_READONLY    (1 << 0)
>>>>>> +#define I915_GEM_VM_BIND_CAPTURE     (1 << 1)
>>>>>> +
>>>>>> +    /** @extensions: 0-terminated chain of extensions for this 
>>>>>> mapping. */
>>>>>> +    __u64 extensions;
>>>>>> +};
>>>>>> +
>>>>>> +/**
>>>>>> + * struct drm_i915_gem_vm_unbind - VA to object mapping to unbind.
>>>>>> + *
>>>>>> + * This structure is passed to VM_UNBIND ioctl and specifies the 
>>>>>> GPU virtual
>>>>>> + * address (VA) range that should be unbound from the device page 
>>>>>> table of the
>>>>>> + * specified address space (VM). The specified VA range must 
>>>>>> match one of the
>>>>>> + * mappings created with the VM_BIND ioctl. TLB is flushed upon 
>>>>>> unbind
>>>>>> + * completion.
>>>>>> + */
>>>>>> +struct drm_i915_gem_vm_unbind {
>>>>>> +    /** @vm_id: VM (address space) id to unbind from */
>>>>>> +    __u32 vm_id;
>>>>>> +
>>>>>> +    /** @rsvd: Reserved for future use; must be zero. */
>>>>>> +    __u32 rsvd;
>>>>>> +
>>>>>> +    /** @start: Virtual Address start to unbind */
>>>>>> +    __u64 start;
>>>>>> +
>>>>>> +    /** @length: Length of mapping to unbind */
>>>>>> +    __u64 length;
>>>>>> +
>>>>>> +    /** @flags: reserved for future usage, currently MBZ */
>>>>>> +    __u64 flags;
>>>>>> +
>>>>>> +    /** @extensions: 0-terminated chain of extensions for this 
>>>>>> mapping. */
>>>>>> +    __u64 extensions;
>>>>>> +};
>>>>>> +
>>>>>> +/**
>>>>>> + * struct drm_i915_vm_bind_fence - An input or output fence for 
>>>>>> the vm_bind
>>>>>> + * or the vm_unbind work.
>>>>>> + *
>>>>>> + * The vm_bind or vm_unbind async worker will wait for the input fence 
>>>>>> to signal
>>>>>> + * before starting the binding or unbinding.
>>>>>> + *
>>>>>> + * The vm_bind or vm_unbind async worker will signal the returned 
>>>>>> output fence
>>>>>> + * after the completion of binding or unbinding.
>>>>>> + */
>>>>>> +struct drm_i915_vm_bind_fence {
>>>>>> +    /** @handle: User's handle for a drm_syncobj to wait on or 
>>>>>> signal. */
>>>>>> +    __u32 handle;
>>>>>> +
>>>>>> +    /**
>>>>>> +     * @flags: Supported flags are,
>>>>>> +     *
>>>>>> +     * I915_VM_BIND_FENCE_WAIT:
>>>>>> +     * Wait for the input fence before binding/unbinding
>>>>>> +     *
>>>>>> +     * I915_VM_BIND_FENCE_SIGNAL:
>>>>>> +     * Return bind/unbind completion fence as output
>>>>>> +     */
>>>>>> +    __u32 flags;
>>>>>> +#define I915_VM_BIND_FENCE_WAIT            (1<<0)
>>>>>> +#define I915_VM_BIND_FENCE_SIGNAL          (1<<1)
>>>>>> +#define __I915_VM_BIND_FENCE_UNKNOWN_FLAGS 
>>>>>> (-(I915_VM_BIND_FENCE_SIGNAL << 1))
>>>>>> +};
>>>>>> +
>>>>>> +/**
>>>>>> + * struct drm_i915_vm_bind_ext_timeline_fences - Timeline fences 
>>>>>> for vm_bind
>>>>>> + * and vm_unbind.
>>>>>> + *
>>>>>> + * This structure describes an array of timeline drm_syncobj and 
>>>>>> associated
>>>>>> + * points for timeline variants of drm_syncobj. These timeline 
>>>>>> 'drm_syncobj's
>>>>>> + * can be input or output fences (See struct 
>>>>>> drm_i915_vm_bind_fence).
>>>>>> + */
>>>>>> +struct drm_i915_vm_bind_ext_timeline_fences {
>>>>>> +#define I915_VM_BIND_EXT_TIMELINE_FENCES    0
>>>>>> +    /** @base: Extension link. See struct i915_user_extension. */
>>>>>> +    struct i915_user_extension base;
>>>>>> +
>>>>>> +    /**
>>>>>> +     * @fence_count: Number of elements in the @handles_ptr & 
>>>>>> @values_ptr
>>>>>> +     * arrays.
>>>>>> +     */
>>>>>> +    __u64 fence_count;
>>>>>> +
>>>>>> +    /**
>>>>>> +     * @handles_ptr: Pointer to an array of struct 
>>>>>> drm_i915_vm_bind_fence
>>>>>> +     * of length @fence_count.
>>>>>> +     */
>>>>>> +    __u64 handles_ptr;
>>>>>> +
>>>>>> +    /**
>>>>>> +     * @values_ptr: Pointer to an array of u64 values of length
>>>>>> +     * @fence_count.
>>>>>> +     * Values must be 0 for a binary drm_syncobj. A value of 0 for a
>>>>>> +     * timeline drm_syncobj is invalid as it turns a drm_syncobj 
>>>>>> into a
>>>>>> +     * binary one.
>>>>>> +     */
>>>>>> +    __u64 values_ptr;
>>>>>> +};
>>>>>> +
>>>>>> +/**
>>>>>> + * struct drm_i915_vm_bind_user_fence - An input or output user 
>>>>>> fence for the
>>>>>> + * vm_bind or the vm_unbind work.
>>>>>> + *
>>>>>> + * The vm_bind or vm_unbind async worker will wait for the input 
>>>>>> fence (value at
>>>>>> + * @addr to become equal to @val) before starting the binding or 
>>>>>> unbinding.
>>>>>> + *
>>>>>> + * The vm_bind or vm_unbind async worker will signal the output 
>>>>>> fence after
>>>>>> + * the completion of binding or unbinding by writing @val to 
>>>>>> memory location at
>>>>>> + * @addr
>>>>>> + */
>>>>>> +struct drm_i915_vm_bind_user_fence {
>>>>>> +    /** @addr: User/Memory fence qword aligned process virtual 
>>>>>> address */
>>>>>> +    __u64 addr;
>>>>>> +
>>>>>> +    /** @val: User/Memory fence value to be written after bind 
>>>>>> completion */
>>>>>> +    __u64 val;
>>>>>> +
>>>>>> +    /**
>>>>>> +     * @flags: Supported flags are,
>>>>>> +     *
>>>>>> +     * I915_VM_BIND_USER_FENCE_WAIT:
>>>>>> +     * Wait for the input fence before binding/unbinding
>>>>>> +     *
>>>>>> +     * I915_VM_BIND_USER_FENCE_SIGNAL:
>>>>>> +     * Return bind/unbind completion fence as output
>>>>>> +     */
>>>>>> +    __u32 flags;
>>>>>> +#define I915_VM_BIND_USER_FENCE_WAIT            (1<<0)
>>>>>> +#define I915_VM_BIND_USER_FENCE_SIGNAL          (1<<1)
>>>>>> +#define __I915_VM_BIND_USER_FENCE_UNKNOWN_FLAGS \
>>>>>> +    (-(I915_VM_BIND_USER_FENCE_SIGNAL << 1))
>>>>>> +};
>>>>>> +
>>>>>> +/**
>>>>>> + * struct drm_i915_vm_bind_ext_user_fence - User/memory fences 
>>>>>> for vm_bind
>>>>>> + * and vm_unbind.
>>>>>> + *
>>>>>> + * These user fences can be input or output fences
>>>>>> + * (See struct drm_i915_vm_bind_user_fence).
>>>>>> + */
>>>>>> +struct drm_i915_vm_bind_ext_user_fence {
>>>>>> +#define I915_VM_BIND_EXT_USER_FENCES    1
>>>>>> +    /** @base: Extension link. See struct i915_user_extension. */
>>>>>> +    struct i915_user_extension base;
>>>>>> +
>>>>>> +    /** @fence_count: Number of elements in the @user_fence_ptr 
>>>>>> array. */
>>>>>> +    __u64 fence_count;
>>>>>> +
>>>>>> +    /**
>>>>>> +     * @user_fence_ptr: Pointer to an array of
>>>>>> +     * struct drm_i915_vm_bind_user_fence of length @fence_count.
>>>>>> +     */
>>>>>> +    __u64 user_fence_ptr;
>>>>>> +};
>>>>>> +
>>>>>> +/**
>>>>>> + * struct drm_i915_gem_execbuffer_ext_batch_addresses - Array of 
>>>>>> batch buffer
>>>>>> + * gpu virtual addresses.
>>>>>> + *
>>>>>> + * In the execbuff ioctl (See struct drm_i915_gem_execbuffer2), 
>>>>>> this extension
>>>>>> + * must always be appended in the VM_BIND mode and it will be an 
>>>>>> error to
>>>>>> + * append this extension in older non-VM_BIND mode.
>>>>>> + */
>>>>>> +struct drm_i915_gem_execbuffer_ext_batch_addresses {
>>>>>> +#define DRM_I915_GEM_EXECBUFFER_EXT_BATCH_ADDRESSES    1
>>>>>> +    /** @base: Extension link. See struct i915_user_extension. */
>>>>>> +    struct i915_user_extension base;
>>>>>> +
>>>>>> +    /** @count: Number of addresses in the addr array. */
>>>>>> +    __u32 count;
>>>>>> +
>>>>>> +    /** @addr: An array of batch gpu virtual addresses. */
>>>>>> +    __u64 addr[0];
>>>>>> +};
>>>>>> +
>>>>>> +/**
>>>>>> + * struct drm_i915_gem_execbuffer_ext_user_fence - First level 
>>>>>> batch completion
>>>>>> + * signaling extension.
>>>>>> + *
>>>>>> + * This extension allows user to attach a user fence (@addr, 
>>>>>> @value pair) to an
>>>>>> + * execbuf to be signaled by the command streamer after the 
>>>>>> completion of first
>>>>>> + * level batch, by writing the @value at specified @addr and 
>>>>>> triggering an
>>>>>> + * interrupt.
>>>>>> + * User can either poll for this user fence to signal or can also 
>>>>>> wait on it
>>>>>> + * with i915_gem_wait_user_fence ioctl.
>>>>>> + * This is very useful for long running contexts where 
>>>>>> waiting on dma-fence
>>>>>> + * by user (like i915_gem_wait ioctl) is not supported.
>>>>>> + */
>>>>>> +struct drm_i915_gem_execbuffer_ext_user_fence {
>>>>>> +#define DRM_I915_GEM_EXECBUFFER_EXT_USER_FENCE        2
>>>>>> +    /** @base: Extension link. See struct i915_user_extension. */
>>>>>> +    struct i915_user_extension base;
>>>>>> +
>>>>>> +    /**
>>>>>> +     * @addr: User/Memory fence qword aligned GPU virtual address.
>>>>>> +     *
>>>>>> +     * Address has to be a valid GPU virtual address at the time of
>>>>>> +     * first level batch completion.
>>>>>> +     */
>>>>>> +    __u64 addr;
>>>>>> +
>>>>>> +    /**
>>>>>> +     * @value: User/Memory fence Value to be written to above 
>>>>>> address
>>>>>> +     * after first level batch completes.
>>>>>> +     */
>>>>>> +    __u64 value;
>>>>>> +
>>>>>> +    /** @rsvd: Reserved for future extensions, MBZ */
>>>>>> +    __u64 rsvd;
>>>>>> +};
>>>>>> +
>>>>>> +/**
>>>>>> + * struct drm_i915_gem_create_ext_vm_private - Extension to make 
>>>>>> the object
>>>>>> + * private to the specified VM.
>>>>>> + *
>>>>>> + * See struct drm_i915_gem_create_ext.
>>>>>> + */
>>>>>> +struct drm_i915_gem_create_ext_vm_private {
>>>>>> +#define I915_GEM_CREATE_EXT_VM_PRIVATE        2
>>>>>> +    /** @base: Extension link. See struct i915_user_extension. */
>>>>>> +    struct i915_user_extension base;
>>>>>> +
>>>>>> +    /** @vm_id: Id of the VM to which the object is private */
>>>>>> +    __u32 vm_id;
>>>>>> +};
>>>>>> +
>>>>>> +/**
>>>>>> + * struct drm_i915_gem_wait_user_fence - Wait on user/memory fence.
>>>>>> + *
>>>>>> + * User/Memory fence can be woken up either by:
>>>>>> + *
>>>>>> + * 1. GPU context indicated by @ctx_id, or,
>>>>>> + * 2. Kernel driver async worker upon I915_UFENCE_WAIT_SOFT.
>>>>>> + *    @ctx_id is ignored when this flag is set.
>>>>>> + *
>>>>>> + * Wakeup condition is,
>>>>>> + * ``((*addr & mask) op (value & mask))``
>>>>>> + *
>>>>>> + * See :ref:`Documentation/driver-api/dma-buf.rst 
>>>>>> <indefinite_dma_fences>`
>>>>>> + */
>>>>>> +struct drm_i915_gem_wait_user_fence {
>>>>>> +    /** @extensions: Zero-terminated chain of extensions. */
>>>>>> +    __u64 extensions;
>>>>>> +
>>>>>> +    /** @addr: User/Memory fence address */
>>>>>> +    __u64 addr;
>>>>>> +
>>>>>> +    /** @ctx_id: Id of the Context which will signal the fence. */
>>>>>> +    __u32 ctx_id;
>>>>>> +
>>>>>> +    /** @op: Wakeup condition operator */
>>>>>> +    __u16 op;
>>>>>> +#define I915_UFENCE_WAIT_EQ      0
>>>>>> +#define I915_UFENCE_WAIT_NEQ     1
>>>>>> +#define I915_UFENCE_WAIT_GT      2
>>>>>> +#define I915_UFENCE_WAIT_GTE     3
>>>>>> +#define I915_UFENCE_WAIT_LT      4
>>>>>> +#define I915_UFENCE_WAIT_LTE     5
>>>>>> +#define I915_UFENCE_WAIT_BEFORE  6
>>>>>> +#define I915_UFENCE_WAIT_AFTER   7
>>>>>> +
>>>>>> +    /**
>>>>>> +     * @flags: Supported flags are,
>>>>>> +     *
>>>>>> +     * I915_UFENCE_WAIT_SOFT:
>>>>>> +     *
>>>>>> +     * To be woken up by i915 driver async worker (not by GPU).
>>>>>> +     *
>>>>>> +     * I915_UFENCE_WAIT_ABSTIME:
>>>>>> +     *
>>>>>> +     * Wait timeout specified as absolute time.
>>>>>> +     */
>>>>>> +    __u16 flags;
>>>>>> +#define I915_UFENCE_WAIT_SOFT    0x1
>>>>>> +#define I915_UFENCE_WAIT_ABSTIME 0x2
>>>>>> +
>>>>>> +    /** @value: Wakeup value */
>>>>>> +    __u64 value;
>>>>>> +
>>>>>> +    /** @mask: Wakeup mask */
>>>>>> +    __u64 mask;
>>>>>> +#define I915_UFENCE_WAIT_U8     0xffu
>>>>>> +#define I915_UFENCE_WAIT_U16    0xffffu
>>>>>> +#define I915_UFENCE_WAIT_U32    0xfffffffful
>>>>>> +#define I915_UFENCE_WAIT_U64    0xffffffffffffffffull
>>>>>> +
>>>>>> +    /**
>>>>>> +     * @timeout: Wait timeout in nanoseconds.
>>>>>> +     *
>>>>>> +     * If the I915_UFENCE_WAIT_ABSTIME flag is set, then the timeout 
>>>>>> is the
>>>>>> +     * absolute time in nsec.
>>>>>> +     */
>>>>>> +    __s64 timeout;
>>>>>> +};
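
To make those wait semantics concrete, a hedged usage fragment (values
arbitrary; assumes the definitions above are in scope):

    uint64_t fence_value;       /* qword-aligned user fence location */

    struct drm_i915_gem_wait_user_fence wait = {
            .addr    = (uintptr_t)&fence_value,
            .ctx_id  = ctx_id,  /* context expected to signal it */
            .op      = I915_UFENCE_WAIT_GTE,
            .flags   = 0,       /* relative timeout, GPU wakeup */
            .value   = 1,
            .mask    = I915_UFENCE_WAIT_U64,
            .timeout = 1000000000,      /* 1s, in ns */
    };

    /* Returns once ((*addr & mask) >= (value & mask)) or on timeout. */
    ioctl(fd, DRM_IOCTL_I915_GEM_WAIT_USER_FENCE, &wait);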

^ permalink raw reply	[flat|nested] 121+ messages in thread

>>>>>> +     *
>>>>>> +     * If I915_UFENCE_WAIT_ABSTIME flag is set, then time timeout 
>>>>>> is the
>>>>>> +     * absolute time in nsec.
>>>>>> +     */
>>>>>> +    __s64 timeout;
>>>>>> +};

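For illustration, waiting on such a user fence from userspace could look
roughly like the sketch below. This is not part of the patch; it assumes the
RFC-only header quoted above is available as "i915_vm_bind.h", and the
function and variable names are made up:

#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include "i915_vm_bind.h"   /* the RFC-only header quoted above */

/* Block until *fence_va equals `expected`, as signaled by ctx_id. */
static int wait_user_fence(int drm_fd, uint64_t fence_va, uint64_t expected,
                           uint32_t ctx_id, int64_t timeout_ns)
{
    struct drm_i915_gem_wait_user_fence wait;

    memset(&wait, 0, sizeof(wait));
    wait.addr = fence_va;               /* qword aligned virtual address */
    wait.ctx_id = ctx_id;
    wait.op = I915_UFENCE_WAIT_EQ;      /* wake when (*addr & mask) == value */
    wait.value = expected;
    wait.mask = I915_UFENCE_WAIT_U64;   /* compare all 64 bits */
    wait.timeout = timeout_ns;          /* relative; ABSTIME flag not set */

    return ioctl(drm_fd, DRM_IOCTL_I915_GEM_WAIT_USER_FENCE, &wait);
}

Setting I915_UFENCE_WAIT_SOFT in wait.flags would instead have the wait
serviced by the driver's async worker rather than a GPU context interrupt.
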
^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [Intel-gfx] [RFC v3 1/3] drm/doc/rfc: VM_BIND feature design document
  2022-06-08 21:55                           ` Jason Ekstrand
  (?)
  (?)
@ 2022-06-09 14:49                           ` Lionel Landwerlin
  2022-06-09 19:31                               ` Niranjana Vishwanathapura
  -1 siblings, 1 reply; 121+ messages in thread
From: Lionel Landwerlin @ 2022-06-09 14:49 UTC (permalink / raw)
  To: Jason Ekstrand, Niranjana Vishwanathapura
  Cc: Intel GFX, Maling list - DRI developers, Thomas Hellstrom,
	Chris Wilson, Daniel Vetter, Christian König

[-- Attachment #1: Type: text/plain, Size: 21791 bytes --]

On 09/06/2022 00:55, Jason Ekstrand wrote:
> On Wed, Jun 8, 2022 at 4:44 PM Niranjana Vishwanathapura 
> <niranjana.vishwanathapura@intel.com> wrote:
>
>     On Wed, Jun 08, 2022 at 08:33:25AM +0100, Tvrtko Ursulin wrote:
>     >
>     >
>     >On 07/06/2022 22:32, Niranjana Vishwanathapura wrote:
>     >>On Tue, Jun 07, 2022 at 11:18:11AM -0700, Niranjana
>     Vishwanathapura wrote:
>     >>>On Tue, Jun 07, 2022 at 12:12:03PM -0500, Jason Ekstrand wrote:
>     >>>> On Fri, Jun 3, 2022 at 6:52 PM Niranjana Vishwanathapura
>     >>>> <niranjana.vishwanathapura@intel.com> wrote:
>     >>>>
>     >>>>   On Fri, Jun 03, 2022 at 10:20:25AM +0300, Lionel Landwerlin
>     wrote:
>     >>>>   >   On 02/06/2022 23:35, Jason Ekstrand wrote:
>     >>>>   >
>     >>>>   >     On Thu, Jun 2, 2022 at 3:11 PM Niranjana Vishwanathapura
>     >>>>   >     <niranjana.vishwanathapura@intel.com> wrote:
>     >>>>   >
>     >>>>   >       On Wed, Jun 01, 2022 at 01:28:36PM -0700, Matthew
>     >>>>Brost wrote:
>     >>>>   >       >On Wed, Jun 01, 2022 at 05:25:49PM +0300, Lionel
>     Landwerlin
>     >>>>   wrote:
>     >>>>   >       >> On 17/05/2022 21:32, Niranjana Vishwanathapura
>     wrote:
>     >>>>   >       >> > +VM_BIND/UNBIND ioctl will immediately start
>     >>>>   binding/unbinding
>     >>>>   >       the mapping in an
>     >>>>   >       >> > +async worker. The binding and unbinding will
>     >>>>work like a
>     >>>>   special
>     >>>>   >       GPU engine.
>     >>>>   >       >> > +The binding and unbinding operations are
>     serialized and
>     >>>>   will
>     >>>>   >       wait on specified
>     >>>>   >       >> > +input fences before the operation and will
>     signal the
>     >>>>   output
>     >>>>   >       fences upon the
>     >>>>   >       >> > +completion of the operation. Due to
>     serialization,
>     >>>>   completion of
>     >>>>   >       an operation
>     >>>>   >       >> > +will also indicate that all previous operations
>     >>>>are also
>     >>>>   >       complete.
>     >>>>   >       >>
>     >>>>   >       >> I guess we should avoid saying "will immediately
>     start
>     >>>>   >       binding/unbinding" if
>     >>>>   >       >> there are fences involved.
>     >>>>   >       >>
>     >>>>   >       >> And the fact that it's happening in an async
>     >>>>worker seem to
>     >>>>   imply
>     >>>>   >       it's not
>     >>>>   >       >> immediate.
>     >>>>   >       >>
>     >>>>   >
>     >>>>   >       Ok, will fix.
>     >>>>   >       This was added because in earlier design binding
>     was deferred
>     >>>>   until
>     >>>>   >       next execbuff.
>     >>>>   >       But now it is non-deferred (immediate in that sense).
>     >>>>But yah,
>     >>>>   this is
>     >>>>   >       confusing
>     >>>>   >       and will fix it.
>     >>>>   >
>     >>>>   >       >>
>     >>>>   >       >> I have a question on the behavior of the bind
>     >>>>operation when
>     >>>>   no
>     >>>>   >       input fence
>     >>>>   >       >> is provided. Let say I do :
>     >>>>   >       >>
>     >>>>   >       >> VM_BIND (out_fence=fence1)
>     >>>>   >       >>
>     >>>>   >       >> VM_BIND (out_fence=fence2)
>     >>>>   >       >>
>     >>>>   >       >> VM_BIND (out_fence=fence3)
>     >>>>   >       >>
>     >>>>   >       >>
>     >>>>   >       >> In what order are the fences going to be signaled?
>     >>>>   >       >>
>     >>>>   >       >> In the order of VM_BIND ioctls? Or out of order?
>     >>>>   >       >>
>     >>>>   >       >> Because you wrote "serialized" I assume it's: in
>     order
>     >>>>   >       >>
>     >>>>   >
>     >>>>   >       Yes, in the order of VM_BIND/UNBIND ioctls. Note that
>     >>>>bind and
>     >>>>   unbind
>     >>>>   >       will use
>     >>>>   >       the same queue and hence are ordered.
>     >>>>   >
>     >>>>   >       >>
>     >>>>   >       >> One thing I didn't realize is that because we
>     only get one
>     >>>>   >       "VM_BIND" engine,
>     >>>>   >       >> there is a disconnect from the Vulkan specification.
>     >>>>   >       >>
>     >>>>   >       >> In Vulkan VM_BIND operations are serialized but
>     >>>>per engine.
>     >>>>   >       >>
>     >>>>   >       >> So you could have something like this :
>     >>>>   >       >>
>     >>>>   >       >> VM_BIND (engine=rcs0, in_fence=fence1,
>     out_fence=fence2)
>     >>>>   >       >>
>     >>>>   >       >> VM_BIND (engine=ccs0, in_fence=fence3,
>     out_fence=fence4)
>     >>>>   >       >>
>     >>>>   >       >>
>     >>>>   >       >> fence1 is not signaled
>     >>>>   >       >>
>     >>>>   >       >> fence3 is signaled
>     >>>>   >       >>
>     >>>>   >       >> So the second VM_BIND will proceed before the
>     >>>>first VM_BIND.
>     >>>>   >       >>
>     >>>>   >       >>
>     >>>>   >       >> I guess we can deal with that scenario in
>     >>>>userspace by doing
>     >>>>   the
>     >>>>   >       wait
>     >>>>   >       >> ourselves in one thread per engines.
>     >>>>   >       >>
>     >>>>   >       >> But then it makes the VM_BIND input fences useless.
>     >>>>   >       >>
>     >>>>   >       >>
>     >>>>   >       >> Daniel : what do you think? Should be rework
>     this or just
>     >>>>   deal with
>     >>>>   >       wait
>     >>>>   >       >> fences in userspace?
>     >>>>   >       >>
>     >>>>   >       >
>     >>>>   >       >My opinion is rework this but make the ordering via
>     >>>>an engine
>     >>>>   param
>     >>>>   >       optional.
>     >>>>   >       >
>     >>>>   >       >e.g. A VM can be configured so all binds are ordered
>     >>>>within the
>     >>>>   VM
>     >>>>   >       >
>     >>>>   >       >e.g. A VM can be configured so all binds accept an
>     engine
>     >>>>   argument
>     >>>>   >       (in
>     >>>>   >       >the case of the i915 likely this is a gem context
>     >>>>handle) and
>     >>>>   binds
>     >>>>   >       >ordered with respect to that engine.
>     >>>>   >       >
>     >>>>   >       >This gives UMDs options as the later likely consumes
>     >>>>more KMD
>     >>>>   >       resources
>     >>>>   >       >so if a different UMD can live with binds being
>     >>>>ordered within
>     >>>>   the VM
>     >>>>   >       >they can use a mode consuming less resources.
>     >>>>   >       >
>     >>>>   >
>     >>>>   >       I think we need to be careful here if we are
>     looking for some
>     >>>>   out of
>     >>>>   >       (submission) order completion of vm_bind/unbind.
>     >>>>   >       In-order completion means, in a batch of binds and
>     >>>>unbinds to be
>     >>>>   >       completed in-order, user only needs to specify
>     >>>>in-fence for the
>     >>>>   >       first bind/unbind call and the out-fence for the last
>     >>>>   bind/unbind
>     >>>>   >       call. Also, the VA released by an unbind call can be
>     >>>>re-used by
>     >>>>   >       any subsequent bind call in that in-order batch.
>     >>>>   >
>     >>>>   >       These things will break if binding/unbinding were to
>     >>>>be allowed
>     >>>>   to
>     >>>>   >       go out of order (of submission) and user need to be
>     extra
>     >>>>   careful
>     >>>>   >       not to run into premature triggering of the out-fence and bind
>     and bind
>     >>>>   failing
>     >>>>   >       as VA is still in use etc.
>     >>>>   >
>     >>>>   >       Also, VM_BIND binds the provided mapping on the
>     specified
>     >>>>   address
>     >>>>   >       space
>     >>>>   >       (VM). So, the uapi is not engine/context specific.
>     >>>>   >
>     >>>>   >       We can however add a 'queue' to the uapi which can be
>     >>>>one from
>     >>>>   the
>     >>>>   >       pre-defined queues,
>     >>>>   >       I915_VM_BIND_QUEUE_0
>     >>>>   >       I915_VM_BIND_QUEUE_1
>     >>>>   >       ...
>     >>>>   >       I915_VM_BIND_QUEUE_(N-1)
>     >>>>   >
>     >>>>   >       KMD will spawn an async work queue for each queue
>     which will
>     >>>>   only
>     >>>>   >       bind the mappings on that queue in the order of
>     submission.
>     >>>>   >       User can assign the queue to per engine or anything
>     >>>>like that.
>     >>>>   >
>     >>>>   >       But again here, user need to be careful and not
>     >>>>deadlock these
>     >>>>   >       queues with circular dependency of fences.
>     >>>>   >
>     >>>>   >       I prefer adding this later an as extension based on
>     >>>>whether it
>     >>>>   >       is really helping with the implementation.
>     >>>>   >
>     >>>>   >     I can tell you right now that having everything on a
>     single
>     >>>>   in-order
>     >>>>   >     queue will not get us the perf we want.  What vulkan
>     >>>>really wants
>     >>>>   is one
>     >>>>   >     of two things:
>     >>>>   >      1. No implicit ordering of VM_BIND ops.  They just
>     happen in
>     >>>>   whatever
>     >>>>   >     their dependencies are resolved and we ensure ordering
>     >>>>ourselves
>     >>>>   by
>     >>>>   >     having a syncobj in the VkQueue.
>     >>>>   >      2. The ability to create multiple VM_BIND queues. 
>     We need at
>     >>>>   least 2
>     >>>>   >     but I don't see why there needs to be a limit besides
>     >>>>the limits
>     >>>>   the
>     >>>>   >     i915 API already has on the number of engines. 
>     Vulkan could
>     >>>>   expose
>     >>>>   >     multiple sparse binding queues to the client if it's not
>     >>>>   arbitrarily
>     >>>>   >     limited.
>     >>>>
>     >>>>   Thanks Jason, Lionel.
>     >>>>
>     >>>>   Jason, what are you referring to when you say "limits the
>     i915 API
>     >>>>   already
>     >>>>   has on the number of engines"? I am not sure if there is
>     such an uapi
>     >>>>   today.
>     >>>>
>     >>>> There's a limit of something like 64 total engines today
>     based on the
>     >>>> number of bits we can cram into the exec flags in
>     execbuffer2.  I think
>     >>>> someone had an extended version that allowed more but I
>     ripped it out
>     >>>> because no one was using it.  Of course, execbuffer3 might not
>     >>>>have that
>     >>>> problem at all.
>     >>>>
>     >>>
>     >>>Thanks Jason.
>     >>>Ok, I am not sure which exec flag that is, but yah, execbuffer3
>     probably
>     >>>will not have this limitation. So, we need to define a
>     VM_BIND_MAX_QUEUE
>     >>>and somehow export it to user (I am thinking of embedding it in
>     >>>I915_PARAM_HAS_VM_BIND. bits[0]->HAS_VM_BIND, bits[1-3]->'n'
>     meaning 2^n
>     >>>queues).
>     >>
>     >>Ah, I think you are talking about I915_EXEC_RING_MASK (0x3f)
>     which execbuf3
>
>
> Yup!  That's exactly the limit I was talking about.
>
>     >>will also have. So, we can simply define in vm_bind/unbind
>     structures,
>     >>
>     >>#define I915_VM_BIND_MAX_QUEUE   64
>     >>        __u32 queue;
>     >>
>     >>I think that will keep things simple.
>     >
>     >Hmmm? What does execbuf2 limit has to do with how many engines
>     >hardware can have? I suggest not to do that.
>     >
>     >Change with added this:
>     >
>     >       if (set.num_engines > I915_EXEC_RING_MASK + 1)
>     >               return -EINVAL;
>     >
>     >To context creation needs to be undone and so let users create
>     engine
>     >maps with all hardware engines, and let execbuf3 access them all.
>     >
>
>     The earlier plan was to carry I915_EXEC_RING_MASK (0x3f) to execbuff3 also.
>     Hence, I was using the same limit for VM_BIND queues (64, or 65 if we
>     make it N+1).
>     But, as discussed in another thread of this RFC series, we are planning
>     to drop this I915_EXEC_RING_MASK in execbuff3. So, there won't be
>     any uapi that limits the number of engines (and hence the number of
>     vm_bind queues that need to be supported).
>
>     If we leave the number of vm_bind queues to be arbitrarily large
>     (__u32 queue_idx) then we need to have a hashmap for queue (a wq,
>     work_item and a linked list) lookup from the user-specified queue
>     index.
>     The other option is to just put some hard limit (say 64 or 65) and use
>     an array of queues in VM (each created upon first use). I prefer this.
>
>
> I don't get why a VM_BIND queue is any different from any other queue 
> or userspace-visible kernel object.  But I'll leave those details up 
> to danvet or whoever else might be reviewing the implementation.
>
> --Jason


I kind of agree here. Wouldn't it be simpler to have the bind queue created 
like the others when we build the engine map?

For userspace it's then just a matter of selecting the right queue ID when 
submitting.

If there is ever a possibility to have this work on the GPU, it would be 
all ready.


Thanks,


-Lionel


>
>
>     Niranjana
>
>     >Regards,
>     >
>     >Tvrtko
>     >
>     >>
>     >>Niranjana
>     >>
>     >>>
>     >>>>   I am trying to see how many queues we need and don't want
>     it to be
>     >>>>   arbitrarily
>     >>>>   large and unduly blow up memory usage and complexity in the
>     i915 driver.
>     >>>>
>     >>>> I expect a Vulkan driver to use at most 2 in the vast majority
>     >>>>of cases. I
>     >>>> could imagine a client wanting to create more than 1 sparse
>     >>>>queue in which
>     >>>> case, it'll be N+1 but that's unlikely.  As far as complexity
>     >>>>goes, once
>     >>>> you allow two, I don't think the complexity is going up by
>     >>>>allowing N.  As
>     >>>> for memory usage, creating more queues means more memory. 
>     That's a
>     >>>> trade-off that userspace can make.  Again, the expected number
>     >>>>here is 1
>     >>>> or 2 in the vast majority of cases so I don't think you need
>     to worry.
>     >>>
>     >>>Ok, will start with n=3 meaning 8 queues.
>     >>>That would require us create 8 workqueues.
>     >>>We can change 'n' later if required.
>     >>>
>     >>>Niranjana
>     >>>
>     >>>>
>     >>>>   >     Why?  Because Vulkan has two basic kind of bind
>     >>>>operations and we
>     >>>>   don't
>     >>>>   >     want any dependencies between them:
>     >>>>   >      1. Immediate.  These happen right after BO creation or
>     >>>>maybe as
>     >>>>   part of
>     >>>>   >     vkBindImageMemory() or VkBindBufferMemory().  These
>     >>>>don't happen
>     >>>>   on a
>     >>>>   >     queue and we don't want them serialized with
>     anything.  To
>     >>>>   synchronize
>     >>>>   >     with submit, we'll have a syncobj in the VkDevice
>     which is
>     >>>>   signaled by
>     >>>>   >     all immediate bind operations and make submits wait
>     on it.
>     >>>>   >      2. Queued (sparse): These happen on a VkQueue which
>     may be the
>     >>>>   same as
>     >>>>   >     a render/compute queue or may be its own queue.  It's
>     up to us
>     >>>>   what we
>     >>>>   >     want to advertise.  From the Vulkan API PoV, this is
>     like any
>     >>>>   other
>     >>>>   >     queue.  Operations on it wait on and signal
>     semaphores.  If we
>     >>>>   have a
>     >>>>   >     VM_BIND engine, we'd provide syncobjs to wait and
>     >>>>signal just like
>     >>>>   we do
>     >>>>   >     in execbuf().
>     >>>>   >     The important thing is that we don't want one type of
>     >>>>operation to
>     >>>>   block
>     >>>>   >     on the other.  If immediate binds are blocking on
>     sparse binds,
>     >>>>   it's
>     >>>>   >     going to cause over-synchronization issues.
>     >>>>   >     In terms of the internal implementation, I know that
>     >>>>there's going
>     >>>>   to be
>     >>>>   >     a lock on the VM and that we can't actually do these
>     things in
>     >>>>   >     parallel.  That's fine.  Once the dma_fences have
>     signaled and
>     >>>>   we're
>     >>>>
>     >>>>   Thats correct. It is like a single VM_BIND engine with
>     >>>>multiple queues
>     >>>>   feeding to it.
>     >>>>
>     >>>> Right.  As long as the queues themselves are independent and
>     >>>>can block on
>     >>>> dma_fences without holding up other queues, I think we're fine.
>     >>>>
>     >>>>   >     unblocked to do the bind operation, I don't care if
>     >>>>there's a bit
>     >>>>   of
>     >>>>   >     synchronization due to locking. That's expected.  What
>     >>>>we can't
>     >>>>   afford
>     >>>>   >     to have is an immediate bind operation suddenly
>     blocking on a
>     >>>>   sparse
>     >>>>   >     operation which is blocked on a compute job that's
>     going to run
>     >>>>   for
>     >>>>   >     another 5ms.
>     >>>>
>     >>>>   As the VM_BIND queue is per VM, VM_BIND on one VM doesn't
>     block the
>     >>>>   VM_BIND
>     >>>>   on other VMs. I am not sure about use cases here, but just
>     wanted to
>     >>>>   clarify.
>     >>>>
>     >>>> Yes, that's what I would expect.
>     >>>> --Jason
>     >>>>
>     >>>>   Niranjana
>     >>>>
>     >>>>   >     For reference, Windows solves this by allowing
>     arbitrarily many
>     >>>>   paging
>     >>>>   >     queues (what they call a VM_BIND engine/queue).  That
>     >>>>design works
>     >>>>   >     pretty well and solves the problems in question.
>     >>>>Again, we could
>     >>>>   just
>     >>>>   >     make everything out-of-order and require using syncobjs
>     >>>>to order
>     >>>>   things
>     >>>>   >     as userspace wants. That'd be fine too.
>     >>>>   >     One more note while I'm here: danvet said something on
>     >>>>IRC about
>     >>>>   VM_BIND
>     >>>>   >     queues waiting for syncobjs to materialize.  We don't
>     really
>     >>>>   want/need
>     >>>>   >     this.  We already have all the machinery in userspace
>     to handle
>     >>>>   >     wait-before-signal and waiting for syncobj fences to
>     >>>>materialize
>     >>>>   and
>     >>>>   >     that machinery is on by default.  It would actually
>     >>>>take MORE work
>     >>>>   in
>     >>>>   >     Mesa to turn it off and take advantage of the kernel
>     >>>>being able to
>     >>>>   wait
>     >>>>   >     for syncobjs to materialize. Also, getting that right is
>     >>>>   ridiculously
>     >>>>   >     hard and I really don't want to get it wrong in kernel
>     >>>>space.     When we
>     >>>>   >     do memory fences, wait-before-signal will be a
>     thing.  We don't
>     >>>>   need to
>     >>>>   >     try and make it a thing for syncobj.
>     >>>>   >     --Jason
>     >>>>   >
>     >>>>   >   Thanks Jason,
>     >>>>   >
>     >>>>   >   I missed the bit in the Vulkan spec that we're allowed
>     to have a
>     >>>>   sparse
>     >>>>   >   queue that does not implement either graphics or compute
>     >>>>operations
>     >>>>   :
>     >>>>   >
>     >>>>   >     "While some implementations may include
>     >>>>   VK_QUEUE_SPARSE_BINDING_BIT
>     >>>>   >     support in queue families that also include
>     >>>>   >
>     >>>>   >      graphics and compute support, other implementations
>     may only
>     >>>>   expose a
>     >>>>   >     VK_QUEUE_SPARSE_BINDING_BIT-only queue
>     >>>>   >
>     >>>>   >      family."
>     >>>>   >
>     >>>>   >   So it can all be all a vm_bind engine that just does
>     bind/unbind
>     >>>>   >   operations.
>     >>>>   >
>     >>>>   >   But yes we need another engine for the immediate/non-sparse
>     >>>>   operations.
>     >>>>   >
>     >>>>   >   -Lionel
>     >>>>   >
>     >>>>   >         >
>     >>>>   >       Daniel, any thoughts?
>     >>>>   >
>     >>>>   >       Niranjana
>     >>>>   >
>     >>>>   >       >Matt
>     >>>>   >       >
>     >>>>   >       >>
>     >>>>   >       >> Sorry I noticed this late.
>     >>>>   >       >>
>     >>>>   >       >>
>     >>>>   >       >> -Lionel
>     >>>>   >       >>
>     >>>>   >       >>
>

[-- Attachment #2: Type: text/html, Size: 36501 bytes --]

^ permalink raw reply	[flat|nested] 121+ messages in thread
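
For illustration, the getparam encoding floated above (bit 0 for VM_BIND
availability, bits 1-3 carrying 'n', with 2^n queues) would decode as in the
sketch below. The encoding was only a proposal in this thread, so the helper
is hypothetical:

#include <stdint.h>

/*
 * Hypothetical decode of the proposed I915_PARAM_HAS_VM_BIND value:
 * bit[0]    -> VM_BIND supported
 * bits[1:3] -> n, where the number of VM_BIND queues is 2^n
 */
static unsigned int vm_bind_num_queues(uint64_t param)
{
    if (!(param & 0x1))
        return 0;                       /* VM_BIND not supported */

    return 1u << ((param >> 1) & 0x7);  /* n in bits[1:3] => 2^n queues */
}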

* Re: [Intel-gfx] [RFC v3 3/3] drm/doc/rfc: VM_BIND uapi definition
  2022-06-09  8:36               ` Matthew Auld
@ 2022-06-09 18:53                 ` Niranjana Vishwanathapura
  -1 siblings, 0 replies; 121+ messages in thread
From: Niranjana Vishwanathapura @ 2022-06-09 18:53 UTC (permalink / raw)
  To: Matthew Auld
  Cc: Tvrtko Ursulin, intel-gfx, chris.p.wilson, thomas.hellstrom,
	dri-devel, daniel.vetter, christian.koenig

On Thu, Jun 09, 2022 at 09:36:48AM +0100, Matthew Auld wrote:
>On 08/06/2022 22:32, Niranjana Vishwanathapura wrote:
>>On Wed, Jun 08, 2022 at 10:12:05AM +0100, Matthew Auld wrote:
>>>On 08/06/2022 08:17, Tvrtko Ursulin wrote:
>>>>
>>>>On 07/06/2022 20:37, Niranjana Vishwanathapura wrote:
>>>>>On Tue, Jun 07, 2022 at 11:27:14AM +0100, Tvrtko Ursulin wrote:
>>>>>>
>>>>>>On 17/05/2022 19:32, Niranjana Vishwanathapura wrote:
>>>>>>>VM_BIND and related uapi definitions
>>>>>>>
>>>>>>>v2: Ensure proper kernel-doc formatting with cross references.
>>>>>>>    Also add new uapi and documentation as per review comments
>>>>>>>    from Daniel.
>>>>>>>
>>>>>>>Signed-off-by: Niranjana Vishwanathapura 
>>>>>>><niranjana.vishwanathapura@intel.com>
>>>>>>>---
>>>>>>> Documentation/gpu/rfc/i915_vm_bind.h | 399 
>>>>>>>+++++++++++++++++++++++++++
>>>>>>> 1 file changed, 399 insertions(+)
>>>>>>> create mode 100644 Documentation/gpu/rfc/i915_vm_bind.h
>>>>>>>
>>>>>>>diff --git a/Documentation/gpu/rfc/i915_vm_bind.h 
>>>>>>>b/Documentation/gpu/rfc/i915_vm_bind.h
>>>>>>>new file mode 100644
>>>>>>>index 000000000000..589c0a009107
>>>>>>>--- /dev/null
>>>>>>>+++ b/Documentation/gpu/rfc/i915_vm_bind.h
>>>>>>>@@ -0,0 +1,399 @@
>>>>>>>+/* SPDX-License-Identifier: MIT */
>>>>>>>+/*
>>>>>>>+ * Copyright © 2022 Intel Corporation
>>>>>>>+ */
>>>>>>>+
>>>>>>>+/**
>>>>>>>+ * DOC: I915_PARAM_HAS_VM_BIND
>>>>>>>+ *
>>>>>>>+ * VM_BIND feature availability.
>>>>>>>+ * See typedef drm_i915_getparam_t param.
>>>>>>>+ */
>>>>>>>+#define I915_PARAM_HAS_VM_BIND        57
>>>>>>>+
>>>>>>>+/**
>>>>>>>+ * DOC: I915_VM_CREATE_FLAGS_USE_VM_BIND
>>>>>>>+ *
>>>>>>>+ * Flag to opt-in for VM_BIND mode of binding during VM creation.
>>>>>>>+ * See struct drm_i915_gem_vm_control flags.
>>>>>>>+ *
>>>>>>>+ * A VM in VM_BIND mode will not support the older 
>>>>>>>execbuff mode of binding.
>>>>>>>+ * In VM_BIND mode, execbuff ioctl will not accept any 
>>>>>>>execlist (i.e., the
>>>>>>>+ * &drm_i915_gem_execbuffer2.buffer_count must be 0).
>>>>>>>+ * Also, &drm_i915_gem_execbuffer2.batch_start_offset and
>>>>>>>+ * &drm_i915_gem_execbuffer2.batch_len must be 0.
>>>>>>>+ * DRM_I915_GEM_EXECBUFFER_EXT_BATCH_ADDRESSES extension 
>>>>>>>must be provided
>>>>>>>+ * to pass in the batch buffer addresses.
>>>>>>>+ *
>>>>>>>+ * Additionally, I915_EXEC_NO_RELOC, I915_EXEC_HANDLE_LUT and
>>>>>>>+ * I915_EXEC_BATCH_FIRST of 
>>>>>>>&drm_i915_gem_execbuffer2.flags must be 0
>>>>>>>+ * (not used) in VM_BIND mode. I915_EXEC_USE_EXTENSIONS 
>>>>>>>flag must always be
>>>>>>>+ * set (See struct drm_i915_gem_execbuffer_ext_batch_addresses).
>>>>>>>+ * The buffers_ptr, buffer_count, batch_start_offset and 
>>>>>>>batch_len fields
>>>>>>>+ * of struct drm_i915_gem_execbuffer2 are also not used 
>>>>>>>and must be 0.
>>>>>>>+ */
>>>>>>>+#define I915_VM_CREATE_FLAGS_USE_VM_BIND    (1 << 0)
>>>>>>>+
>>>>>>>+/**
>>>>>>>+ * DOC: I915_CONTEXT_CREATE_FLAGS_LONG_RUNNING
>>>>>>>+ *
>>>>>>>+ * Flag to declare context as long running.
>>>>>>>+ * See struct drm_i915_gem_context_create_ext flags.
>>>>>>>+ *
>>>>>>>+ * Usage of dma-fences expects that they complete in 
>>>>>>>a reasonable amount of time.
>>>>>>>+ * Compute, on the other hand, can be long running. Hence 
>>>>>>>it is not appropriate
>>>>>>>+ * for compute contexts to export a request completion 
>>>>>>>dma-fence to the user.
>>>>>>>+ * The dma-fence usage will be limited to in-kernel 
>>>>>>>consumption only.
>>>>>>>+ * Compute contexts need to use a user/memory fence.
>>>>>>>+ *
>>>>>>>+ * So, long running contexts do not support output fences. Hence,
>>>>>>>+ * I915_EXEC_FENCE_OUT (See &drm_i915_gem_execbuffer2.flags) and
>>>>>>>+ * I915_EXEC_FENCE_SIGNAL (See 
>>>>>>>&drm_i915_gem_exec_fence.flags) are expected
>>>>>>>+ * not to be used.
>>>>>>>+ *
>>>>>>>+ * The DRM_I915_GEM_WAIT ioctl call is also not supported for 
>>>>>>>objects mapped
>>>>>>>+ * to long running contexts.
>>>>>>>+ */
>>>>>>>+#define I915_CONTEXT_CREATE_FLAGS_LONG_RUNNING   (1u << 2)
>>>>>>>+
>>>>>>>+/* VM_BIND related ioctls */
>>>>>>>+#define DRM_I915_GEM_VM_BIND        0x3d
>>>>>>>+#define DRM_I915_GEM_VM_UNBIND        0x3e
>>>>>>>+#define DRM_I915_GEM_WAIT_USER_FENCE    0x3f
>>>>>>>+
>>>>>>>+#define DRM_IOCTL_I915_GEM_VM_BIND 
>>>>>>>DRM_IOWR(DRM_COMMAND_BASE + DRM_I915_GEM_VM_BIND, struct 
>>>>>>>drm_i915_gem_vm_bind)
>>>>>>>+#define DRM_IOCTL_I915_GEM_VM_UNBIND 
>>>>>>>DRM_IOWR(DRM_COMMAND_BASE + DRM_I915_GEM_VM_UNBIND, struct 
>>>>>>>drm_i915_gem_vm_bind)
>>>>>>>+#define DRM_IOCTL_I915_GEM_WAIT_USER_FENCE 
>>>>>>>DRM_IOWR(DRM_COMMAND_BASE + DRM_I915_GEM_WAIT_USER_FENCE, 
>>>>>>>struct drm_i915_gem_wait_user_fence)
>>>>>>>+
>>>>>>>+/**
>>>>>>>+ * struct drm_i915_gem_vm_bind - VA to object mapping to bind.
>>>>>>>+ *
>>>>>>>+ * This structure is passed to VM_BIND ioctl and 
>>>>>>>specifies the mapping of GPU
>>>>>>>+ * virtual address (VA) range to the section of an object 
>>>>>>>that should be bound
>>>>>>>+ * in the device page table of the specified address space (VM).
>>>>>>>+ * The VA range specified must be unique (i.e., not 
>>>>>>>currently bound) and can
>>>>>>>+ * be mapped to whole object or a section of the object 
>>>>>>>(partial binding).
>>>>>>>+ * Multiple VA mappings can be created to the same 
>>>>>>>section of the object
>>>>>>>+ * (aliasing).
>>>>>>>+ */
>>>>>>>+struct drm_i915_gem_vm_bind {
>>>>>>>+    /** @vm_id: VM (address space) id to bind */
>>>>>>>+    __u32 vm_id;
>>>>>>>+
>>>>>>>+    /** @handle: Object handle */
>>>>>>>+    __u32 handle;
>>>>>>>+
>>>>>>>+    /** @start: Virtual Address start to bind */
>>>>>>>+    __u64 start;
>>>>>>>+
>>>>>>>+    /** @offset: Offset in object to bind */
>>>>>>>+    __u64 offset;
>>>>>>>+
>>>>>>>+    /** @length: Length of mapping to bind */
>>>>>>>+    __u64 length;
>>>>>>
>>>>>>Does it support, or should it, an equivalent of 
>>>>>>EXEC_OBJECT_PAD_TO_SIZE? Or if not, is userspace expected to 
>>>>>>map the remainder of the space to a dummy object? In which 
>>>>>>case, would there be any alignment/padding issues preventing 
>>>>>>the two binds from being placed next to each other?
>>>>>>
>>>>>>I ask because someone from the compute side asked me about a 
>>>>>>problem with their strategy of dealing with overfetch and I 
>>>>>>suggested pad to size.
>>>>>>
>>>>>
>>>>>Thanks Tvrtko,
>>>>>I think we shouldn't be needing it. As with VM_BIND VA assignment
>>>>>is completely pushed to userspace, no padding should be necessary
>>>>>once the 'start' and 'size' alignment conditions are met.
>>>>>
>>>>>I will add some documentation on alignment requirement here.
>>>>>Generally, 'start' and 'size' should be 4K aligned. But, I think
>>>>>when we have 64K lmem page sizes (dg2 and xehpsdv), they need to
>>>>>be 64K aligned.
>>>>
>>>>+ Matt
>>>>
>>>>Align to 64k is enough for all overfetch issues?
>>>>
>>>>Apparently compute has a situation where a buffer is received by 
>>>>one component and another has to apply more alignment to it, to 
>>>>deal with overfetch. Since they cannot grow the actual BO if 
>>>>they wanted to VM_BIND a scratch area on top? Or perhaps none of 
>>>>this is a problem on discrete and original BO should be 
>>>>correctly allocated to start with.
>>>>
>>>>Side question - what about the align to 2MiB mentioned in 
>>>>i915_vma_insert to avoid mixing 4k and 64k PTEs? That does not 
>>>>apply to discrete?
>>>
>>>Not sure about the overfetch thing, but yeah dg2 & xehpsdv both 
>>>require a minimum of 64K pages underneath for local memory, and 
>>>the BO size will also be rounded up accordingly. And yeah the 
>>>complication arises due to not being able to mix 4K + 64K GTT 
>>>pages within the same page-table (existed since even gen8). Note 
>>>that 4K here is what we typically get for system memory.
>>>
>>>Originally we had a memory coloring scheme to track the "color" of 
>>>each page-table, which basically ensures that userspace can't do 
>>>something nasty like mixing page sizes. The advantage of that 
>>>scheme is that we would only require 64K GTT alignment and no 
>>>extra padding, but is perhaps a little complex.
>>>
>>>The merged solution is just to align and pad (i.e. vma->node.size 
>>>and not vma->size) out of the vma to 2M, which is dead simple 
>>>implementation wise, but does potentially waste some GTT space and 
>>>some of the local memory used for the actual page-table. For the 
>>>alignment the kernel just validates that the GTT address is 
>>>aligned to 2M in vma_insert(), and then for the padding it just 
>>>inflates it to 2M, if userspace hasn't already.
>>>
>>>See the kernel-doc for @size: https://dri.freedesktop.org/docs/drm/gpu/driver-uapi.html?#c.drm_i915_gem_create_ext
>>>
>>>
>>
>>Ok, those requirements (2M VA alignment) will apply to VM_BIND also.
>>This is unfortunate, but it is not something new enforced by VM_BIND.
>>The other option is to go with 64K alignment and, in the VM_BIND case, the
>>user must ensure there is no mixing of 64K (lmem) and 4K (smem)
>>mappings in the same 2M range. But this is not VM_BIND specific
>>(will apply to soft-pinning in execbuf2 also).
>>
>>I don't think we need any VA padding here as with VM_BIND VA is
>>managed fully by the user. If we enforce VA to be 2M aligned, it
>>will leave holes (if BOs are smaller than 2M), but nobody is going
>>to allocate anything from there.
>
>Note that we only apply the 2M alignment + padding for local memory 
>pages, for system memory we don't have/need such restrictions. The VA 
>padding then importantly prevents userspace from incorrectly (or 
>maliciously) inserting 4K system memory object in some page-table 
>operating in 64K GTT mode.
>

Thanks Matt.
I also synced offline with Matt a bit on this.
We don't need an explicit 'pad_to_size'. The i915 driver is implicitly
padding the size to a 2M boundary for LMEM BOs, which will apply to
VM_BIND also.
The remaining question is whether we enforce 2M VA alignment for
lmem BOs (just like the legacy execbuff path) on dg2 & xehpsdv, or go with
just 64K alignment but ensure there is no mixing of 4K and 64K
mappings in the same 2M range. I think we can go with the 2M alignment
requirement for VM_BIND also. So, no new requirements here for VM_BIND.

I will update the documentation.

Niranjana
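
To restate that conclusion as code, a userspace-side validity check might
look like the sketch below. The helper name and the lmem/smem split are
illustrative; the constants follow the rule described above for dg2/xehpsdv:

#include <stdbool.h>
#include <stdint.h>

#define VA_ALIGN_SMEM 0x1000ull    /* 4K: system memory mappings */
#define VA_ALIGN_LMEM 0x200000ull  /* 2M: local memory on dg2/xehpsdv */

/*
 * Sketch of the VM_BIND VA rule discussed above: lmem-backed mappings take
 * a 2M-aligned GPU VA (i915 already pads the LMEM BO size to 2M), so 4K
 * (smem) and 64K (lmem) PTEs never share one 2M page-table range.
 */
static bool vm_bind_start_ok(uint64_t start, bool is_lmem)
{
    uint64_t align = is_lmem ? VA_ALIGN_LMEM : VA_ALIGN_SMEM;

    return (start & (align - 1)) == 0;
}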

>>
>>Niranjana
>>
>>>>
>>>>Regards,
>>>>
>>>>Tvrtko
>>>>
>>>>>
>>>>>Niranjana
>>>>>
>>>>>>Regards,
>>>>>>
>>>>>>Tvrtko
>>>>>>
>>>>>>>+
>>>>>>>+    /**
>>>>>>>+     * @flags: Supported flags are,
>>>>>>>+     *
>>>>>>>+     * I915_GEM_VM_BIND_READONLY:
>>>>>>>+     * Mapping is read-only.
>>>>>>>+     *
>>>>>>>+     * I915_GEM_VM_BIND_CAPTURE:
>>>>>>>+     * Capture this mapping in the dump upon GPU error.
>>>>>>>+     */
>>>>>>>+    __u64 flags;
>>>>>>>+#define I915_GEM_VM_BIND_READONLY    (1 << 0)
>>>>>>>+#define I915_GEM_VM_BIND_CAPTURE     (1 << 1)
>>>>>>>+
>>>>>>>+    /** @extensions: 0-terminated chain of extensions for 
>>>>>>>this mapping. */
>>>>>>>+    __u64 extensions;
>>>>>>>+};
>>>>>>>+
>>>>>>>+/**
>>>>>>>+ * struct drm_i915_gem_vm_unbind - VA to object mapping to unbind.
>>>>>>>+ *
>>>>>>>+ * This structure is passed to VM_UNBIND ioctl and 
>>>>>>>specifies the GPU virtual
>>>>>>>+ * address (VA) range that should be unbound from the 
>>>>>>>device page table of the
>>>>>>>+ * specified address space (VM). The specified VA range 
>>>>>>>must match one of the
>>>>>>>+ * mappings created with the VM_BIND ioctl. TLB is 
>>>>>>>flushed upon unbind
>>>>>>>+ * completion.
>>>>>>>+ */
>>>>>>>+struct drm_i915_gem_vm_unbind {
>>>>>>>+    /** @vm_id: VM (address space) id to bind */
>>>>>>>+    __u32 vm_id;
>>>>>>>+
>>>>>>>+    /** @rsvd: Reserved for future use; must be zero. */
>>>>>>>+    __u32 rsvd;
>>>>>>>+
>>>>>>>+    /** @start: Virtual Address start to unbind */
>>>>>>>+    __u64 start;
>>>>>>>+
>>>>>>>+    /** @length: Length of mapping to unbind */
>>>>>>>+    __u64 length;
>>>>>>>+
>>>>>>>+    /** @flags: reserved for future usage, currently MBZ */
>>>>>>>+    __u64 flags;
>>>>>>>+
>>>>>>>+    /** @extensions: 0-terminated chain of extensions for 
>>>>>>>this mapping. */
>>>>>>>+    __u64 extensions;
>>>>>>>+};
>>>>>>>+
>>>>>>>+/**
>>>>>>>+ * struct drm_i915_vm_bind_fence - An input or output 
>>>>>>>fence for the vm_bind
>>>>>>>+ * or the vm_unbind work.
>>>>>>>+ *
>>>>>>>+ * The vm_bind or vm_unbind async worker will wait for 
>>>>>>>the input fence to signal
>>>>>>>+ * before starting the binding or unbinding.
>>>>>>>+ *
>>>>>>>+ * The vm_bind or vm_unbind async worker will signal the 
>>>>>>>returned output fence
>>>>>>>+ * after the completion of binding or unbinding.
>>>>>>>+ */
>>>>>>>+struct drm_i915_vm_bind_fence {
>>>>>>>+    /** @handle: User's handle for a drm_syncobj to wait 
>>>>>>>on or signal. */
>>>>>>>+    __u32 handle;
>>>>>>>+
>>>>>>>+    /**
>>>>>>>+     * @flags: Supported flags are,
>>>>>>>+     *
>>>>>>>+     * I915_VM_BIND_FENCE_WAIT:
>>>>>>>+     * Wait for the input fence before binding/unbinding
>>>>>>>+     *
>>>>>>>+     * I915_VM_BIND_FENCE_SIGNAL:
>>>>>>>+     * Return bind/unbind completion fence as output
>>>>>>>+     */
>>>>>>>+    __u32 flags;
>>>>>>>+#define I915_VM_BIND_FENCE_WAIT            (1<<0)
>>>>>>>+#define I915_VM_BIND_FENCE_SIGNAL          (1<<1)
>>>>>>>+#define __I915_VM_BIND_FENCE_UNKNOWN_FLAGS 
>>>>>>>(-(I915_VM_BIND_FENCE_SIGNAL << 1))
>>>>>>>+};
>>>>>>>+
>>>>>>>+/**
>>>>>>>+ * struct drm_i915_vm_bind_ext_timeline_fences - Timeline 
>>>>>>>fences for vm_bind
>>>>>>>+ * and vm_unbind.
>>>>>>>+ *
>>>>>>>+ * This structure describes an array of timeline 
>>>>>>>drm_syncobj and associated
>>>>>>>+ * points for timeline variants of drm_syncobj. These 
>>>>>>>timeline 'drm_syncobj's
>>>>>>>+ * can be input or output fences (See struct 
>>>>>>>drm_i915_vm_bind_fence).
>>>>>>>+ */
>>>>>>>+struct drm_i915_vm_bind_ext_timeline_fences {
>>>>>>>+#define I915_VM_BIND_EXT_TIMELINE_FENCES    0
>>>>>>>+    /** @base: Extension link. See struct i915_user_extension. */
>>>>>>>+    struct i915_user_extension base;
>>>>>>>+
>>>>>>>+    /**
>>>>>>>+     * @fence_count: Number of elements in the 
>>>>>>>@handles_ptr & @value_ptr
>>>>>>>+     * arrays.
>>>>>>>+     */
>>>>>>>+    __u64 fence_count;
>>>>>>>+
>>>>>>>+    /**
>>>>>>>+     * @handles_ptr: Pointer to an array of struct 
>>>>>>>drm_i915_vm_bind_fence
>>>>>>>+     * of length @fence_count.
>>>>>>>+     */
>>>>>>>+    __u64 handles_ptr;
>>>>>>>+
>>>>>>>+    /**
>>>>>>>+     * @values_ptr: Pointer to an array of u64 values of length
>>>>>>>+     * @fence_count.
>>>>>>>+     * Values must be 0 for a binary drm_syncobj. A Value of 0 for a
>>>>>>>+     * timeline drm_syncobj is invalid as it turns a 
>>>>>>>drm_syncobj into a
>>>>>>>+     * binary one.
>>>>>>>+     */
>>>>>>>+    __u64 values_ptr;
>>>>>>>+};
>>>>>>>+
>>>>>>>+/**
>>>>>>>+ * struct drm_i915_vm_bind_user_fence - An input or 
>>>>>>>output user fence for the
>>>>>>>+ * vm_bind or the vm_unbind work.
>>>>>>>+ *
>>>>>>>+ * The vm_bind or vm_unbind async worker will wait for the 
>>>>>>>input fence (value at
>>>>>>>+ * @addr to become equal to @val) before starting the 
>>>>>>>binding or unbinding.
>>>>>>>+ *
>>>>>>>+ * The vm_bind or vm_unbind async worker will signal the 
>>>>>>>output fence after
>>>>>>>+ * the completion of binding or unbinding by writing @val 
>>>>>>>to memory location at
>>>>>>>+ * @addr
>>>>>>>+ */
>>>>>>>+struct drm_i915_vm_bind_user_fence {
>>>>>>>+    /** @addr: User/Memory fence qword aligned process 
>>>>>>>virtual address */
>>>>>>>+    __u64 addr;
>>>>>>>+
>>>>>>>+    /** @val: User/Memory fence value to be written after 
>>>>>>>bind completion */
>>>>>>>+    __u64 val;
>>>>>>>+
>>>>>>>+    /**
>>>>>>>+     * @flags: Supported flags are,
>>>>>>>+     *
>>>>>>>+     * I915_VM_BIND_USER_FENCE_WAIT:
>>>>>>>+     * Wait for the input fence before binding/unbinding
>>>>>>>+     *
>>>>>>>+     * I915_VM_BIND_USER_FENCE_SIGNAL:
>>>>>>>+     * Return bind/unbind completion fence as output
>>>>>>>+     */
>>>>>>>+    __u32 flags;
>>>>>>>+#define I915_VM_BIND_USER_FENCE_WAIT            (1<<0)
>>>>>>>+#define I915_VM_BIND_USER_FENCE_SIGNAL          (1<<1)
>>>>>>>+#define __I915_VM_BIND_USER_FENCE_UNKNOWN_FLAGS \
>>>>>>>+    (-(I915_VM_BIND_USER_FENCE_SIGNAL << 1))
>>>>>>>+};
>>>>>>>+
>>>>>>>+/**
>>>>>>>+ * struct drm_i915_vm_bind_ext_user_fence - User/memory 
>>>>>>>fences for vm_bind
>>>>>>>+ * and vm_unbind.
>>>>>>>+ *
>>>>>>>+ * These user fences can be input or output fences
>>>>>>>+ * (See struct drm_i915_vm_bind_user_fence).
>>>>>>>+ */
>>>>>>>+struct drm_i915_vm_bind_ext_user_fence {
>>>>>>>+#define I915_VM_BIND_EXT_USER_FENCES    1
>>>>>>>+    /** @base: Extension link. See struct i915_user_extension. */
>>>>>>>+    struct i915_user_extension base;
>>>>>>>+
>>>>>>>+    /** @fence_count: Number of elements in the 
>>>>>>>@user_fence_ptr array. */
>>>>>>>+    __u64 fence_count;
>>>>>>>+
>>>>>>>+    /**
>>>>>>>+     * @user_fence_ptr: Pointer to an array of
>>>>>>>+     * struct drm_i915_vm_bind_user_fence of length @fence_count.
>>>>>>>+     */
>>>>>>>+    __u64 user_fence_ptr;
>>>>>>>+};
>>>>>>>+
>>>>>>>+/**
>>>>>>>+ * struct drm_i915_gem_execbuffer_ext_batch_addresses - 
>>>>>>>Array of batch buffer
>>>>>>>+ * gpu virtual addresses.
>>>>>>>+ *
>>>>>>>+ * In the execbuff ioctl (See struct 
>>>>>>>drm_i915_gem_execbuffer2), this extension
>>>>>>>+ * must always be appended in the VM_BIND mode and it 
>>>>>>>will be an error to
>>>>>>>+ * append this extension in older non-VM_BIND mode.
>>>>>>>+ */
>>>>>>>+struct drm_i915_gem_execbuffer_ext_batch_addresses {
>>>>>>>+#define DRM_I915_GEM_EXECBUFFER_EXT_BATCH_ADDRESSES    1
>>>>>>>+    /** @base: Extension link. See struct i915_user_extension. */
>>>>>>>+    struct i915_user_extension base;
>>>>>>>+
>>>>>>>+    /** @count: Number of addresses in the addr array. */
>>>>>>>+    __u32 count;
>>>>>>>+
>>>>>>>+    /** @addr: An array of batch gpu virtual addresses. */
>>>>>>>+    __u64 addr[0];
>>>>>>>+};
>>>>>>>+
>>>>>>>+/**
>>>>>>>+ * struct drm_i915_gem_execbuffer_ext_user_fence - First 
>>>>>>>level batch completion
>>>>>>>+ * signaling extension.
>>>>>>>+ *
>>>>>>>+ * This extension allows the user to attach a user fence 
>>>>>>>(@addr, @value pair) to an
>>>>>>>+ * execbuf, to be signaled by the command streamer after 
>>>>>>>the completion of the first
>>>>>>>+ * level batch, by writing @value at the specified @addr 
>>>>>>>and triggering an
>>>>>>>+ * interrupt.
>>>>>>>+ * The user can either poll for this user fence to signal or 
>>>>>>>wait on it
>>>>>>>+ * with the i915_gem_wait_user_fence ioctl.
>>>>>>>+ * This is very useful for long running contexts 
>>>>>>>where waiting on a dma-fence
>>>>>>>+ * by the user (like the i915_gem_wait ioctl) is not supported.
>>>>>>>+ */
>>>>>>>+struct drm_i915_gem_execbuffer_ext_user_fence {
>>>>>>>+#define DRM_I915_GEM_EXECBUFFER_EXT_USER_FENCE        2
>>>>>>>+    /** @base: Extension link. See struct i915_user_extension. */
>>>>>>>+    struct i915_user_extension base;
>>>>>>>+
>>>>>>>+    /**
>>>>>>>+     * @addr: User/Memory fence qword aligned GPU virtual address.
>>>>>>>+     *
>>>>>>>+     * Address has to be a valid GPU virtual address at the time of
>>>>>>>+     * first level batch completion.
>>>>>>>+     */
>>>>>>>+    __u64 addr;
>>>>>>>+
>>>>>>>+    /**
>>>>>>>+     * @value: User/Memory fence Value to be written to 
>>>>>>>above address
>>>>>>>+     * after first level batch completes.
>>>>>>>+     */
>>>>>>>+    __u64 value;
>>>>>>>+
>>>>>>>+    /** @rsvd: Reserved for future extensions, MBZ */
>>>>>>>+    __u64 rsvd;
>>>>>>>+};
>>>>>>>+
>>>>>>>+/**
>>>>>>>+ * struct drm_i915_gem_create_ext_vm_private - Extension 
>>>>>>>to make the object
>>>>>>>+ * private to the specified VM.
>>>>>>>+ *
>>>>>>>+ * See struct drm_i915_gem_create_ext.
>>>>>>>+ */
>>>>>>>+struct drm_i915_gem_create_ext_vm_private {
>>>>>>>+#define I915_GEM_CREATE_EXT_VM_PRIVATE        2
>>>>>>>+    /** @base: Extension link. See struct i915_user_extension. */
>>>>>>>+    struct i915_user_extension base;
>>>>>>>+
>>>>>>>+    /** @vm_id: Id of the VM to which the object is private */
>>>>>>>+    __u32 vm_id;
>>>>>>>+};
>>>>>>>+
>>>>>>>+/**
>>>>>>>+ * struct drm_i915_gem_wait_user_fence - Wait on user/memory fence.
>>>>>>>+ *
>>>>>>>+ * User/Memory fence can be woken up either by:
>>>>>>>+ *
>>>>>>>+ * 1. GPU context indicated by @ctx_id, or,
>>>>>>>+ * 2. Kernel driver async worker upon I915_UFENCE_WAIT_SOFT.
>>>>>>>+ *    @ctx_id is ignored when this flag is set.
>>>>>>>+ *
>>>>>>>+ * Wakeup condition is,
>>>>>>>+ * ``((*addr & mask) op (value & mask))``
>>>>>>>+ *
>>>>>>>+ * See :ref:`Documentation/driver-api/dma-buf.rst 
>>>>>>><indefinite_dma_fences>`
>>>>>>>+ */
>>>>>>>+struct drm_i915_gem_wait_user_fence {
>>>>>>>+    /** @extensions: Zero-terminated chain of extensions. */
>>>>>>>+    __u64 extensions;
>>>>>>>+
>>>>>>>+    /** @addr: User/Memory fence address */
>>>>>>>+    __u64 addr;
>>>>>>>+
>>>>>>>+    /** @ctx_id: Id of the Context which will signal the fence. */
>>>>>>>+    __u32 ctx_id;
>>>>>>>+
>>>>>>>+    /** @op: Wakeup condition operator */
>>>>>>>+    __u16 op;
>>>>>>>+#define I915_UFENCE_WAIT_EQ      0
>>>>>>>+#define I915_UFENCE_WAIT_NEQ     1
>>>>>>>+#define I915_UFENCE_WAIT_GT      2
>>>>>>>+#define I915_UFENCE_WAIT_GTE     3
>>>>>>>+#define I915_UFENCE_WAIT_LT      4
>>>>>>>+#define I915_UFENCE_WAIT_LTE     5
>>>>>>>+#define I915_UFENCE_WAIT_BEFORE  6
>>>>>>>+#define I915_UFENCE_WAIT_AFTER   7
>>>>>>>+
>>>>>>>+    /**
>>>>>>>+     * @flags: Supported flags are,
>>>>>>>+     *
>>>>>>>+     * I915_UFENCE_WAIT_SOFT:
>>>>>>>+     *
>>>>>>>+     * To be woken up by i915 driver async worker (not by GPU).
>>>>>>>+     *
>>>>>>>+     * I915_UFENCE_WAIT_ABSTIME:
>>>>>>>+     *
>>>>>>>+     * Wait timeout specified as absolute time.
>>>>>>>+     */
>>>>>>>+    __u16 flags;
>>>>>>>+#define I915_UFENCE_WAIT_SOFT    0x1
>>>>>>>+#define I915_UFENCE_WAIT_ABSTIME 0x2
>>>>>>>+
>>>>>>>+    /** @value: Wakeup value */
>>>>>>>+    __u64 value;
>>>>>>>+
>>>>>>>+    /** @mask: Wakeup mask */
>>>>>>>+    __u64 mask;
>>>>>>>+#define I915_UFENCE_WAIT_U8     0xffu
>>>>>>>+#define I915_UFENCE_WAIT_U16    0xffffu
>>>>>>>+#define I915_UFENCE_WAIT_U32    0xfffffffful
>>>>>>>+#define I915_UFENCE_WAIT_U64    0xffffffffffffffffull
>>>>>>>+
>>>>>>>+    /**
>>>>>>>+     * @timeout: Wait timeout in nanoseconds.
>>>>>>>+     *
>>>>>>>+     * If the I915_UFENCE_WAIT_ABSTIME flag is set, then the 
>>>>>>>timeout is the
>>>>>>>+     * absolute time in nsec.
>>>>>>>+     */
>>>>>>>+    __s64 timeout;
>>>>>>>+};

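Putting the pieces of the quoted header together: attaching a single binary
syncobj out-fence to a vm_bind call via the timeline-fences extension could
be sketched as follows. Again this assumes the RFC-only header is available
as "i915_vm_bind.h" (with the extension name case-corrected as above); error
handling is omitted and the function name is made up.

#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include "i915_vm_bind.h"   /* the RFC-only header quoted above */

/* Bind `handle` at `va` and get a binary syncobj out-fence back. */
static int vm_bind_with_out_fence(int drm_fd, uint32_t vm_id, uint32_t handle,
                                  uint64_t va, uint64_t length,
                                  uint32_t syncobj_handle)
{
    uint64_t point = 0;  /* 0 => binary drm_syncobj, per the kernel-doc */
    struct drm_i915_vm_bind_fence fence;
    struct drm_i915_vm_bind_ext_timeline_fences ext;
    struct drm_i915_gem_vm_bind bind;

    memset(&fence, 0, sizeof(fence));
    fence.handle = syncobj_handle;
    fence.flags = I915_VM_BIND_FENCE_SIGNAL;   /* out-fence only */

    memset(&ext, 0, sizeof(ext));              /* next_extension = 0 ends chain */
    ext.base.name = I915_VM_BIND_EXT_TIMELINE_FENCES;
    ext.fence_count = 1;
    ext.handles_ptr = (uintptr_t)&fence;
    ext.values_ptr = (uintptr_t)&point;

    memset(&bind, 0, sizeof(bind));
    bind.vm_id = vm_id;
    bind.handle = handle;
    bind.start = va;        /* offset 0: bind the object from its start */
    bind.length = length;
    bind.extensions = (uintptr_t)&ext;

    return ioctl(drm_fd, DRM_IOCTL_I915_GEM_VM_BIND, &bind);
}
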
^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [Intel-gfx] [RFC v3 3/3] drm/doc/rfc: VM_BIND uapi definition
@ 2022-06-09 18:53                 ` Niranjana Vishwanathapura
  0 siblings, 0 replies; 121+ messages in thread
From: Niranjana Vishwanathapura @ 2022-06-09 18:53 UTC (permalink / raw)
  To: Matthew Auld
  Cc: intel-gfx, chris.p.wilson, thomas.hellstrom, dri-devel,
	daniel.vetter, christian.koenig

On Thu, Jun 09, 2022 at 09:36:48AM +0100, Matthew Auld wrote:
>On 08/06/2022 22:32, Niranjana Vishwanathapura wrote:
>>On Wed, Jun 08, 2022 at 10:12:05AM +0100, Matthew Auld wrote:
>>>On 08/06/2022 08:17, Tvrtko Ursulin wrote:
>>>>
>>>>On 07/06/2022 20:37, Niranjana Vishwanathapura wrote:
>>>>>On Tue, Jun 07, 2022 at 11:27:14AM +0100, Tvrtko Ursulin wrote:
>>>>>>
>>>>>>On 17/05/2022 19:32, Niranjana Vishwanathapura wrote:
>>>>>>>VM_BIND and related uapi definitions
>>>>>>>
>>>>>>>v2: Ensure proper kernel-doc formatting with cross references.
>>>>>>>    Also add new uapi and documentation as per review comments
>>>>>>>    from Daniel.
>>>>>>>
>>>>>>>Signed-off-by: Niranjana Vishwanathapura 
>>>>>>><niranjana.vishwanathapura@intel.com>
>>>>>>>---
>>>>>>> Documentation/gpu/rfc/i915_vm_bind.h | 399 
>>>>>>>+++++++++++++++++++++++++++
>>>>>>> 1 file changed, 399 insertions(+)
>>>>>>> create mode 100644 Documentation/gpu/rfc/i915_vm_bind.h
>>>>>>>
>>>>>>>diff --git a/Documentation/gpu/rfc/i915_vm_bind.h 
>>>>>>>b/Documentation/gpu/rfc/i915_vm_bind.h
>>>>>>>new file mode 100644
>>>>>>>index 000000000000..589c0a009107
>>>>>>>--- /dev/null
>>>>>>>+++ b/Documentation/gpu/rfc/i915_vm_bind.h
>>>>>>>@@ -0,0 +1,399 @@
>>>>>>>+/* SPDX-License-Identifier: MIT */
>>>>>>>+/*
>>>>>>>+ * Copyright © 2022 Intel Corporation
>>>>>>>+ */
>>>>>>>+
>>>>>>>+/**
>>>>>>>+ * DOC: I915_PARAM_HAS_VM_BIND
>>>>>>>+ *
>>>>>>>+ * VM_BIND feature availability.
>>>>>>>+ * See typedef drm_i915_getparam_t param.
>>>>>>>+ */
>>>>>>>+#define I915_PARAM_HAS_VM_BIND        57
>>>>>>>+
>>>>>>>+/**
>>>>>>>+ * DOC: I915_VM_CREATE_FLAGS_USE_VM_BIND
>>>>>>>+ *
>>>>>>>+ * Flag to opt-in for VM_BIND mode of binding during VM creation.
>>>>>>>+ * See struct drm_i915_gem_vm_control flags.
>>>>>>>+ *
>>>>>>>+ * A VM in VM_BIND mode will not support the older 
>>>>>>>execbuff mode of binding.
>>>>>>>+ * In VM_BIND mode, execbuff ioctl will not accept any 
>>>>>>>execlist (ie., the
>>>>>>>+ * &drm_i915_gem_execbuffer2.buffer_count must be 0).
>>>>>>>+ * Also, &drm_i915_gem_execbuffer2.batch_start_offset and
>>>>>>>+ * &drm_i915_gem_execbuffer2.batch_len must be 0.
>>>>>>>+ * DRM_I915_GEM_EXECBUFFER_EXT_BATCH_ADDRESSES extension 
>>>>>>>must be provided
>>>>>>>+ * to pass in the batch buffer addresses.
>>>>>>>+ *
>>>>>>>+ * Additionally, I915_EXEC_NO_RELOC, I915_EXEC_HANDLE_LUT and
>>>>>>>+ * I915_EXEC_BATCH_FIRST of 
>>>>>>>&drm_i915_gem_execbuffer2.flags must be 0
>>>>>>>+ * (not used) in VM_BIND mode. I915_EXEC_USE_EXTENSIONS 
>>>>>>>flag must always be
>>>>>>>+ * set (See struct drm_i915_gem_execbuffer_ext_batch_addresses).
>>>>>>>+ * The buffers_ptr, buffer_count, batch_start_offset and 
>>>>>>>batch_len fields
>>>>>>>+ * of struct drm_i915_gem_execbuffer2 are also not used 
>>>>>>>and must be 0.
>>>>>>>+ */
>>>>>>>+#define I915_VM_CREATE_FLAGS_USE_VM_BIND    (1 << 0)
>>>>>>>+
>>>>>>>+/**
>>>>>>>+ * DOC: I915_CONTEXT_CREATE_FLAGS_LONG_RUNNING
>>>>>>>+ *
>>>>>>>+ * Flag to declare context as long running.
>>>>>>>+ * See struct drm_i915_gem_context_create_ext flags.
>>>>>>>+ *
>>>>>>>+ * Usage of dma-fence expects that they complete in 
>>>>>>>reasonable amount of time.
>>>>>>>+ * Compute on the other hand can be long running. Hence 
>>>>>>>it is not appropriate
>>>>>>>+ * for compute contexts to export request completion 
>>>>>>>dma-fence to user.
>>>>>>>+ * The dma-fence usage will be limited to in-kernel 
>>>>>>>consumption only.
>>>>>>>+ * Compute contexts need to use user/memory fence.
>>>>>>>+ *
>>>>>>>+ * So, long running contexts do not support output fences. Hence,
>>>>>>>+ * I915_EXEC_FENCE_OUT (See &drm_i915_gem_execbuffer2.flags and
>>>>>>>+ * I915_EXEC_FENCE_SIGNAL (See 
>>>>>>>&drm_i915_gem_exec_fence.flags) are expected
>>>>>>>+ * to be not used.
>>>>>>>+ *
>>>>>>>+ * DRM_I915_GEM_WAIT ioctl call is also not supported for 
>>>>>>>objects mapped
>>>>>>>+ * to long running contexts.
>>>>>>>+ */
>>>>>>>+#define I915_CONTEXT_CREATE_FLAGS_LONG_RUNNING   (1u << 2)
>>>>>>>+
>>>>>>>+/* VM_BIND related ioctls */
>>>>>>>+#define DRM_I915_GEM_VM_BIND        0x3d
>>>>>>>+#define DRM_I915_GEM_VM_UNBIND        0x3e
>>>>>>>+#define DRM_I915_GEM_WAIT_USER_FENCE    0x3f
>>>>>>>+
>>>>>>>+#define DRM_IOCTL_I915_GEM_VM_BIND 
>>>>>>>DRM_IOWR(DRM_COMMAND_BASE + DRM_I915_GEM_VM_BIND, struct 
>>>>>>>drm_i915_gem_vm_bind)
>>>>>>>+#define DRM_IOCTL_I915_GEM_VM_UNBIND 
>>>>>>>DRM_IOWR(DRM_COMMAND_BASE + DRM_I915_GEM_VM_UNBIND, struct 
>>>>>>>drm_i915_gem_vm_bind)
>>>>>>>+#define DRM_IOCTL_I915_GEM_WAIT_USER_FENCE 
>>>>>>>DRM_IOWR(DRM_COMMAND_BASE + DRM_I915_GEM_WAIT_USER_FENCE, 
>>>>>>>struct drm_i915_gem_wait_user_fence)
>>>>>>>+
>>>>>>>+/**
>>>>>>>+ * struct drm_i915_gem_vm_bind - VA to object mapping to bind.
>>>>>>>+ *
>>>>>>>+ * This structure is passed to VM_BIND ioctl and 
>>>>>>>specifies the mapping of GPU
>>>>>>>+ * virtual address (VA) range to the section of an object 
>>>>>>>that should be bound
>>>>>>>+ * in the device page table of the specified address space (VM).
>>>>>>>+ * The VA range specified must be unique (ie., not 
>>>>>>>currently bound) and can
>>>>>>>+ * be mapped to whole object or a section of the object 
>>>>>>>(partial binding).
>>>>>>>+ * Multiple VA mappings can be created to the same 
>>>>>>>section of the object
>>>>>>>+ * (aliasing).
>>>>>>>+ */
>>>>>>>+struct drm_i915_gem_vm_bind {
>>>>>>>+    /** @vm_id: VM (address space) id to bind */
>>>>>>>+    __u32 vm_id;
>>>>>>>+
>>>>>>>+    /** @handle: Object handle */
>>>>>>>+    __u32 handle;
>>>>>>>+
>>>>>>>+    /** @start: Virtual Address start to bind */
>>>>>>>+    __u64 start;
>>>>>>>+
>>>>>>>+    /** @offset: Offset in object to bind */
>>>>>>>+    __u64 offset;
>>>>>>>+
>>>>>>>+    /** @length: Length of mapping to bind */
>>>>>>>+    __u64 length;
>>>>>>
>>>>>>Does it support, or should it, equivalent of 
>>>>>>EXEC_OBJECT_PAD_TO_SIZE? Or if not userspace is expected to 
>>>>>>map the remainder of the space to a dummy object? In which 
>>>>>>case would there be any alignment/padding issues preventing 
>>>>>>the two bind to be placed next to each other?
>>>>>>
>>>>>>I ask because someone from the compute side asked me about a 
>>>>>>problem with their strategy of dealing with overfetch and I 
>>>>>>suggested pad to size.
>>>>>>
>>>>>
>>>>>Thanks Tvrtko,
>>>>>I think we shouldn't be needing it. As with VM_BIND VA assignment
>>>>>is completely pushed to userspace, no padding should be necessary
>>>>>once the 'start' and 'size' alignment conditions are met.
>>>>>
>>>>>I will add some documentation on alignment requirement here.
>>>>>Generally, 'start' and 'size' should be 4K aligned. But, I think
>>>>>when we have 64K lmem page sizes (dg2 and xehpsdv), they need to
>>>>>be 64K aligned.
>>>>
>>>>+ Matt
>>>>
>>>>Align to 64k is enough for all overfetch issues?
>>>>
>>>>Apparently compute has a situation where a buffer is received by 
>>>>one component and another has to apply more alignment to it, to 
>>>>deal with overfetch. Since they cannot grow the actual BO if 
>>>>they wanted to VM_BIND a scratch area on top? Or perhaps none of 
>>>>this is a problem on discrete and original BO should be 
>>>>correctly allocated to start with.
>>>>
>>>>Side question - what about the align to 2MiB mentioned in 
>>>>i915_vma_insert to avoid mixing 4k and 64k PTEs? That does not 
>>>>apply to discrete?
>>>
>>>Not sure about the overfetch thing, but yeah dg2 & xehpsdv both 
>>>require a minimum of 64K pages underneath for local memory, and 
>>>the BO size will also be rounded up accordingly. And yeah the 
>>>complication arises due to not being able to mix 4K + 64K GTT 
>>>pages within the same page-table (existed since even gen8). Note 
>>>that 4K here is what we typically get for system memory.
>>>
>>>Originally we had a memory coloring scheme to track the "color" of 
>>>each page-table, which basically ensures that userspace can't do 
>>>something nasty like mixing page sizes. The advantage of that 
>>>scheme is that we would only require 64K GTT alignment and no 
>>>extra padding, but is perhaps a little complex.
>>>
>>>The merged solution is just to align and pad (i.e vma->node.size 
>>>and not vma->size) out of the vma to 2M, which is dead simple 
>>>implementation wise, but does potentially waste some GTT space and 
>>>some of the local memory used for the actual page-table. For the 
>>>alignment the kernel just validates that the GTT address is 
>>>aligned to 2M in vma_insert(), and then for the padding it just 
>>>inflates it to 2M, if userspace hasn't already.
>>>
>>>See the kernel-doc for @size: https://dri.freedesktop.org/docs/drm/gpu/driver-uapi.html?#c.drm_i915_gem_create_ext
>>>
>>>
>>
>>Ok, those requirements (2M VA alignment) will apply to VM_BIND also.
>>This is unfortunate, but it is not something new enforced by VM_BIND.
>>Other option is to go with 64K alignment and in VM_BIND case, user
>>must ensure there is no mix-matching of 64K (lmem) and 4k (smem)
>>mappings in the same 2M range. But this is not VM_BIND specific
>>(will apply to soft-pinning in execbuf2 also).
>>
>I don't think we need any VA padding here as with VM_BIND the VA is
>>managed fully by the user. If we enforce the VA to be 2M aligned, it
>>will leave holes (if BOs are smaller than 2M), but nobody is going
>>to allocate anything from there.
>
>Note that we only apply the 2M alignment + padding for local memory 
>pages, for system memory we don't have/need such restrictions. The VA 
>padding then importantly prevents userspace from incorrectly (or 
>maliciously) inserting 4K system memory object in some page-table 
>operating in 64K GTT mode.
>

Thanks Matt.
I also synced offline with Matt a bit on this.
We don't need an explicit 'pad_to_size'. The i915 driver is implicitly
padding the size to a 2M boundary for LMEM BOs, which will apply to
VM_BIND also.
The remaining question is whether we enforce 2M VA alignment for
lmem BOs (just like the legacy execbuf path) on dg2 & xehpsdv, or go with
just 64K alignment but ensure there is no mixing of 4K and 64K
mappings in the same 2M range. I think we can go with the 2M alignment
requirement for VM_BIND also. So, no new requirements here for VM_BIND.
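To illustrate, a minimal userspace sketch of the alignment rule above
(the helper name and constants here are mine for illustration, not uapi):

    /* Round a VA up so that 64K (lmem) and 4K (smem) PTEs never end up
     * sharing a 2M-aligned range on dg2/xehpsdv.
     */
    #define VA_ALIGN_LMEM (2ull << 20)  /* 2 MiB */
    #define VA_ALIGN_SMEM (4ull << 10)  /* 4 KiB */

    static inline __u64 va_align(__u64 va, int is_lmem)
    {
            __u64 align = is_lmem ? VA_ALIGN_LMEM : VA_ALIGN_SMEM;

            return (va + align - 1) & ~(align - 1);
    }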

I will update the documentation.

Niranjana

>>
>>Niranjana
>>
>>>>
>>>>Regards,
>>>>
>>>>Tvrtko
>>>>
>>>>>
>>>>>Niranjana
>>>>>
>>>>>>Regards,
>>>>>>
>>>>>>Tvrtko
>>>>>>
>>>>>>>+
>>>>>>>+    /**
>>>>>>>+     * @flags: Supported flags are,
>>>>>>>+     *
>>>>>>>+     * I915_GEM_VM_BIND_READONLY:
>>>>>>>+     * Mapping is read-only.
>>>>>>>+     *
>>>>>>>+     * I915_GEM_VM_BIND_CAPTURE:
>>>>>>>+     * Capture this mapping in the dump upon GPU error.
>>>>>>>+     */
>>>>>>>+    __u64 flags;
>>>>>>>+#define I915_GEM_VM_BIND_READONLY    (1 << 0)
>>>>>>>+#define I915_GEM_VM_BIND_CAPTURE     (1 << 1)
>>>>>>>+
>>>>>>>+    /** @extensions: 0-terminated chain of extensions for 
>>>>>>>this mapping. */
>>>>>>>+    __u64 extensions;
>>>>>>>+};
>>>>>>>+
>>>>>>>+/**
>>>>>>>+ * struct drm_i915_gem_vm_unbind - VA to object mapping to unbind.
>>>>>>>+ *
>>>>>>>+ * This structure is passed to VM_UNBIND ioctl and 
>>>>>>>specifies the GPU virtual
>>>>>>>+ * address (VA) range that should be unbound from the 
>>>>>>>device page table of the
>>>>>>>+ * specified address space (VM). The specified VA range 
>>>>>>>must match one of the
>>>>>>>+ * mappings created with the VM_BIND ioctl. TLB is 
>>>>>>>flushed upon unbind
>>>>>>>+ * completion.
>>>>>>>+ */
>>>>>>>+struct drm_i915_gem_vm_unbind {
>>>>>>>+    /** @vm_id: VM (address space) id to bind */
>>>>>>>+    __u32 vm_id;
>>>>>>>+
>>>>>>>+    /** @rsvd: Reserved for future use; must be zero. */
>>>>>>>+    __u32 rsvd;
>>>>>>>+
>>>>>>>+    /** @start: Virtual Address start to unbind */
>>>>>>>+    __u64 start;
>>>>>>>+
>>>>>>>+    /** @length: Length of mapping to unbind */
>>>>>>>+    __u64 length;
>>>>>>>+
>>>>>>>+    /** @flags: reserved for future usage, currently MBZ */
>>>>>>>+    __u64 flags;
>>>>>>>+
>>>>>>>+    /** @extensions: 0-terminated chain of extensions for 
>>>>>>>this mapping. */
>>>>>>>+    __u64 extensions;
>>>>>>>+};
>>>>>>>+
>>>>>>>+/**
>>>>>>>+ * struct drm_i915_vm_bind_fence - An input or output 
>>>>>>>fence for the vm_bind
>>>>>>>+ * or the vm_unbind work.
>>>>>>>+ *
>>>>>>+ * The vm_bind or vm_unbind async worker will wait for 
>>>>>>>the input fence to signal
>>>>>>>+ * before starting the binding or unbinding.
>>>>>>>+ *
>>>>>>>+ * The vm_bind or vm_unbind async worker will signal the 
>>>>>>>returned output fence
>>>>>>>+ * after the completion of binding or unbinding.
>>>>>>>+ */
>>>>>>>+struct drm_i915_vm_bind_fence {
>>>>>>>+    /** @handle: User's handle for a drm_syncobj to wait 
>>>>>>>on or signal. */
>>>>>>>+    __u32 handle;
>>>>>>>+
>>>>>>>+    /**
>>>>>>>+     * @flags: Supported flags are,
>>>>>>>+     *
>>>>>>>+     * I915_VM_BIND_FENCE_WAIT:
>>>>>>>+     * Wait for the input fence before binding/unbinding
>>>>>>>+     *
>>>>>>>+     * I915_VM_BIND_FENCE_SIGNAL:
>>>>>>>+     * Return bind/unbind completion fence as output
>>>>>>>+     */
>>>>>>>+    __u32 flags;
>>>>>>>+#define I915_VM_BIND_FENCE_WAIT            (1<<0)
>>>>>>>+#define I915_VM_BIND_FENCE_SIGNAL          (1<<1)
>>>>>>>+#define __I915_VM_BIND_FENCE_UNKNOWN_FLAGS 
>>>>>>>(-(I915_VM_BIND_FENCE_SIGNAL << 1))
>>>>>>>+};
>>>>>>>+
>>>>>>>+/**
>>>>>>>+ * struct drm_i915_vm_bind_ext_timeline_fences - Timeline 
>>>>>>>fences for vm_bind
>>>>>>>+ * and vm_unbind.
>>>>>>>+ *
>>>>>>>+ * This structure describes an array of timeline 
>>>>>>>drm_syncobj and associated
>>>>>>>+ * points for timeline variants of drm_syncobj. These 
>>>>>>>timeline 'drm_syncobj's
>>>>>>>+ * can be input or output fences (See struct 
>>>>>>>drm_i915_vm_bind_fence).
>>>>>>>+ */
>>>>>>>+struct drm_i915_vm_bind_ext_timeline_fences {
>>>>>>>+#define I915_VM_BIND_EXT_timeline_FENCES    0
>>>>>>>+    /** @base: Extension link. See struct i915_user_extension. */
>>>>>>>+    struct i915_user_extension base;
>>>>>>>+
>>>>>>>+    /**
>>>>>>>+     * @fence_count: Number of elements in the 
>>>>>>>@handles_ptr & @value_ptr
>>>>>>>+     * arrays.
>>>>>>>+     */
>>>>>>>+    __u64 fence_count;
>>>>>>>+
>>>>>>>+    /**
>>>>>>>+     * @handles_ptr: Pointer to an array of struct 
>>>>>>>drm_i915_vm_bind_fence
>>>>>>>+     * of length @fence_count.
>>>>>>>+     */
>>>>>>>+    __u64 handles_ptr;
>>>>>>>+
>>>>>>>+    /**
>>>>>>>+     * @values_ptr: Pointer to an array of u64 values of length
>>>>>>>+     * @fence_count.
>>>>>>+     * Values must be 0 for a binary drm_syncobj. A value of 0 for a
>>>>>>>+     * timeline drm_syncobj is invalid as it turns a 
>>>>>>>drm_syncobj into a
>>>>>>>+     * binary one.
>>>>>>>+     */
>>>>>>>+    __u64 values_ptr;
>>>>>>>+};
>>>>>>>+
>>>>>>>+/**
>>>>>>>+ * struct drm_i915_vm_bind_user_fence - An input or 
>>>>>>>output user fence for the
>>>>>>>+ * vm_bind or the vm_unbind work.
>>>>>>>+ *
>>>>>>+ * The vm_bind or vm_unbind async worker will wait for the 
>>>>>>>input fence (value at
>>>>>>>+ * @addr to become equal to @val) before starting the 
>>>>>>>binding or unbinding.
>>>>>>>+ *
>>>>>>>+ * The vm_bind or vm_unbind async worker will signal the 
>>>>>>>output fence after
>>>>>>>+ * the completion of binding or unbinding by writing @val 
>>>>>>>to memory location at
>>>>>>>+ * @addr
>>>>>>>+ */
>>>>>>>+struct drm_i915_vm_bind_user_fence {
>>>>>>>+    /** @addr: User/Memory fence qword aligned process 
>>>>>>>virtual address */
>>>>>>>+    __u64 addr;
>>>>>>>+
>>>>>>>+    /** @val: User/Memory fence value to be written after 
>>>>>>>bind completion */
>>>>>>>+    __u64 val;
>>>>>>>+
>>>>>>>+    /**
>>>>>>>+     * @flags: Supported flags are,
>>>>>>>+     *
>>>>>>>+     * I915_VM_BIND_USER_FENCE_WAIT:
>>>>>>>+     * Wait for the input fence before binding/unbinding
>>>>>>>+     *
>>>>>>>+     * I915_VM_BIND_USER_FENCE_SIGNAL:
>>>>>>>+     * Return bind/unbind completion fence as output
>>>>>>>+     */
>>>>>>>+    __u32 flags;
>>>>>>>+#define I915_VM_BIND_USER_FENCE_WAIT            (1<<0)
>>>>>>>+#define I915_VM_BIND_USER_FENCE_SIGNAL          (1<<1)
>>>>>>>+#define __I915_VM_BIND_USER_FENCE_UNKNOWN_FLAGS \
>>>>>>>+    (-(I915_VM_BIND_USER_FENCE_SIGNAL << 1))
>>>>>>>+};
>>>>>>>+
>>>>>>>+/**
>>>>>>>+ * struct drm_i915_vm_bind_ext_user_fence - User/memory 
>>>>>>>fences for vm_bind
>>>>>>>+ * and vm_unbind.
>>>>>>>+ *
>>>>>>>+ * These user fences can be input or output fences
>>>>>>>+ * (See struct drm_i915_vm_bind_user_fence).
>>>>>>>+ */
>>>>>>>+struct drm_i915_vm_bind_ext_user_fence {
>>>>>>>+#define I915_VM_BIND_EXT_USER_FENCES    1
>>>>>>>+    /** @base: Extension link. See struct i915_user_extension. */
>>>>>>>+    struct i915_user_extension base;
>>>>>>>+
>>>>>>>+    /** @fence_count: Number of elements in the 
>>>>>>>@user_fence_ptr array. */
>>>>>>>+    __u64 fence_count;
>>>>>>>+
>>>>>>>+    /**
>>>>>>>+     * @user_fence_ptr: Pointer to an array of
>>>>>>>+     * struct drm_i915_vm_bind_user_fence of length @fence_count.
>>>>>>>+     */
>>>>>>>+    __u64 user_fence_ptr;
>>>>>>>+};
>>>>>>>+
>>>>>>>+/**
>>>>>>>+ * struct drm_i915_gem_execbuffer_ext_batch_addresses - 
>>>>>>>Array of batch buffer
>>>>>>>+ * gpu virtual addresses.
>>>>>>>+ *
>>>>>>>+ * In the execbuff ioctl (See struct 
>>>>>>>drm_i915_gem_execbuffer2), this extension
>>>>>>>+ * must always be appended in the VM_BIND mode and it 
>>>>>>>will be an error to
>>>>>>>+ * append this extension in older non-VM_BIND mode.
>>>>>>>+ */
>>>>>>>+struct drm_i915_gem_execbuffer_ext_batch_addresses {
>>>>>>>+#define DRM_I915_GEM_EXECBUFFER_EXT_BATCH_ADDRESSES    1
>>>>>>>+    /** @base: Extension link. See struct i915_user_extension. */
>>>>>>>+    struct i915_user_extension base;
>>>>>>>+
>>>>>>>+    /** @count: Number of addresses in the addr array. */
>>>>>>>+    __u32 count;
>>>>>>>+
>>>>>>>+    /** @addr: An array of batch gpu virtual addresses. */
>>>>>>>+    __u64 addr[0];
>>>>>>>+};
>>>>>>>+
>>>>>>>+/**
>>>>>>>+ * struct drm_i915_gem_execbuffer_ext_user_fence - First 
>>>>>>>level batch completion
>>>>>>>+ * signaling extension.
>>>>>>>+ *
>>>>>>>+ * This extension allows user to attach a user fence 
>>>>>>>(@addr, @value pair) to an
>>>>>>>+ * execbuf to be signaled by the command streamer after 
>>>>>>>the completion of first
>>>>>>>+ * level batch, by writing the @value at specified @addr 
>>>>>>>and triggering an
>>>>>>>+ * interrupt.
>>>>>>>+ * User can either poll for this user fence to signal or 
>>>>>>>can also wait on it
>>>>>>>+ * with i915_gem_wait_user_fence ioctl.
>>>>>>+ * This is very useful for long-running contexts 
>>>>>>>where waiting on dma-fence
>>>>>>>+ * by user (like i915_gem_wait ioctl) is not supported.
>>>>>>>+ */
>>>>>>>+struct drm_i915_gem_execbuffer_ext_user_fence {
>>>>>>>+#define DRM_I915_GEM_EXECBUFFER_EXT_USER_FENCE        2
>>>>>>>+    /** @base: Extension link. See struct i915_user_extension. */
>>>>>>>+    struct i915_user_extension base;
>>>>>>>+
>>>>>>>+    /**
>>>>>>>+     * @addr: User/Memory fence qword aligned GPU virtual address.
>>>>>>>+     *
>>>>>>>+     * Address has to be a valid GPU virtual address at the time of
>>>>>>>+     * first level batch completion.
>>>>>>>+     */
>>>>>>>+    __u64 addr;
>>>>>>>+
>>>>>>>+    /**
>>>>>>>+     * @value: User/Memory fence Value to be written to 
>>>>>>>above address
>>>>>>>+     * after first level batch completes.
>>>>>>>+     */
>>>>>>>+    __u64 value;
>>>>>>>+
>>>>>>>+    /** @rsvd: Reserved for future extensions, MBZ */
>>>>>>>+    __u64 rsvd;
>>>>>>>+};
>>>>>>>+
>>>>>>>+/**
>>>>>>>+ * struct drm_i915_gem_create_ext_vm_private - Extension 
>>>>>>>to make the object
>>>>>>>+ * private to the specified VM.
>>>>>>>+ *
>>>>>>>+ * See struct drm_i915_gem_create_ext.
>>>>>>>+ */
>>>>>>>+struct drm_i915_gem_create_ext_vm_private {
>>>>>>>+#define I915_GEM_CREATE_EXT_VM_PRIVATE        2
>>>>>>>+    /** @base: Extension link. See struct i915_user_extension. */
>>>>>>>+    struct i915_user_extension base;
>>>>>>>+
>>>>>>>+    /** @vm_id: Id of the VM to which the object is private */
>>>>>>>+    __u32 vm_id;
>>>>>>>+};
>>>>>>>+
>>>>>>>+/**
>>>>>>>+ * struct drm_i915_gem_wait_user_fence - Wait on user/memory fence.
>>>>>>>+ *
>>>>>>>+ * User/Memory fence can be woken up either by:
>>>>>>>+ *
>>>>>>>+ * 1. GPU context indicated by @ctx_id, or,
>>>>>>+ * 2. Kernel driver async worker upon I915_UFENCE_WAIT_SOFT.
>>>>>>>+ *    @ctx_id is ignored when this flag is set.
>>>>>>>+ *
>>>>>>>+ * Wakeup condition is,
>>>>>>>+ * ``((*addr & mask) op (value & mask))``
>>>>>>>+ *
>>>>>>>+ * See :ref:`Documentation/driver-api/dma-buf.rst 
>>>>>>><indefinite_dma_fences>`
>>>>>>>+ */
>>>>>>>+struct drm_i915_gem_wait_user_fence {
>>>>>>>+    /** @extensions: Zero-terminated chain of extensions. */
>>>>>>>+    __u64 extensions;
>>>>>>>+
>>>>>>>+    /** @addr: User/Memory fence address */
>>>>>>>+    __u64 addr;
>>>>>>>+
>>>>>>>+    /** @ctx_id: Id of the Context which will signal the fence. */
>>>>>>>+    __u32 ctx_id;
>>>>>>>+
>>>>>>>+    /** @op: Wakeup condition operator */
>>>>>>>+    __u16 op;
>>>>>>>+#define I915_UFENCE_WAIT_EQ      0
>>>>>>>+#define I915_UFENCE_WAIT_NEQ     1
>>>>>>>+#define I915_UFENCE_WAIT_GT      2
>>>>>>>+#define I915_UFENCE_WAIT_GTE     3
>>>>>>>+#define I915_UFENCE_WAIT_LT      4
>>>>>>>+#define I915_UFENCE_WAIT_LTE     5
>>>>>>>+#define I915_UFENCE_WAIT_BEFORE  6
>>>>>>>+#define I915_UFENCE_WAIT_AFTER   7
>>>>>>>+
>>>>>>>+    /**
>>>>>>>+     * @flags: Supported flags are,
>>>>>>>+     *
>>>>>>>+     * I915_UFENCE_WAIT_SOFT:
>>>>>>>+     *
>>>>>>>+     * To be woken up by i915 driver async worker (not by GPU).
>>>>>>>+     *
>>>>>>>+     * I915_UFENCE_WAIT_ABSTIME:
>>>>>>>+     *
>>>>>>>+     * Wait timeout specified as absolute time.
>>>>>>>+     */
>>>>>>>+    __u16 flags;
>>>>>>>+#define I915_UFENCE_WAIT_SOFT    0x1
>>>>>>>+#define I915_UFENCE_WAIT_ABSTIME 0x2
>>>>>>>+
>>>>>>>+    /** @value: Wakeup value */
>>>>>>>+    __u64 value;
>>>>>>>+
>>>>>>>+    /** @mask: Wakeup mask */
>>>>>>>+    __u64 mask;
>>>>>>>+#define I915_UFENCE_WAIT_U8     0xffu
>>>>>>>+#define I915_UFENCE_WAIT_U16    0xffffu
>>>>>>>+#define I915_UFENCE_WAIT_U32    0xfffffffful
>>>>>>>+#define I915_UFENCE_WAIT_U64    0xffffffffffffffffull
>>>>>>>+
>>>>>>>+    /**
>>>>>>>+     * @timeout: Wait timeout in nanoseconds.
>>>>>>>+     *
>>>>>>+     * If the I915_UFENCE_WAIT_ABSTIME flag is set, then the 
>>>>>>>timeout is the
>>>>>>>+     * absolute time in nsec.
>>>>>>>+     */
>>>>>>>+    __s64 timeout;
>>>>>>>+};

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [Intel-gfx] [RFC v3 1/3] drm/doc/rfc: VM_BIND feature design document
  2022-06-09 14:49                           ` Lionel Landwerlin
@ 2022-06-09 19:31                               ` Niranjana Vishwanathapura
  0 siblings, 0 replies; 121+ messages in thread
From: Niranjana Vishwanathapura @ 2022-06-09 19:31 UTC (permalink / raw)
  To: Lionel Landwerlin
  Cc: Intel GFX, Maling list - DRI developers, Thomas Hellstrom,
	Chris Wilson, Jason Ekstrand, Daniel Vetter,
	Christian König

On Thu, Jun 09, 2022 at 05:49:09PM +0300, Lionel Landwerlin wrote:
>   On 09/06/2022 00:55, Jason Ekstrand wrote:
>
>     On Wed, Jun 8, 2022 at 4:44 PM Niranjana Vishwanathapura
>     <niranjana.vishwanathapura@intel.com> wrote:
>
>       On Wed, Jun 08, 2022 at 08:33:25AM +0100, Tvrtko Ursulin wrote:
>       >
>       >
>       >On 07/06/2022 22:32, Niranjana Vishwanathapura wrote:
>       >>On Tue, Jun 07, 2022 at 11:18:11AM -0700, Niranjana Vishwanathapura
>       wrote:
>       >>>On Tue, Jun 07, 2022 at 12:12:03PM -0500, Jason Ekstrand wrote:
>       >>>> On Fri, Jun 3, 2022 at 6:52 PM Niranjana Vishwanathapura
>       >>>> <niranjana.vishwanathapura@intel.com> wrote:
>       >>>>
>       >>>>   On Fri, Jun 03, 2022 at 10:20:25AM +0300, Lionel Landwerlin
>       wrote:
>       >>>>   >   On 02/06/2022 23:35, Jason Ekstrand wrote:
>       >>>>   >
>       >>>>   >     On Thu, Jun 2, 2022 at 3:11 PM Niranjana Vishwanathapura
>       >>>>   >     <niranjana.vishwanathapura@intel.com> wrote:
>       >>>>   >
>       >>>>   >       On Wed, Jun 01, 2022 at 01:28:36PM -0700, Matthew
>       >>>>Brost wrote:
>       >>>>   >       >On Wed, Jun 01, 2022 at 05:25:49PM +0300, Lionel
>       Landwerlin
>       >>>>   wrote:
>       >>>>   >       >> On 17/05/2022 21:32, Niranjana Vishwanathapura
>       wrote:
>       >>>>   >       >> > +VM_BIND/UNBIND ioctl will immediately start
>       >>>>   binding/unbinding
>       >>>>   >       the mapping in an
>       >>>>   >       >> > +async worker. The binding and unbinding will
>       >>>>work like a
>       >>>>   special
>       >>>>   >       GPU engine.
>       >>>>   >       >> > +The binding and unbinding operations are
>       serialized and
>       >>>>   will
>       >>>>   >       wait on specified
>       >>>>   >       >> > +input fences before the operation and will signal
>       the
>       >>>>   output
>       >>>>   >       fences upon the
>       >>>>   >       >> > +completion of the operation. Due to
>       serialization,
>       >>>>   completion of
>       >>>>   >       an operation
>       >>>>   >       >> > +will also indicate that all previous operations
>       >>>>are also
>       >>>>   >       complete.
>       >>>>   >       >>
>       >>>>   >       >> I guess we should avoid saying "will immediately
>       start
>       >>>>   >       binding/unbinding" if
>       >>>>   >       >> there are fences involved.
>       >>>>   >       >>
>       >>>>   >       >> And the fact that it's happening in an async
>       >>>>worker seem to
>       >>>>   imply
>       >>>>   >       it's not
>       >>>>   >       >> immediate.
>       >>>>   >       >>
>       >>>>   >
>       >>>>   >       Ok, will fix.
>       >>>>   >       This was added because in earlier design binding was
>       deferred
>       >>>>   until
>       >>>>   >       next execbuff.
>       >>>>   >       But now it is non-deferred (immediate in that sense).
>       >>>>But yah,
>       >>>>   this is
>       >>>>   >       confusing
>       >>>>   >       and will fix it.
>       >>>>   >
>       >>>>   >       >>
>       >>>>   >       >> I have a question on the behavior of the bind
>       >>>>operation when
>       >>>>   no
>       >>>>   >       input fence
       >>>>   >       >> is provided. Let's say I do:
>       >>>>   >       >>
>       >>>>   >       >> VM_BIND (out_fence=fence1)
>       >>>>   >       >>
>       >>>>   >       >> VM_BIND (out_fence=fence2)
>       >>>>   >       >>
>       >>>>   >       >> VM_BIND (out_fence=fence3)
>       >>>>   >       >>
>       >>>>   >       >>
>       >>>>   >       >> In what order are the fences going to be signaled?
>       >>>>   >       >>
>       >>>>   >       >> In the order of VM_BIND ioctls? Or out of order?
>       >>>>   >       >>
>       >>>>   >       >> Because you wrote "serialized I assume it's : in
>       order
>       >>>>   >       >>
>       >>>>   >
>       >>>>   >       Yes, in the order of VM_BIND/UNBIND ioctls. Note that
>       >>>>bind and
>       >>>>   unbind
>       >>>>   >       will use
>       >>>>   >       the same queue and hence are ordered.
>       >>>>   >
>       >>>>   >       >>
>       >>>>   >       >> One thing I didn't realize is that because we only
>       get one
>       >>>>   >       "VM_BIND" engine,
>       >>>>   >       >> there is a disconnect from the Vulkan specification.
>       >>>>   >       >>
>       >>>>   >       >> In Vulkan VM_BIND operations are serialized but
>       >>>>per engine.
>       >>>>   >       >>
>       >>>>   >       >> So you could have something like this :
>       >>>>   >       >>
>       >>>>   >       >> VM_BIND (engine=rcs0, in_fence=fence1,
>       out_fence=fence2)
>       >>>>   >       >>
>       >>>>   >       >> VM_BIND (engine=ccs0, in_fence=fence3,
>       out_fence=fence4)
>       >>>>   >       >>
>       >>>>   >       >>
>       >>>>   >       >> fence1 is not signaled
>       >>>>   >       >>
>       >>>>   >       >> fence3 is signaled
>       >>>>   >       >>
>       >>>>   >       >> So the second VM_BIND will proceed before the
>       >>>>first VM_BIND.
>       >>>>   >       >>
>       >>>>   >       >>
>       >>>>   >       >> I guess we can deal with that scenario in
>       >>>>userspace by doing
>       >>>>   the
>       >>>>   >       wait
       >>>>   >       >> ourselves in one thread per engine.
>       >>>>   >       >>
>       >>>>   >       >> But then it makes the VM_BIND input fences useless.
>       >>>>   >       >>
>       >>>>   >       >>
       >>>>   >       >> Daniel: what do you think? Should we rework this or
>       just
>       >>>>   deal with
>       >>>>   >       wait
>       >>>>   >       >> fences in userspace?
>       >>>>   >       >>
>       >>>>   >       >
>       >>>>   >       >My opinion is rework this but make the ordering via
>       >>>>an engine
>       >>>>   param
>       >>>>   >       optional.
>       >>>>   >       >
>       >>>>   >       >e.g. A VM can be configured so all binds are ordered
>       >>>>within the
>       >>>>   VM
>       >>>>   >       >
>       >>>>   >       >e.g. A VM can be configured so all binds accept an
>       engine
>       >>>>   argument
>       >>>>   >       (in
>       >>>>   >       >the case of the i915 likely this is a gem context
>       >>>>handle) and
>       >>>>   binds
>       >>>>   >       >ordered with respect to that engine.
>       >>>>   >       >
>       >>>>   >       >This gives UMDs options as the later likely consumes
>       >>>>more KMD
>       >>>>   >       resources
>       >>>>   >       >so if a different UMD can live with binds being
>       >>>>ordered within
>       >>>>   the VM
>       >>>>   >       >they can use a mode consuming less resources.
>       >>>>   >       >
>       >>>>   >
>       >>>>   >       I think we need to be careful here if we are looking
>       for some
>       >>>>   out of
>       >>>>   >       (submission) order completion of vm_bind/unbind.
>       >>>>   >       In-order completion means, in a batch of binds and
>       >>>>unbinds to be
>       >>>>   >       completed in-order, user only needs to specify
>       >>>>in-fence for the
       >>>>   >       first bind/unbind call and the out-fence for the last
>       >>>>   bind/unbind
>       >>>>   >       call. Also, the VA released by an unbind call can be
>       >>>>re-used by
>       >>>>   >       any subsequent bind call in that in-order batch.
>       >>>>   >
>       >>>>   >       These things will break if binding/unbinding were to
>       >>>>be allowed
>       >>>>   to
>       >>>>   >       go out of order (of submission) and user need to be
>       extra
>       >>>>   careful
       >>>>   >       not to run into premature triggering of out-fence and
>       bind
>       >>>>   failing
>       >>>>   >       as VA is still in use etc.
>       >>>>   >
>       >>>>   >       Also, VM_BIND binds the provided mapping on the
>       specified
>       >>>>   address
>       >>>>   >       space
>       >>>>   >       (VM). So, the uapi is not engine/context specific.
>       >>>>   >
>       >>>>   >       We can however add a 'queue' to the uapi which can be
>       >>>>one from
>       >>>>   the
>       >>>>   >       pre-defined queues,
>       >>>>   >       I915_VM_BIND_QUEUE_0
>       >>>>   >       I915_VM_BIND_QUEUE_1
>       >>>>   >       ...
>       >>>>   >       I915_VM_BIND_QUEUE_(N-1)
>       >>>>   >
>       >>>>   >       KMD will spawn an async work queue for each queue which
>       will
>       >>>>   only
>       >>>>   >       bind the mappings on that queue in the order of
>       submission.
>       >>>>   >       User can assign the queue to per engine or anything
>       >>>>like that.
>       >>>>   >
>       >>>>   >       But again here, user need to be careful and not
>       >>>>deadlock these
>       >>>>   >       queues with circular dependency of fences.
>       >>>>   >
       >>>>   >       I prefer adding this later as an extension based on
>       >>>>whether it
>       >>>>   >       is really helping with the implementation.
>       >>>>   >
>       >>>>   >     I can tell you right now that having everything on a
>       single
>       >>>>   in-order
>       >>>>   >     queue will not get us the perf we want.  What vulkan
>       >>>>really wants
>       >>>>   is one
>       >>>>   >     of two things:
>       >>>>   >      1. No implicit ordering of VM_BIND ops.  They just
>       happen in
>       >>>>   whatever
>       >>>>   >     their dependencies are resolved and we ensure ordering
>       >>>>ourselves
>       >>>>   by
>       >>>>   >     having a syncobj in the VkQueue.
>       >>>>   >      2. The ability to create multiple VM_BIND queues.  We
>       need at
>       >>>>   least 2
>       >>>>   >     but I don't see why there needs to be a limit besides
>       >>>>the limits
>       >>>>   the
>       >>>>   >     i915 API already has on the number of engines.  Vulkan
>       could
>       >>>>   expose
>       >>>>   >     multiple sparse binding queues to the client if it's not
>       >>>>   arbitrarily
>       >>>>   >     limited.
>       >>>>
>       >>>>   Thanks Jason, Lionel.
>       >>>>
>       >>>>   Jason, what are you referring to when you say "limits the i915
>       API
>       >>>>   already
>       >>>>   has on the number of engines"? I am not sure if there is such
>       an uapi
>       >>>>   today.
>       >>>>
>       >>>> There's a limit of something like 64 total engines today based on
>       the
>       >>>> number of bits we can cram into the exec flags in execbuffer2.  I
>       think
>       >>>> someone had an extended version that allowed more but I ripped it
>       out
>       >>>> because no one was using it.  Of course, execbuffer3 might not
>       >>>>have that
>       >>>> problem at all.
>       >>>>
>       >>>
>       >>>Thanks Jason.
>       >>>Ok, I am not sure which exec flag is that, but yah, execbuffer3
>       probably
       >>>will not have this limitation. So, we need to define a
>       VM_BIND_MAX_QUEUE
>       >>>and somehow export it to user (I am thinking of embedding it in
>       >>>I915_PARAM_HAS_VM_BIND. bits[0]->HAS_VM_BIND, bits[1-3]->'n'
>       meaning 2^n
>       >>>queues.
>       >>
       >>Ah, I think you are talking about I915_EXEC_RING_MASK (0x3f) which
>       execbuf3
>
>     Yup!  That's exactly the limit I was talking about.
>      
>
>       >>will also have. So, we can simply define in vm_bind/unbind
>       structures,
>       >>
>       >>#define I915_VM_BIND_MAX_QUEUE   64
>       >>        __u32 queue;
>       >>
>       >>I think that will keep things simple.
>       >
       >Hmmm? What does the execbuf2 limit have to do with how many engines
>       >the hardware can have? I suggest not doing that.
>       >
       >The change which added this:
>       >
>       >       if (set.num_engines > I915_EXEC_RING_MASK + 1)
>       >               return -EINVAL;
>       >
       >to context creation needs to be undone, so let users create engine
>       >maps with all hardware engines, and let execbuf3 access them all.
>       >
>
       The earlier plan was to carry I915_EXEC_RING_MASK (0x3f) to execbuf3 also.
>       Hence, I was using the same limit for VM_BIND queues (64, or 65 if we
>       make it N+1).
>       But, as discussed in another thread of this RFC series, we are planning
>       to drop this I915_EXEC_RING_MASK in execbuf3. So, there won't be
>       any uapi that limits the number of engines (and hence the number of
>       vm_bind queues that need to be supported).
>
>       If we leave the number of vm_bind queues to be arbitrarily large
>       (__u32 queue_idx) then, we need to have a hashmap for queue (a wq,
>       work_item and a linked list) lookup from the user specified queue
>       index.
>       Other option is to just put some hard limit (say 64 or 65) and use
>       an array of queues in VM (each created upon first use). I prefer this.
>
>     I don't get why a VM_BIND queue is any different from any other queue or
>     userspace-visible kernel object.  But I'll leave those details up to
>     danvet or whoever else might be reviewing the implementation.
>     --Jason
>
>   I kind of agree here. Wouldn't be simpler to have the bind queue created
>   like the others when we build the engine map?
>
>   For userspace it's then just matter of selecting the right queue ID when
>   submitting.
>
>   If there is ever a possibility to have this work on the GPU, it would be
>   all ready.
>

I did sync offline with Matt Brost on this.
We can add a VM_BIND engine class and let the user create VM_BIND engines (queues).
The problem is, in i915 the engine creation interface is bound to gem_context.
So, in the vm_bind ioctl, we would need both context_id and queue_idx for proper
lookup of the user-created engine. This is a bit awkward as vm_bind is an
interface to the VM (address space) and has nothing to do with gem_context.
Another problem is, if two VMs are binding with the same defined engine,
binding on VM1 can get unnecessarily blocked by binding on VM2 (which may be
waiting on its in_fence).

So, my preference here is to just add a 'u32 queue' index in the vm_bind/unbind
ioctls, with the queues being per VM.
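Roughly something like the sketch below (illustrative only; the field
names, placement and queue count are not final uapi):

    /* Sketch: select an in-order bind queue within the VM. */
    #define I915_VM_BIND_MAX_QUEUE  8  /* 2^n, n advertised via I915_PARAM_HAS_VM_BIND */

    struct drm_i915_gem_vm_bind {
            __u32 vm_id;      /* VM (address space) id to bind */
            __u32 queue_idx;  /* bind queue within the VM, < I915_VM_BIND_MAX_QUEUE */
            /* ... handle, start, offset, length, flags, extensions as before ... */
    };

Binds/unbinds submitted to the same queue would complete in submission
order; different queues would be independent of each other.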

Niranjana

>   Thanks,
>
>   -Lionel
>
>      
>
>       Niranjana
>
>       >Regards,
>       >
>       >Tvrtko
>       >
>       >>
>       >>Niranjana
>       >>
>       >>>
>       >>>>   I am trying to see how many queues we need and don't want it to
>       be
>       >>>>   arbitrarily
       >>>>   large and unduly blow up memory usage and complexity in the i915
>       driver.
>       >>>>
>       >>>> I expect a Vulkan driver to use at most 2 in the vast majority
>       >>>>of cases. I
>       >>>> could imagine a client wanting to create more than 1 sparse
>       >>>>queue in which
>       >>>> case, it'll be N+1 but that's unlikely.  As far as complexity
>       >>>>goes, once
>       >>>> you allow two, I don't think the complexity is going up by
>       >>>>allowing N.  As
>       >>>> for memory usage, creating more queues means more memory.  That's
>       a
>       >>>> trade-off that userspace can make.  Again, the expected number
>       >>>>here is 1
>       >>>> or 2 in the vast majority of cases so I don't think you need to
>       worry.
>       >>>
>       >>>Ok, will start with n=3 meaning 8 queues.
       >>>That would require us to create 8 workqueues.
>       >>>We can change 'n' later if required.
>       >>>
>       >>>Niranjana
>       >>>
>       >>>>
>       >>>>   >     Why?  Because Vulkan has two basic kind of bind
>       >>>>operations and we
>       >>>>   don't
>       >>>>   >     want any dependencies between them:
>       >>>>   >      1. Immediate.  These happen right after BO creation or
>       >>>>maybe as
>       >>>>   part of
>       >>>>   >     vkBindImageMemory() or VkBindBufferMemory().  These
>       >>>>don't happen
>       >>>>   on a
>       >>>>   >     queue and we don't want them serialized with anything. 
>       To
>       >>>>   synchronize
>       >>>>   >     with submit, we'll have a syncobj in the VkDevice which
>       is
>       >>>>   signaled by
>       >>>>   >     all immediate bind operations and make submits wait on
>       it.
>       >>>>   >      2. Queued (sparse): These happen on a VkQueue which may
>       be the
>       >>>>   same as
>       >>>>   >     a render/compute queue or may be its own queue.  It's up
>       to us
>       >>>>   what we
>       >>>>   >     want to advertise.  From the Vulkan API PoV, this is like
>       any
>       >>>>   other
>       >>>>   >     queue.  Operations on it wait on and signal semaphores. 
>       If we
>       >>>>   have a
>       >>>>   >     VM_BIND engine, we'd provide syncobjs to wait and
>       >>>>signal just like
>       >>>>   we do
>       >>>>   >     in execbuf().
>       >>>>   >     The important thing is that we don't want one type of
>       >>>>operation to
>       >>>>   block
>       >>>>   >     on the other.  If immediate binds are blocking on sparse
>       binds,
>       >>>>   it's
>       >>>>   >     going to cause over-synchronization issues.
>       >>>>   >     In terms of the internal implementation, I know that
>       >>>>there's going
>       >>>>   to be
>       >>>>   >     a lock on the VM and that we can't actually do these
>       things in
>       >>>>   >     parallel.  That's fine.  Once the dma_fences have
>       signaled and
>       >>>>   we're
>       >>>>
       >>>>   That's correct. It is like a single VM_BIND engine with
>       >>>>multiple queues
>       >>>>   feeding to it.
>       >>>>
>       >>>> Right.  As long as the queues themselves are independent and
>       >>>>can block on
>       >>>> dma_fences without holding up other queues, I think we're fine.
>       >>>>
>       >>>>   >     unblocked to do the bind operation, I don't care if
>       >>>>there's a bit
>       >>>>   of
>       >>>>   >     synchronization due to locking.  That's expected.  What
>       >>>>we can't
>       >>>>   afford
>       >>>>   >     to have is an immediate bind operation suddenly blocking
>       on a
>       >>>>   sparse
>       >>>>   >     operation which is blocked on a compute job that's going
>       to run
>       >>>>   for
>       >>>>   >     another 5ms.
>       >>>>
>       >>>>   As the VM_BIND queue is per VM, VM_BIND on one VM doesn't block
>       the
>       >>>>   VM_BIND
>       >>>>   on other VMs. I am not sure about usecases here, but just
>       wanted to
>       >>>>   clarify.
>       >>>>
>       >>>> Yes, that's what I would expect.
>       >>>> --Jason
>       >>>>
>       >>>>   Niranjana
>       >>>>
>       >>>>   >     For reference, Windows solves this by allowing
>       arbitrarily many
>       >>>>   paging
>       >>>>   >     queues (what they call a VM_BIND engine/queue).  That
>       >>>>design works
>       >>>>   >     pretty well and solves the problems in question. 
>       >>>>Again, we could
>       >>>>   just
>       >>>>   >     make everything out-of-order and require using syncobjs
>       >>>>to order
>       >>>>   things
>       >>>>   >     as userspace wants. That'd be fine too.
>       >>>>   >     One more note while I'm here: danvet said something on
>       >>>>IRC about
>       >>>>   VM_BIND
>       >>>>   >     queues waiting for syncobjs to materialize.  We don't
>       really
>       >>>>   want/need
>       >>>>   >     this.  We already have all the machinery in userspace to
>       handle
>       >>>>   >     wait-before-signal and waiting for syncobj fences to
>       >>>>materialize
>       >>>>   and
>       >>>>   >     that machinery is on by default.  It would actually
>       >>>>take MORE work
>       >>>>   in
>       >>>>   >     Mesa to turn it off and take advantage of the kernel
>       >>>>being able to
>       >>>>   wait
>       >>>>   >     for syncobjs to materialize.  Also, getting that right is
>       >>>>   ridiculously
>       >>>>   >     hard and I really don't want to get it wrong in kernel
>       >>>>space.     When we
>       >>>>   >     do memory fences, wait-before-signal will be a thing.  We
>       don't
>       >>>>   need to
>       >>>>   >     try and make it a thing for syncobj.
>       >>>>   >     --Jason
>       >>>>   >
>       >>>>   >   Thanks Jason,
>       >>>>   >
>       >>>>   >   I missed the bit in the Vulkan spec that we're allowed to
>       have a
>       >>>>   sparse
>       >>>>   >   queue that does not implement either graphics or compute
>       >>>>operations
>       >>>>   :
>       >>>>   >
>       >>>>   >     "While some implementations may include
>       >>>>   VK_QUEUE_SPARSE_BINDING_BIT
>       >>>>   >     support in queue families that also include
>       >>>>   >
>       >>>>   >      graphics and compute support, other implementations may
>       only
>       >>>>   expose a
>       >>>>   >     VK_QUEUE_SPARSE_BINDING_BIT-only queue
>       >>>>   >
>       >>>>   >      family."
>       >>>>   >
>       >>>>   >   So it can all be all a vm_bind engine that just does
>       bind/unbind
>       >>>>   >   operations.
>       >>>>   >
>       >>>>   >   But yes we need another engine for the immediate/non-sparse
>       >>>>   operations.
>       >>>>   >
>       >>>>   >   -Lionel
>       >>>>   >
>       >>>>   >         >
>       >>>>   >       Daniel, any thoughts?
>       >>>>   >
>       >>>>   >       Niranjana
>       >>>>   >
>       >>>>   >       >Matt
>       >>>>   >       >
>       >>>>   >       >>
>       >>>>   >       >> Sorry I noticed this late.
>       >>>>   >       >>
>       >>>>   >       >>
>       >>>>   >       >> -Lionel
>       >>>>   >       >>
>       >>>>   >       >>

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [RFC v3 2/3] drm/i915: Update i915 uapi documentation
  2022-06-08 11:24     ` [Intel-gfx] " Matthew Auld
@ 2022-06-10  1:43       ` Niranjana Vishwanathapura
  -1 siblings, 0 replies; 121+ messages in thread
From: Niranjana Vishwanathapura @ 2022-06-10  1:43 UTC (permalink / raw)
  To: Matthew Auld
  Cc: Matthew Brost, Intel Graphics Development, ML dri-devel,
	Thomas Hellström, Chris Wilson, Jason Ekstrand,
	Daniel Vetter, Christian König

On Wed, Jun 08, 2022 at 12:24:04PM +0100, Matthew Auld wrote:
>On Tue, 17 May 2022 at 19:32, Niranjana Vishwanathapura
><niranjana.vishwanathapura@intel.com> wrote:
>>
>> Add some missing i915 uapi documentation which the new
>> i915 VM_BIND feature documentation will refer to.
>>
>> Signed-off-by: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>
>> ---
>>  include/uapi/drm/i915_drm.h | 153 +++++++++++++++++++++++++++---------
>>  1 file changed, 116 insertions(+), 37 deletions(-)
>>
>> diff --git a/include/uapi/drm/i915_drm.h b/include/uapi/drm/i915_drm.h
>> index a2def7b27009..8c834a31b56f 100644
>> --- a/include/uapi/drm/i915_drm.h
>> +++ b/include/uapi/drm/i915_drm.h
>> @@ -751,9 +751,16 @@ typedef struct drm_i915_irq_wait {
>>
>>  /* Must be kept compact -- no holes and well documented */
>>
>> +/**
>> + * typedef drm_i915_getparam_t - Driver parameter query structure.
>
>This one looks funny in the rendered html for some reason, since it
>doesn't seem to emit the @param and @value, I guess it doesn't really
>understand typedef <struct> ?
>
>Maybe make this "struct drm_i915_getparam - Driver parameter query structure." ?

Thanks Matt.
Yah, there doesn't seem to be a good way to add kernel-doc for this
kind of declaration. 'struct drm_i915_getparam' also didn't help.
I was able to fix it by first defining the structure and then adding
a typedef for it. Not sure if that has any value, but at least we can
get kernel doc for that.
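
i.e. something like this (a sketch; the kernel-doc wording here is
illustrative, not the exact final diff):

/**
 * struct drm_i915_getparam - Driver parameter query structure.
 */
struct drm_i915_getparam {
	/** @param: Driver parameter to query. */
	__s32 param;

	/**
	 * @value: Address of memory where queried value should be put.
	 *
	 * WARNING: Using pointers instead of fixed-size u64 means we need to
	 * write compat32 code. Don't repeat this mistake.
	 */
	int *value;
};

/*
 * Keep the typedef so existing userspace using drm_i915_getparam_t still
 * compiles; kernel-doc now attaches to the struct itself.
 */
typedef struct drm_i915_getparam drm_i915_getparam_t;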

>
>> + */
>>  typedef struct drm_i915_getparam {
>> +       /** @param: Driver parameter to query. */
>>         __s32 param;
>> -       /*
>> +
>> +       /**
>> +        * @value: Address of memory where queried value should be put.
>> +        *
>>          * WARNING: Using pointers instead of fixed-size u64 means we need to write
>>          * compat32 code. Don't repeat this mistake.
>>          */
>> @@ -1239,76 +1246,114 @@ struct drm_i915_gem_exec_object2 {
>>         __u64 rsvd2;
>>  };
>>
>> +/**
>> + * struct drm_i915_gem_exec_fence - An input or output fence for the execbuff
>
>s/execbuff/execbuf/, at least that seems to be what we use elsewhere, AFAICT.
>
>> + * ioctl.
>> + *
>> + * The request will wait for input fence to signal before submission.
>> + *
>> + * The returned output fence will be signaled after the completion of the
>> + * request.
>> + */
>>  struct drm_i915_gem_exec_fence {
>> -       /**
>> -        * User's handle for a drm_syncobj to wait on or signal.
>> -        */
>> +       /** @handle: User's handle for a drm_syncobj to wait on or signal. */
>>         __u32 handle;
>>
>> +       /**
>> +        * @flags: Supported flags are,
>
>are:
>
>> +        *
>> +        * I915_EXEC_FENCE_WAIT:
>> +        * Wait for the input fence before request submission.
>> +        *
>> +        * I915_EXEC_FENCE_SIGNAL:
>> +        * Return request completion fence as output
>> +        */
>> +       __u32 flags;
>>  #define I915_EXEC_FENCE_WAIT            (1<<0)
>>  #define I915_EXEC_FENCE_SIGNAL          (1<<1)
>>  #define __I915_EXEC_FENCE_UNKNOWN_FLAGS (-(I915_EXEC_FENCE_SIGNAL << 1))
>> -       __u32 flags;
>>  };
>>
>> -/*
>> - * See drm_i915_gem_execbuffer_ext_timeline_fences.
>> - */
>> -#define DRM_I915_GEM_EXECBUFFER_EXT_TIMELINE_FENCES 0
>> -
>> -/*
>> +/**
>> + * struct drm_i915_gem_execbuffer_ext_timeline_fences - Timeline fences
>> + * for execbuff.
>> + *
>>   * This structure describes an array of drm_syncobj and associated points for
>>   * timeline variants of drm_syncobj. It is invalid to append this structure to
>>   * the execbuf if I915_EXEC_FENCE_ARRAY is set.
>>   */
>>  struct drm_i915_gem_execbuffer_ext_timeline_fences {
>> +#define DRM_I915_GEM_EXECBUFFER_EXT_TIMELINE_FENCES 0
>> +       /** @base: Extension link. See struct i915_user_extension. */
>>         struct i915_user_extension base;
>>
>>         /**
>> -        * Number of element in the handles_ptr & value_ptr arrays.
>> +        * @fence_count: Number of element in the @handles_ptr & @value_ptr
>
>s/element/elements/
>
>> +        * arrays.
>>          */
>>         __u64 fence_count;
>>
>>         /**
>> -        * Pointer to an array of struct drm_i915_gem_exec_fence of length
>> -        * fence_count.
>> +        * @handles_ptr: Pointer to an array of struct drm_i915_gem_exec_fence
>> +        * of length @fence_count.
>>          */
>>         __u64 handles_ptr;
>>
>>         /**
>> -        * Pointer to an array of u64 values of length fence_count. Values
>> -        * must be 0 for a binary drm_syncobj. A Value of 0 for a timeline
>> -        * drm_syncobj is invalid as it turns a drm_syncobj into a binary one.
>> +        * @values_ptr: Pointer to an array of u64 values of length
>> +        * @fence_count.
>> +        * Values must be 0 for a binary drm_syncobj. A Value of 0 for a
>> +        * timeline drm_syncobj is invalid as it turns a drm_syncobj into a
>> +        * binary one.
>>          */
>>         __u64 values_ptr;
>>  };
>>
>> +/**
>> + * struct drm_i915_gem_execbuffer2 - Structure for execbuff submission
>> + */
>>  struct drm_i915_gem_execbuffer2 {
>> -       /**
>> -        * List of gem_exec_object2 structs
>> -        */
>> +       /** @buffers_ptr: Pointer to a list of gem_exec_object2 structs */
>>         __u64 buffers_ptr;
>> +
>> +       /** @buffer_count: Number of elements in @buffers_ptr array */
>>         __u32 buffer_count;
>>
>> -       /** Offset in the batchbuffer to start execution from. */
>> +       /**
>> +        * @batch_start_offset: Offset in the batchbuffer to start execution
>> +        * from.
>> +        */
>>         __u32 batch_start_offset;
>> -       /** Bytes used in batchbuffer from batch_start_offset */
>> +
>> +       /** @batch_len: Bytes used in batchbuffer from batch_start_offset */
>
>"Length in bytes of the batchbuffer, otherwise assumed to be the
>object size if zero, starting from the @batch_start_offset."
>
>>         __u32 batch_len;
>> +
>> +       /** @DR1: deprecated */
>>         __u32 DR1;
>> +
>> +       /** @DR4: deprecated */
>>         __u32 DR4;
>> +
>> +       /** @num_cliprects: See @cliprects_ptr */
>>         __u32 num_cliprects;
>> +
>>         /**
>> -        * This is a struct drm_clip_rect *cliprects if I915_EXEC_FENCE_ARRAY
>> -        * & I915_EXEC_USE_EXTENSIONS are not set.
>> +        * @cliprects_ptr: Kernel clipping was a DRI1 misfeature.
>> +        *
>> +        * It is invalid to use this field if I915_EXEC_FENCE_ARRAY or
>> +        * I915_EXEC_USE_EXTENSIONS flags are not set.
>>          *
>>          * If I915_EXEC_FENCE_ARRAY is set, then this is a pointer to an array
>> -        * of struct drm_i915_gem_exec_fence and num_cliprects is the length
>> -        * of the array.
>> +        * of &drm_i915_gem_exec_fence and @num_cliprects is the length of the
>> +        * array.
>>          *
>>          * If I915_EXEC_USE_EXTENSIONS is set, then this is a pointer to a
>> -        * single struct i915_user_extension and num_cliprects is 0.
>> +        * single &i915_user_extension and num_cliprects is 0.
>>          */
>>         __u64 cliprects_ptr;
>> +
>> +       /** @flags: Execbuff flags */
>
>s/Execbuff/Execbuf/
>
>Could maybe document the I915_EXEC_* also, or maybe not ;)
>

We no longer need to refer to execbuf2, as vm_bind will have its own new
execbuf3. But we will keep the already-added execbuf2 documentation.
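
For illustration, the kind of shape execbuf3 could take once there is no
BO list to pass (the field names here are my guesses, not the posted uapi;
patch 3 of the series has the actual definition):

struct drm_i915_gem_execbuffer3 {
	/** @ctx_id: Context id; the VM to use comes from the context. */
	__u32 ctx_id;

	/** @engine_idx: Index of the engine in the context engine map. */
	__u32 engine_idx;

	/** @batch_address: VA of the batch in the VM_BIND-managed space. */
	__u64 batch_address;

	/** @flags: MBZ in this sketch. */
	__u64 flags;

	/** @fence_count: Number of entries at @timeline_fences. */
	__u64 fence_count;

	/** @timeline_fences: Pointer to in/out drm_syncobj timeline fences. */
	__u64 timeline_fences;

	/** @extensions: Zero-terminated chain of extensions. */
	__u64 extensions;
};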

>> +       __u64 flags;
>>  #define I915_EXEC_RING_MASK              (0x3f)
>>  #define I915_EXEC_DEFAULT                (0<<0)
>>  #define I915_EXEC_RENDER                 (1<<0)
>> @@ -1326,10 +1371,6 @@ struct drm_i915_gem_execbuffer2 {
>>  #define I915_EXEC_CONSTANTS_REL_GENERAL (0<<6) /* default */
>>  #define I915_EXEC_CONSTANTS_ABSOLUTE   (1<<6)
>>  #define I915_EXEC_CONSTANTS_REL_SURFACE (2<<6) /* gen4/5 only */
>> -       __u64 flags;
>> -       __u64 rsvd1; /* now used for context info */
>> -       __u64 rsvd2;
>> -};
>>
>>  /** Resets the SO write offset registers for transform feedback on gen7. */
>>  #define I915_EXEC_GEN7_SOL_RESET       (1<<8)
>> @@ -1432,9 +1473,23 @@ struct drm_i915_gem_execbuffer2 {
>>   * drm_i915_gem_execbuffer_ext enum.
>>   */
>>  #define I915_EXEC_USE_EXTENSIONS       (1 << 21)
>> -
>>  #define __I915_EXEC_UNKNOWN_FLAGS (-(I915_EXEC_USE_EXTENSIONS << 1))
>>
>> +       /** @rsvd1: Context id */
>> +       __u64 rsvd1;
>> +
>> +       /**
>> +        * @rsvd2: in and out sync_file file descriptors.
>> +        *
>> +        * When I915_EXEC_FENCE_IN or I915_EXEC_FENCE_SUBMIT flag is set, the
>> +        * lower 32 bits of this field will have the in sync_file fd (input).
>> +        *
>> +        * When I915_EXEC_FENCE_OUT flag is set, the upper 32 bits of this
>> +        * field will have the out sync_file fd (output).
>> +        */
>> +       __u64 rsvd2;
>> +};
>> +
>>  #define I915_EXEC_CONTEXT_ID_MASK      (0xffffffff)
>>  #define i915_execbuffer2_set_context_id(eb2, context) \
>>         (eb2).rsvd1 = context & I915_EXEC_CONTEXT_ID_MASK
>> @@ -1814,13 +1869,32 @@ struct drm_i915_gem_context_create {
>>         __u32 pad;
>>  };
>>
>> +/**
>> + * struct drm_i915_gem_context_create_ext - Structure for creating contexts.
>> + */
>>  struct drm_i915_gem_context_create_ext {
>> -       __u32 ctx_id; /* output: id of new context*/
>> +       /** @ctx_id: Id of the created context (output) */
>> +       __u32 ctx_id;
>> +
>> +       /**
>> +        * @flags: Supported flags are,
>
>are:
>
>> +        *
>> +        * I915_CONTEXT_CREATE_FLAGS_USE_EXTENSIONS:
>> +        *
>> +        * Extensions may be appended to this structure and driver must check
>> +        * for those.
>
>Maybe add "See @extensions.", and then....
>
>> +        *
>> +        * I915_CONTEXT_CREATE_FLAGS_SINGLE_TIMELINE
>> +        *
>> +        * Created context will have single timeline.
>> +        */
>>         __u32 flags;
>>  #define I915_CONTEXT_CREATE_FLAGS_USE_EXTENSIONS       (1u << 0)
>>  #define I915_CONTEXT_CREATE_FLAGS_SINGLE_TIMELINE      (1u << 1)
>>  #define I915_CONTEXT_CREATE_FLAGS_UNKNOWN \
>>         (-(I915_CONTEXT_CREATE_FLAGS_SINGLE_TIMELINE << 1))
>> +
>> +       /** @extensions: Zero-terminated chain of extensions. */
>
>...here perhaps list the extensions, and maybe also move the #define
>for each here? See for example @extensions in drm_i915_gem_create_ext.
>

Ok, will address all your comments above.

Niranjana

>Reviewed-by: Matthew Auld <matthew.auld@intel.com>
>
>>         __u64 extensions;
>>  };
>>
>> @@ -2387,7 +2461,9 @@ struct drm_i915_gem_context_destroy {
>>         __u32 pad;
>>  };
>>
>> -/*
>> +/**
>> + * struct drm_i915_gem_vm_control - Structure to create or destroy VM.
>> + *
>>   * DRM_I915_GEM_VM_CREATE -
>>   *
>>   * Create a new virtual memory address space (ppGTT) for use within a context
>> @@ -2397,20 +2473,23 @@ struct drm_i915_gem_context_destroy {
>>   * The id of new VM (bound to the fd) for use with I915_CONTEXT_PARAM_VM is
>>   * returned in the outparam @id.
>>   *
>> - * No flags are defined, with all bits reserved and must be zero.
>> - *
>>   * An extension chain maybe provided, starting with @extensions, and terminated
>>   * by the @next_extension being 0. Currently, no extensions are defined.
>>   *
>>   * DRM_I915_GEM_VM_DESTROY -
>>   *
>> - * Destroys a previously created VM id, specified in @id.
>> + * Destroys a previously created VM id, specified in @vm_id.
>>   *
>>   * No extensions or flags are allowed currently, and so must be zero.
>>   */
>>  struct drm_i915_gem_vm_control {
>> +       /** @extensions: Zero-terminated chain of extensions. */
>>         __u64 extensions;
>> +
>> +       /** @flags: reserved for future usage, currently MBZ */
>>         __u32 flags;
>> +
>> +       /** @vm_id: Id of the VM created or to be destroyed */
>>         __u32 vm_id;
>>  };
>>
>> --
>> 2.21.0.rc0.32.g243a4c7e27
>>

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [Intel-gfx] [RFC v3 1/3] drm/doc/rfc: VM_BIND feature design document
  2022-06-09 19:31                               ` Niranjana Vishwanathapura
@ 2022-06-10  6:53                                 ` Lionel Landwerlin
  -1 siblings, 0 replies; 121+ messages in thread
From: Lionel Landwerlin @ 2022-06-10  6:53 UTC (permalink / raw)
  To: Niranjana Vishwanathapura
  Cc: Intel GFX, Maling list - DRI developers, Thomas Hellstrom,
	Chris Wilson, Jason Ekstrand, Daniel Vetter,
	Christian König

On 09/06/2022 22:31, Niranjana Vishwanathapura wrote:
> On Thu, Jun 09, 2022 at 05:49:09PM +0300, Lionel Landwerlin wrote:
>>   On 09/06/2022 00:55, Jason Ekstrand wrote:
>>
>>     On Wed, Jun 8, 2022 at 4:44 PM Niranjana Vishwanathapura
>>     <niranjana.vishwanathapura@intel.com> wrote:
>>
>>       On Wed, Jun 08, 2022 at 08:33:25AM +0100, Tvrtko Ursulin wrote:
>>       >
>>       >
>>       >On 07/06/2022 22:32, Niranjana Vishwanathapura wrote:
>>       >>On Tue, Jun 07, 2022 at 11:18:11AM -0700, Niranjana 
>> Vishwanathapura
>>       wrote:
>>       >>>On Tue, Jun 07, 2022 at 12:12:03PM -0500, Jason Ekstrand wrote:
>>       >>>> On Fri, Jun 3, 2022 at 6:52 PM Niranjana Vishwanathapura
>>       >>>> <niranjana.vishwanathapura@intel.com> wrote:
>>       >>>>
>>       >>>>   On Fri, Jun 03, 2022 at 10:20:25AM +0300, Lionel Landwerlin
>>       wrote:
>>       >>>>   >   On 02/06/2022 23:35, Jason Ekstrand wrote:
>>       >>>>   >
>>       >>>>   >     On Thu, Jun 2, 2022 at 3:11 PM Niranjana 
>> Vishwanathapura
>>       >>>>   > <niranjana.vishwanathapura@intel.com> wrote:
>>       >>>>   >
>>       >>>>   >       On Wed, Jun 01, 2022 at 01:28:36PM -0700, Matthew
>>       >>>>Brost wrote:
>>       >>>>   >       >On Wed, Jun 01, 2022 at 05:25:49PM +0300, Lionel
>>       Landwerlin
>>       >>>>   wrote:
>>       >>>>   >       >> On 17/05/2022 21:32, Niranjana Vishwanathapura
>>       wrote:
>>       >>>>   >       >> > +VM_BIND/UNBIND ioctl will immediately start
>>       >>>>   binding/unbinding
>>       >>>>   >       the mapping in an
>>       >>>>   >       >> > +async worker. The binding and unbinding will
>>       >>>>work like a
>>       >>>>   special
>>       >>>>   >       GPU engine.
>>       >>>>   >       >> > +The binding and unbinding operations are
>>       serialized and
>>       >>>>   will
>>       >>>>   >       wait on specified
>>       >>>>   >       >> > +input fences before the operation and will 
>> signal
>>       the
>>       >>>>   output
>>       >>>>   >       fences upon the
>>       >>>>   >       >> > +completion of the operation. Due to
>>       serialization,
>>       >>>>   completion of
>>       >>>>   >       an operation
>>       >>>>   >       >> > +will also indicate that all previous 
>> operations
>>       >>>>are also
>>       >>>>   >       complete.
>>       >>>>   >       >>
>>       >>>>   >       >> I guess we should avoid saying "will immediately
>>       start
>>       >>>>   >       binding/unbinding" if
>>       >>>>   >       >> there are fences involved.
>>       >>>>   >       >>
>>       >>>>   >       >> And the fact that it's happening in an async
>>       >>>>worker seem to
>>       >>>>   imply
>>       >>>>   >       it's not
>>       >>>>   >       >> immediate.
>>       >>>>   >       >>
>>       >>>>   >
>>       >>>>   >       Ok, will fix.
>>       >>>>   >       This was added because in earlier design binding 
>> was
>>       deferred
>>       >>>>   until
>>       >>>>   >       next execbuff.
>>       >>>>   >       But now it is non-deferred (immediate in that 
>> sense).
>>       >>>>But yah,
>>       >>>>   this is
>>       >>>>   >       confusing
>>       >>>>   >       and will fix it.
>>       >>>>   >
>>       >>>>   >       >>
>>       >>>>   >       >> I have a question on the behavior of the bind
>>       >>>>operation when
>>       >>>>   no
>>       >>>>   >       input fence
>>       >>>>   >       >> is provided. Let say I do :
>>       >>>>   >       >>
>>       >>>>   >       >> VM_BIND (out_fence=fence1)
>>       >>>>   >       >>
>>       >>>>   >       >> VM_BIND (out_fence=fence2)
>>       >>>>   >       >>
>>       >>>>   >       >> VM_BIND (out_fence=fence3)
>>       >>>>   >       >>
>>       >>>>   >       >>
>>       >>>>   >       >> In what order are the fences going to be 
>> signaled?
>>       >>>>   >       >>
>>       >>>>   >       >> In the order of VM_BIND ioctls? Or out of order?
>>       >>>>   >       >>
>>       >>>>   >       >> Because you wrote "serialized I assume it's : in
>>       order
>>       >>>>   >       >>
>>       >>>>   >
>>       >>>>   >       Yes, in the order of VM_BIND/UNBIND ioctls. Note 
>> that
>>       >>>>bind and
>>       >>>>   unbind
>>       >>>>   >       will use
>>       >>>>   >       the same queue and hence are ordered.
>>       >>>>   >
>>       >>>>   >       >>
>>       >>>>   >       >> One thing I didn't realize is that because we 
>> only
>>       get one
>>       >>>>   >       "VM_BIND" engine,
>>       >>>>   >       >> there is a disconnect from the Vulkan 
>> specification.
>>       >>>>   >       >>
>>       >>>>   >       >> In Vulkan VM_BIND operations are serialized but
>>       >>>>per engine.
>>       >>>>   >       >>
>>       >>>>   >       >> So you could have something like this :
>>       >>>>   >       >>
>>       >>>>   >       >> VM_BIND (engine=rcs0, in_fence=fence1,
>>       out_fence=fence2)
>>       >>>>   >       >>
>>       >>>>   >       >> VM_BIND (engine=ccs0, in_fence=fence3,
>>       out_fence=fence4)
>>       >>>>   >       >>
>>       >>>>   >       >>
>>       >>>>   >       >> fence1 is not signaled
>>       >>>>   >       >>
>>       >>>>   >       >> fence3 is signaled
>>       >>>>   >       >>
>>       >>>>   >       >> So the second VM_BIND will proceed before the
>>       >>>>first VM_BIND.
>>       >>>>   >       >>
>>       >>>>   >       >>
>>       >>>>   >       >> I guess we can deal with that scenario in
>>       >>>>userspace by doing
>>       >>>>   the
>>       >>>>   >       wait
>>       >>>>   >       >> ourselves in one thread per engines.
>>       >>>>   >       >>
>>       >>>>   >       >> But then it makes the VM_BIND input fences 
>> useless.
>>       >>>>   >       >>
>>       >>>>   >       >>
>>       >>>>   >       >> Daniel : what do you think? Should be rework 
>> this or
>>       just
>>       >>>>   deal with
>>       >>>>   >       wait
>>       >>>>   >       >> fences in userspace?
>>       >>>>   >       >>
>>       >>>>   >       >
>>       >>>>   >       >My opinion is rework this but make the ordering 
>> via
>>       >>>>an engine
>>       >>>>   param
>>       >>>>   >       optional.
>>       >>>>   >       >
>>       >>>>   >       >e.g. A VM can be configured so all binds are 
>> ordered
>>       >>>>within the
>>       >>>>   VM
>>       >>>>   >       >
>>       >>>>   >       >e.g. A VM can be configured so all binds accept an
>>       engine
>>       >>>>   argument
>>       >>>>   >       (in
>>       >>>>   >       >the case of the i915 likely this is a gem context
>>       >>>>handle) and
>>       >>>>   binds
>>       >>>>   >       >ordered with respect to that engine.
>>       >>>>   >       >
>>       >>>>   >       >This gives UMDs options as the later likely 
>> consumes
>>       >>>>more KMD
>>       >>>>   >       resources
>>       >>>>   >       >so if a different UMD can live with binds being
>>       >>>>ordered within
>>       >>>>   the VM
>>       >>>>   >       >they can use a mode consuming less resources.
>>       >>>>   >       >
>>       >>>>   >
>>       >>>>   >       I think we need to be careful here if we are 
>> looking
>>       for some
>>       >>>>   out of
>>       >>>>   >       (submission) order completion of vm_bind/unbind.
>>       >>>>   >       In-order completion means, in a batch of binds and
>>       >>>>unbinds to be
>>       >>>>   >       completed in-order, user only needs to specify
>>       >>>>in-fence for the
>>       >>>>   >       first bind/unbind call and the our-fence for the 
>> last
>>       >>>>   bind/unbind
>>       >>>>   >       call. Also, the VA released by an unbind call 
>> can be
>>       >>>>re-used by
>>       >>>>   >       any subsequent bind call in that in-order batch.
>>       >>>>   >
>>       >>>>   >       These things will break if binding/unbinding 
>> were to
>>       >>>>be allowed
>>       >>>>   to
>>       >>>>   >       go out of order (of submission) and user need to be
>>       extra
>>       >>>>   careful
>>       >>>>   >       not to run into pre-mature triggereing of 
>> out-fence and
>>       bind
>>       >>>>   failing
>>       >>>>   >       as VA is still in use etc.
>>       >>>>   >
>>       >>>>   >       Also, VM_BIND binds the provided mapping on the
>>       specified
>>       >>>>   address
>>       >>>>   >       space
>>       >>>>   >       (VM). So, the uapi is not engine/context specific.
>>       >>>>   >
>>       >>>>   >       We can however add a 'queue' to the uapi which 
>> can be
>>       >>>>one from
>>       >>>>   the
>>       >>>>   >       pre-defined queues,
>>       >>>>   >       I915_VM_BIND_QUEUE_0
>>       >>>>   >       I915_VM_BIND_QUEUE_1
>>       >>>>   >       ...
>>       >>>>   >       I915_VM_BIND_QUEUE_(N-1)
>>       >>>>   >
>>       >>>>   >       KMD will spawn an async work queue for each 
>> queue which
>>       will
>>       >>>>   only
>>       >>>>   >       bind the mappings on that queue in the order of
>>       submission.
>>       >>>>   >       User can assign the queue to per engine or anything
>>       >>>>like that.
>>       >>>>   >
>>       >>>>   >       But again here, user need to be careful and not
>>       >>>>deadlock these
>>       >>>>   >       queues with circular dependency of fences.
>>       >>>>   >
>>       >>>>   >       I prefer adding this later an as extension based on
>>       >>>>whether it
>>       >>>>   >       is really helping with the implementation.
>>       >>>>   >
>>       >>>>   >     I can tell you right now that having everything on a
>>       single
>>       >>>>   in-order
>>       >>>>   >     queue will not get us the perf we want.  What vulkan
>>       >>>>really wants
>>       >>>>   is one
>>       >>>>   >     of two things:
>>       >>>>   >      1. No implicit ordering of VM_BIND ops.  They just
>>       happen in
>>       >>>>   whatever
>>       >>>>   >     their dependencies are resolved and we ensure 
>> ordering
>>       >>>>ourselves
>>       >>>>   by
>>       >>>>   >     having a syncobj in the VkQueue.
>>       >>>>   >      2. The ability to create multiple VM_BIND 
>> queues.  We
>>       need at
>>       >>>>   least 2
>>       >>>>   >     but I don't see why there needs to be a limit besides
>>       >>>>the limits
>>       >>>>   the
>>       >>>>   >     i915 API already has on the number of engines.  
>> Vulkan
>>       could
>>       >>>>   expose
>>       >>>>   >     multiple sparse binding queues to the client if 
>> it's not
>>       >>>>   arbitrarily
>>       >>>>   >     limited.
>>       >>>>
>>       >>>>   Thanks Jason, Lionel.
>>       >>>>
>>       >>>>   Jason, what are you referring to when you say "limits 
>> the i915
>>       API
>>       >>>>   already
>>       >>>>   has on the number of engines"? I am not sure if there is 
>> such
>>       an uapi
>>       >>>>   today.
>>       >>>>
>>       >>>> There's a limit of something like 64 total engines today 
>> based on
>>       the
>>       >>>> number of bits we can cram into the exec flags in 
>> execbuffer2.  I
>>       think
>>       >>>> someone had an extended version that allowed more but I 
>> ripped it
>>       out
>>       >>>> because no one was using it.  Of course, execbuffer3 might 
>> not
>>       >>>>have that
>>       >>>> problem at all.
>>       >>>>
>>       >>>
>>       >>>Thanks Jason.
>>       >>>Ok, I am not sure which exec flag is that, but yah, execbuffer3
>>       probably
>>       >>>will not have this limiation. So, we need to define a
>>       VM_BIND_MAX_QUEUE
>>       >>>and somehow export it to user (I am thinking of embedding it in
>>       >>>I915_PARAM_HAS_VM_BIND. bits[0]->HAS_VM_BIND, bits[1-3]->'n'
>>       meaning 2^n
>>       >>>queues.
>>       >>
>>       >>Ah, I think you are waking about I915_EXEC_RING_MASK (0x3f) 
>> which
>>       execbuf3
>>
>>     Yup!  That's exactly the limit I was talking about.
>>
>>       >>will also have. So, we can simply define in vm_bind/unbind
>>       structures,
>>       >>
>>       >>#define I915_VM_BIND_MAX_QUEUE   64
>>       >>        __u32 queue;
>>       >>
>>       >>I think that will keep things simple.
>>       >
>>       >Hmmm? What does execbuf2 limit has to do with how many engines
>>       >hardware can have? I suggest not to do that.
>>       >
>>       >Change with added this:
>>       >
>>       >       if (set.num_engines > I915_EXEC_RING_MASK + 1)
>>       >               return -EINVAL;
>>       >
>>       >To context creation needs to be undone and so let users create 
>> engine
>>       >maps with all hardware engines, and let execbuf3 access them all.
>>       >
>>
>>       Earlier plan was to carry I915_EXEC_RING_MAP (0x3f) to 
>> execbuff3 also.
>>       Hence, I was using the same limit for VM_BIND queues (64, or 65 
>> if we
>>       make it N+1).
>>       But, as discussed in other thread of this RFC series, we are 
>> planning
>>       to drop this I915_EXEC_RING_MAP in execbuff3. So, there won't be
>>       any uapi that limits the number of engines (and hence the vm_bind
>>       queues
>>       need to be supported).
>>
>>       If we leave the number of vm_bind queues to be arbitrarily large
>>       (__u32 queue_idx) then, we need to have a hashmap for queue (a wq,
>>       work_item and a linked list) lookup from the user specified queue
>>       index.
>>       Other option is to just put some hard limit (say 64 or 65) and use
>>       an array of queues in VM (each created upon first use). I 
>> prefer this.
>>
>>     I don't get why a VM_BIND queue is any different from any other 
>> queue or
>>     userspace-visible kernel object.  But I'll leave those details up to
>>     danvet or whoever else might be reviewing the implementation.
>>     --Jason
>>
>>   I kind of agree here. Wouldn't be simpler to have the bind queue 
>> created
>>   like the others when we build the engine map?
>>
>>   For userspace it's then just matter of selecting the right queue ID 
>> when
>>   submitting.
>>
>>   If there is ever a possibility to have this work on the GPU, it 
>> would be
>>   all ready.
>>
>
> I did sync offline with Matt Brost on this.
> We can add a VM_BIND engine class and let the user create VM_BIND engines
> (queues). The problem is that in i915 the engine creation interface is
> bound to gem_context. So, in the vm_bind ioctl, we would need both a
> context_id and a queue_idx for proper lookup of the user-created engine.
> This is a bit awkward, as vm_bind is an interface to the VM (address
> space) and has nothing to do with gem_context.


A gem_context has a single vm object, right?

Set through I915_CONTEXT_PARAM_VM at creation or given a default one if not.

So it's just like picking up the vm the way it's done at execbuffer time
right now: eb->context->vm
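
i.e. with the existing uapi userspace can already do the following
(sketch; error handling omitted, fd is assumed to be an open DRM fd):

#include <stdint.h>
#include <sys/ioctl.h>
#include <drm/i915_drm.h>

static __u32 create_ctx_with_vm(int fd)
{
	/* Create an address space (ppGTT); vm.vm_id is the output. */
	struct drm_i915_gem_vm_control vm = { 0 };
	ioctl(fd, DRM_IOCTL_I915_GEM_VM_CREATE, &vm);

	/* Create a context that uses it via I915_CONTEXT_PARAM_VM. */
	struct drm_i915_gem_context_create_ext_setparam setparam = {
		.base  = { .name = I915_CONTEXT_CREATE_EXT_SETPARAM },
		.param = {
			.param = I915_CONTEXT_PARAM_VM,
			.value = vm.vm_id,
		},
	};
	struct drm_i915_gem_context_create_ext create = {
		.flags      = I915_CONTEXT_CREATE_FLAGS_USE_EXTENSIONS,
		.extensions = (uintptr_t)&setparam,
	};
	ioctl(fd, DRM_IOCTL_I915_GEM_CONTEXT_CREATE_EXT, &create);

	/* Any execbuf on this ctx_id resolves eb->context->vm to vm.vm_id. */
	return create.ctx_id;
}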


> Another problem is, if two VMs are binding with the same defined engine,
> binding on VM1 can get unnecessarily blocked by binding on VM2 (which may
> be waiting on its in_fence).


Maybe I'm missing something, but how can you have 2 vm objects with a 
single gem_context right now?


>
> So, my preference here is to just add a 'u32 queue' index in the
> vm_bind/unbind ioctl, and the queues are per VM.
>
> Niranjana
>
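
For illustration, that per-VM 'u32 queue' idea could look roughly like
this in the bind uapi (hypothetical sketch, not the posted RFC definition;
the queue limit of 8 follows the n=3 value discussed above):

struct drm_i915_gem_vm_bind {
	/** @vm_id: Id of the VM whose page tables this bind updates. */
	__u32 vm_id;

	/**
	 * @queue_idx: Per-VM bind queue. Binds/unbinds submitted to the
	 * same queue complete in submission order; different queues are
	 * independent of each other.
	 */
#define I915_VM_BIND_MAX_QUEUE 8
	__u32 queue_idx;

	/** @handle: GEM object handle to bind. */
	__u32 handle;
	__u32 pad;

	/** @start: VA start of the mapping. */
	__u64 start;

	/** @offset: Offset into the object. */
	__u64 offset;

	/** @length: Length of the mapping in bytes. */
	__u64 length;

	/** @flags: MBZ in this sketch. */
	__u64 flags;

	/** @extensions: Zero-terminated chain (e.g. in/out fence data). */
	__u64 extensions;
};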
>>   Thanks,
>>
>>   -Lionel
>>
>>
>>       Niranjana
>>
>>       >Regards,
>>       >
>>       >Tvrtko
>>       >
>>       >>
>>       >>Niranjana
>>       >>
>>       >>>
>>       >>>>   I am trying to see how many queues we need and don't 
>> want it to
>>       be
>>       >>>>   arbitrarily
>>       >>>>   large and unduely blow up memory usage and complexity in 
>> i915
>>       driver.
>>       >>>>
>>       >>>> I expect a Vulkan driver to use at most 2 in the vast 
>> majority
>>       >>>>of cases. I
>>       >>>> could imagine a client wanting to create more than 1 sparse
>>       >>>>queue in which
>>       >>>> case, it'll be N+1 but that's unlikely. As far as complexity
>>       >>>>goes, once
>>       >>>> you allow two, I don't think the complexity is going up by
>>       >>>>allowing N.  As
>>       >>>> for memory usage, creating more queues means more memory.  
>> That's
>>       a
>>       >>>> trade-off that userspace can make. Again, the expected number
>>       >>>>here is 1
>>       >>>> or 2 in the vast majority of cases so I don't think you 
>> need to
>>       worry.
>>       >>>
>>       >>>Ok, will start with n=3 meaning 8 queues.
>>       >>>That would require us create 8 workqueues.
>>       >>>We can change 'n' later if required.
>>       >>>
>>       >>>Niranjana
>>       >>>
>>       >>>>
>>       >>>>   >     Why?  Because Vulkan has two basic kind of bind
>>       >>>>operations and we
>>       >>>>   don't
>>       >>>>   >     want any dependencies between them:
>>       >>>>   >      1. Immediate.  These happen right after BO 
>> creation or
>>       >>>>maybe as
>>       >>>>   part of
>>       >>>>   >     vkBindImageMemory() or VkBindBufferMemory().  These
>>       >>>>don't happen
>>       >>>>   on a
>>       >>>>   >     queue and we don't want them serialized with 
>> anything.       To
>>       >>>>   synchronize
>>       >>>>   >     with submit, we'll have a syncobj in the VkDevice 
>> which
>>       is
>>       >>>>   signaled by
>>       >>>>   >     all immediate bind operations and make submits 
>> wait on
>>       it.
>>       >>>>   >      2. Queued (sparse): These happen on a VkQueue 
>> which may
>>       be the
>>       >>>>   same as
>>       >>>>   >     a render/compute queue or may be its own queue.  
>> It's up
>>       to us
>>       >>>>   what we
>>       >>>>   >     want to advertise.  From the Vulkan API PoV, this 
>> is like
>>       any
>>       >>>>   other
>>       >>>>   >     queue.  Operations on it wait on and signal 
>> semaphores.       If we
>>       >>>>   have a
>>       >>>>   >     VM_BIND engine, we'd provide syncobjs to wait and
>>       >>>>signal just like
>>       >>>>   we do
>>       >>>>   >     in execbuf().
>>       >>>>   >     The important thing is that we don't want one type of
>>       >>>>operation to
>>       >>>>   block
>>       >>>>   >     on the other.  If immediate binds are blocking on 
>> sparse
>>       binds,
>>       >>>>   it's
>>       >>>>   >     going to cause over-synchronization issues.
>>       >>>>   >     In terms of the internal implementation, I know that
>>       >>>>there's going
>>       >>>>   to be
>>       >>>>   >     a lock on the VM and that we can't actually do these
>>       things in
>>       >>>>   >     parallel.  That's fine.  Once the dma_fences have
>>       signaled and
>>       >>>>   we're
>>       >>>>
>>       >>>>   Thats correct. It is like a single VM_BIND engine with
>>       >>>>multiple queues
>>       >>>>   feeding to it.
>>       >>>>
>>       >>>> Right.  As long as the queues themselves are independent and
>>       >>>>can block on
>>       >>>> dma_fences without holding up other queues, I think we're 
>> fine.
>>       >>>>
>>       >>>>   >     unblocked to do the bind operation, I don't care if
>>       >>>>there's a bit
>>       >>>>   of
>>       >>>>   >     synchronization due to locking.  That's expected.  
>> What
>>       >>>>we can't
>>       >>>>   afford
>>       >>>>   >     to have is an immediate bind operation suddenly 
>> blocking
>>       on a
>>       >>>>   sparse
>>       >>>>   >     operation which is blocked on a compute job that's 
>> going
>>       to run
>>       >>>>   for
>>       >>>>   >     another 5ms.
>>       >>>>
>>       >>>>   As the VM_BIND queue is per VM, VM_BIND on one VM 
>> doesn't block
>>       the
>>       >>>>   VM_BIND
>>       >>>>   on other VMs. I am not sure about usecases here, but just
>>       wanted to
>>       >>>>   clarify.
>>       >>>>
>>       >>>> Yes, that's what I would expect.
>>       >>>> --Jason
>>       >>>>
>>       >>>>   Niranjana
>>       >>>>
>>       >>>>   >     For reference, Windows solves this by allowing
>>       arbitrarily many
>>       >>>>   paging
>>       >>>>   >     queues (what they call a VM_BIND engine/queue).  That
>>       >>>>design works
>>       >>>>   >     pretty well and solves the problems in question. 
>>       >>>>Again, we could
>>       >>>>   just
>>       >>>>   >     make everything out-of-order and require using 
>> syncobjs
>>       >>>>to order
>>       >>>>   things
>>       >>>>   >     as userspace wants. That'd be fine too.
>>       >>>>   >     One more note while I'm here: danvet said 
>> something on
>>       >>>>IRC about
>>       >>>>   VM_BIND
>>       >>>>   >     queues waiting for syncobjs to materialize.  We don't
>>       really
>>       >>>>   want/need
>>       >>>>   >     this.  We already have all the machinery in 
>> userspace to
>>       handle
>>       >>>>   >     wait-before-signal and waiting for syncobj fences to
>>       >>>>materialize
>>       >>>>   and
>>       >>>>   >     that machinery is on by default.  It would actually
>>       >>>>take MORE work
>>       >>>>   in
>>       >>>>   >     Mesa to turn it off and take advantage of the kernel
>>       >>>>being able to
>>       >>>>   wait
>>       >>>>   >     for syncobjs to materialize. Also, getting that 
>> right is
>>       >>>>   ridiculously
>>       >>>>   >     hard and I really don't want to get it wrong in 
>> kernel
>>       >>>>space.     When we
>>       >>>>   >     do memory fences, wait-before-signal will be a 
>> thing.  We
>>       don't
>>       >>>>   need to
>>       >>>>   >     try and make it a thing for syncobj.
>>       >>>>   >     --Jason
>>       >>>>   >
>>       >>>>   >   Thanks Jason,
>>       >>>>   >
>>       >>>>   >   I missed the bit in the Vulkan spec that we're 
>> allowed to
>>       have a
>>       >>>>   sparse
>>       >>>>   >   queue that does not implement either graphics or 
>> compute
>>       >>>>operations
>>       >>>>   :
>>       >>>>   >
>>       >>>>   >     "While some implementations may include
>>       >>>>   VK_QUEUE_SPARSE_BINDING_BIT
>>       >>>>   >     support in queue families that also include
>>       >>>>   >
>>       >>>>   >      graphics and compute support, other 
>> implementations may
>>       only
>>       >>>>   expose a
>>       >>>>   > VK_QUEUE_SPARSE_BINDING_BIT-only queue
>>       >>>>   >
>>       >>>>   >      family."
>>       >>>>   >
>>       >>>>   >   So it can all be all a vm_bind engine that just does
>>       bind/unbind
>>       >>>>   >   operations.
>>       >>>>   >
>>       >>>>   >   But yes we need another engine for the 
>> immediate/non-sparse
>>       >>>>   operations.
>>       >>>>   >
>>       >>>>   >   -Lionel
>>       >>>>   >
>>       >>>>   >         >
>>       >>>>   >       Daniel, any thoughts?
>>       >>>>   >
>>       >>>>   >       Niranjana
>>       >>>>   >
>>       >>>>   >       >Matt
>>       >>>>   >       >
>>       >>>>   >       >>
>>       >>>>   >       >> Sorry I noticed this late.
>>       >>>>   >       >>
>>       >>>>   >       >>
>>       >>>>   >       >> -Lionel
>>       >>>>   >       >>
>>       >>>>   >       >>



^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [Intel-gfx] [RFC v3 1/3] drm/doc/rfc: VM_BIND feature design document
@ 2022-06-10  6:53                                 ` Lionel Landwerlin
  0 siblings, 0 replies; 121+ messages in thread
From: Lionel Landwerlin @ 2022-06-10  6:53 UTC (permalink / raw)
  To: Niranjana Vishwanathapura
  Cc: Intel GFX, Maling list - DRI developers, Thomas Hellstrom,
	Chris Wilson, Daniel Vetter, Christian König

On 09/06/2022 22:31, Niranjana Vishwanathapura wrote:
> On Thu, Jun 09, 2022 at 05:49:09PM +0300, Lionel Landwerlin wrote:
>>   On 09/06/2022 00:55, Jason Ekstrand wrote:
>>
>>     On Wed, Jun 8, 2022 at 4:44 PM Niranjana Vishwanathapura
>>     <niranjana.vishwanathapura@intel.com> wrote:
>>
>>       On Wed, Jun 08, 2022 at 08:33:25AM +0100, Tvrtko Ursulin wrote:
>>       >
>>       >
>>       >On 07/06/2022 22:32, Niranjana Vishwanathapura wrote:
>>       >>On Tue, Jun 07, 2022 at 11:18:11AM -0700, Niranjana 
>> Vishwanathapura
>>       wrote:
>>       >>>On Tue, Jun 07, 2022 at 12:12:03PM -0500, Jason Ekstrand wrote:
>>       >>>> On Fri, Jun 3, 2022 at 6:52 PM Niranjana Vishwanathapura
>>       >>>> <niranjana.vishwanathapura@intel.com> wrote:
>>       >>>>
>>       >>>>   On Fri, Jun 03, 2022 at 10:20:25AM +0300, Lionel Landwerlin
>>       wrote:
>>       >>>>   >   On 02/06/2022 23:35, Jason Ekstrand wrote:
>>       >>>>   >
>>       >>>>   >     On Thu, Jun 2, 2022 at 3:11 PM Niranjana 
>> Vishwanathapura
>>       >>>>   > <niranjana.vishwanathapura@intel.com> wrote:
>>       >>>>   >
>>       >>>>   >       On Wed, Jun 01, 2022 at 01:28:36PM -0700, Matthew
>>       >>>>Brost wrote:
>>       >>>>   >       >On Wed, Jun 01, 2022 at 05:25:49PM +0300, Lionel
>>       Landwerlin
>>       >>>>   wrote:
>>       >>>>   >       >> On 17/05/2022 21:32, Niranjana Vishwanathapura
>>       wrote:
>>       >>>>   >       >> > +VM_BIND/UNBIND ioctl will immediately start
>>       >>>>   binding/unbinding
>>       >>>>   >       the mapping in an
>>       >>>>   >       >> > +async worker. The binding and unbinding will
>>       >>>>work like a
>>       >>>>   special
>>       >>>>   >       GPU engine.
>>       >>>>   >       >> > +The binding and unbinding operations are
>>       serialized and
>>       >>>>   will
>>       >>>>   >       wait on specified
>>       >>>>   >       >> > +input fences before the operation and will 
>> signal
>>       the
>>       >>>>   output
>>       >>>>   >       fences upon the
>>       >>>>   >       >> > +completion of the operation. Due to
>>       serialization,
>>       >>>>   completion of
>>       >>>>   >       an operation
>>       >>>>   >       >> > +will also indicate that all previous 
>> operations
>>       >>>>are also
>>       >>>>   >       complete.
>>       >>>>   >       >>
>>       >>>>   >       >> I guess we should avoid saying "will immediately
>>       start
>>       >>>>   >       binding/unbinding" if
>>       >>>>   >       >> there are fences involved.
>>       >>>>   >       >>
>>       >>>>   >       >> And the fact that it's happening in an async
>>       >>>>worker seem to
>>       >>>>   imply
>>       >>>>   >       it's not
>>       >>>>   >       >> immediate.
>>       >>>>   >       >>
>>       >>>>   >
>>       >>>>   >       Ok, will fix.
>>       >>>>   >       This was added because in earlier design binding 
>> was
>>       deferred
>>       >>>>   until
>>       >>>>   >       next execbuff.
>>       >>>>   >       But now it is non-deferred (immediate in that 
>> sense).
>>       >>>>But yah,
>>       >>>>   this is
>>       >>>>   >       confusing
>>       >>>>   >       and will fix it.
>>       >>>>   >
>>       >>>>   >       >>
>>       >>>>   >       >> I have a question on the behavior of the bind
>>       >>>>operation when
>>       >>>>   no
>>       >>>>   >       input fence
>>       >>>>   >       >> is provided. Let say I do :
>>       >>>>   >       >>
>>       >>>>   >       >> VM_BIND (out_fence=fence1)
>>       >>>>   >       >>
>>       >>>>   >       >> VM_BIND (out_fence=fence2)
>>       >>>>   >       >>
>>       >>>>   >       >> VM_BIND (out_fence=fence3)
>>       >>>>   >       >>
>>       >>>>   >       >>
>>       >>>>   >       >> In what order are the fences going to be 
>> signaled?
>>       >>>>   >       >>
>>       >>>>   >       >> In the order of VM_BIND ioctls? Or out of order?
>>       >>>>   >       >>
>>       >>>>   >       >> Because you wrote "serialized I assume it's : in
>>       order
>>       >>>>   >       >>
>>       >>>>   >
>>       >>>>   >       Yes, in the order of VM_BIND/UNBIND ioctls. Note 
>> that
>>       >>>>bind and
>>       >>>>   unbind
>>       >>>>   >       will use
>>       >>>>   >       the same queue and hence are ordered.
>>       >>>>   >
>>       >>>>   >       >>
>>       >>>>   >       >> One thing I didn't realize is that because we 
>> only
>>       get one
>>       >>>>   >       "VM_BIND" engine,
>>       >>>>   >       >> there is a disconnect from the Vulkan 
>> specification.
>>       >>>>   >       >>
>>       >>>>   >       >> In Vulkan VM_BIND operations are serialized but
>>       >>>>per engine.
>>       >>>>   >       >>
>>       >>>>   >       >> So you could have something like this :
>>       >>>>   >       >>
>>       >>>>   >       >> VM_BIND (engine=rcs0, in_fence=fence1,
>>       out_fence=fence2)
>>       >>>>   >       >>
>>       >>>>   >       >> VM_BIND (engine=ccs0, in_fence=fence3,
>>       out_fence=fence4)
>>       >>>>   >       >>
>>       >>>>   >       >>
>>       >>>>   >       >> fence1 is not signaled
>>       >>>>   >       >>
>>       >>>>   >       >> fence3 is signaled
>>       >>>>   >       >>
>>       >>>>   >       >> So the second VM_BIND will proceed before the
>>       >>>>first VM_BIND.
>>       >>>>   >       >>
>>       >>>>   >       >>
>>       >>>>   >       >> I guess we can deal with that scenario in
>>       >>>>userspace by doing
>>       >>>>   the
>>       >>>>   >       wait
>>       >>>>   >       >> ourselves in one thread per engines.
>>       >>>>   >       >>
>>       >>>>   >       >> But then it makes the VM_BIND input fences 
>> useless.
>>       >>>>   >       >>
>>       >>>>   >       >>
>>       >>>>   >       >> Daniel : what do you think? Should be rework 
>> this or
>>       just
>>       >>>>   deal with
>>       >>>>   >       wait
>>       >>>>   >       >> fences in userspace?
>>       >>>>   >       >>
>>       >>>>   >       >
>>       >>>>   >       >My opinion is rework this but make the ordering 
>> via
>>       >>>>an engine
>>       >>>>   param
>>       >>>>   >       optional.
>>       >>>>   >       >
>>       >>>>   >       >e.g. A VM can be configured so all binds are 
>> ordered
>>       >>>>within the
>>       >>>>   VM
>>       >>>>   >       >
>>       >>>>   >       >e.g. A VM can be configured so all binds accept an
>>       engine
>>       >>>>   argument
>>       >>>>   >       (in
>>       >>>>   >       >the case of the i915 likely this is a gem context
>>       >>>>handle) and
>>       >>>>   binds
>>       >>>>   >       >ordered with respect to that engine.
>>       >>>>   >       >
>>       >>>>   >       >This gives UMDs options as the latter likely 
>> consumes
>>       >>>>more KMD
>>       >>>>   >       resources
>>       >>>>   >       >so if a different UMD can live with binds being
>>       >>>>ordered within
>>       >>>>   the VM
>>       >>>>   >       >they can use a mode consuming less resources.
>>       >>>>   >       >
>>       >>>>   >
>>       >>>>   >       I think we need to be careful here if we are 
>> looking
>>       for some
>>       >>>>   out of
>>       >>>>   >       (submission) order completion of vm_bind/unbind.
>>       >>>>   >       In-order completion means, in a batch of binds and
>>       >>>>unbinds to be
>>       >>>>   >       completed in-order, the user only needs to specify
>>       >>>>in-fence for the
>>       >>>>   >       first bind/unbind call and the out-fence for the 
>> last
>>       >>>>   bind/unbind
>>       >>>>   >       call. Also, the VA released by an unbind call 
>> can be
>>       >>>>re-used by
>>       >>>>   >       any subsequent bind call in that in-order batch.
>>       >>>>   >
>>       >>>>   >       These things will break if binding/unbinding 
>> were to
>>       >>>>be allowed
>>       >>>>   to
>>       >>>>   >       go out of order (of submission) and the user needs to be
>>       extra
>>       >>>>   careful
>>       >>>>   >       not to run into premature triggering of 
>> out-fence and
>>       bind
>>       >>>>   failing
>>       >>>>   >       as VA is still in use etc.
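
[To make the in-order semantics described above concrete, a batch under
that model needs fences only at its endpoints. A minimal sketch, assuming
hypothetical userspace wrappers over the proposed ioctls; the signatures
are made up for illustration and are not a landed uapi:

    #include <linux/types.h>

    /* Hypothetical wrappers; in_fence = -1 / out_fence = NULL means none. */
    int vm_bind(int fd, __u32 vm_id, __u32 queue, __u64 start, __u64 length,
                __u32 handle, int in_fence, int *out_fence);
    int vm_unbind(int fd, __u32 vm_id, __u32 queue, __u64 start, __u64 length,
                  int in_fence, int *out_fence);

    void batch_rebind(int fd, __u32 vm, __u32 bo, int in_fence, int *out_fence)
    {
            /* One queue, in-order completion: only the first op takes the
             * in-fence and only the last op takes an out-fence; when the
             * out-fence signals, all earlier ops are complete too. */
            vm_unbind(fd, vm, 0, 0x100000, 0x10000, in_fence, NULL);
            /* The VA released above can be re-used within the batch. */
            vm_bind(fd, vm, 0, 0x100000, 0x10000, bo, -1, NULL);
            vm_bind(fd, vm, 0, 0x200000, 0x10000, bo, -1, out_fence);
    }
]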
>>       >>>>   >
>>       >>>>   >       Also, VM_BIND binds the provided mapping on the
>>       specified
>>       >>>>   address
>>       >>>>   >       space
>>       >>>>   >       (VM). So, the uapi is not engine/context specific.
>>       >>>>   >
>>       >>>>   >       We can however add a 'queue' to the uapi which 
>> can be
>>       >>>>one from
>>       >>>>   the
>>       >>>>   >       pre-defined queues,
>>       >>>>   >       I915_VM_BIND_QUEUE_0
>>       >>>>   >       I915_VM_BIND_QUEUE_1
>>       >>>>   >       ...
>>       >>>>   >       I915_VM_BIND_QUEUE_(N-1)
>>       >>>>   >
>>       >>>>   >       KMD will spawn an async work queue for each 
>> queue which
>>       will
>>       >>>>   only
>>       >>>>   >       bind the mappings on that queue in the order of
>>       submission.
>>       >>>>   >       User can assign a queue per engine or anything
>>       >>>>like that.
>>       >>>>   >
>>       >>>>   >       But again here, the user needs to be careful not to
>>       >>>>deadlock these
>>       >>>>   >       queues with circular dependency of fences.
>>       >>>>   >
>>       >>>>   >       I prefer adding this later as an extension based on
>>       >>>>whether it
>>       >>>>   >       is really helping with the implementation.
>>       >>>>   >
>>       >>>>   >     I can tell you right now that having everything on a
>>       single
>>       >>>>   in-order
>>       >>>>   >     queue will not get us the perf we want.  What vulkan
>>       >>>>really wants
>>       >>>>   is one
>>       >>>>   >     of two things:
>>       >>>>   >      1. No implicit ordering of VM_BIND ops.  They just
>>       happen in
>>       >>>>   whatever order
>>       >>>>   >     their dependencies are resolved and we ensure 
>> ordering
>>       >>>>ourselves
>>       >>>>   by
>>       >>>>   >     having a syncobj in the VkQueue.
>>       >>>>   >      2. The ability to create multiple VM_BIND 
>> queues.  We
>>       need at
>>       >>>>   least 2
>>       >>>>   >     but I don't see why there needs to be a limit besides
>>       >>>>the limits
>>       >>>>   the
>>       >>>>   >     i915 API already has on the number of engines.  
>> Vulkan
>>       could
>>       >>>>   expose
>>       >>>>   >     multiple sparse binding queues to the client if 
>> it's not
>>       >>>>   arbitrarily
>>       >>>>   >     limited.
>>       >>>>
>>       >>>>   Thanks Jason, Lionel.
>>       >>>>
>>       >>>>   Jason, what are you referring to when you say "limits 
>> the i915
>>       API
>>       >>>>   already
>>       >>>>   has on the number of engines"? I am not sure if there is 
>> such
>>       an uapi
>>       >>>>   today.
>>       >>>>
>>       >>>> There's a limit of something like 64 total engines today 
>> based on
>>       the
>>       >>>> number of bits we can cram into the exec flags in 
>> execbuffer2.  I
>>       think
>>       >>>> someone had an extended version that allowed more but I 
>> ripped it
>>       out
>>       >>>> because no one was using it.  Of course, execbuffer3 might 
>> not
>>       >>>>have that
>>       >>>> problem at all.
>>       >>>>
>>       >>>
>>       >>>Thanks Jason.
>>       >>>Ok, I am not sure which exec flag is that, but yah, execbuffer3
>>       probably
>>       >>>will not have this limitation. So, we need to define a
>>       VM_BIND_MAX_QUEUE
>>       >>>and somehow export it to the user (I am thinking of embedding it in
>>       >>>I915_PARAM_HAS_VM_BIND. bits[0]->HAS_VM_BIND, bits[1-3]->'n'
>>       meaning 2^n
>>       >>>queues).
>>       >>
>>       >>Ah, I think you are talking about I915_EXEC_RING_MASK (0x3f) 
>> which
>>       execbuf3
>>
>>     Yup!  That's exactly the limit I was talking about.
>>
>>       >>will also have. So, we can simply define in vm_bind/unbind
>>       structures,
>>       >>
>>       >>#define I915_VM_BIND_MAX_QUEUE   64
>>       >>        __u32 queue;
>>       >>
>>       >>I think that will keep things simple.
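
[For illustration, the shape being floated in this sub-thread could look
roughly like the following. The GETPARAM bit layout and every field shown
are only the proposal under discussion here, not a landed interface:

    #include <linux/types.h>

    /* Proposed (hypothetical) encoding of I915_PARAM_HAS_VM_BIND:
     * bit[0] = VM_BIND supported, bits[1:3] = n, giving 2^n bind queues. */
    static unsigned int vm_bind_num_queues(int getparam_value)
    {
            if (!(getparam_value & 1))
                    return 0;                       /* no VM_BIND support */
            return 1u << ((getparam_value >> 1) & 0x7);  /* n = 3 -> 8 queues */
    }

    #define I915_VM_BIND_MAX_QUEUE 64

    struct drm_i915_gem_vm_bind {   /* field set is illustrative only */
            __u32 vm_id;            /* VM to bind into */
            __u32 queue;            /* must be < I915_VM_BIND_MAX_QUEUE */
            __u32 handle;           /* GEM object to map */
            __u32 pad;              /* keep 64-bit fields aligned */
            __u64 start;            /* GPU virtual address */
            __u64 offset;           /* offset into the object */
            __u64 length;           /* size of the mapping */
            __u64 flags;
            __u64 extensions;       /* in/out fences would hang off here */
    };
]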
>>       >
>>       >Hmmm? What does the execbuf2 limit have to do with how many engines
>>       >the hardware can have? I suggest not to do that.
>>       >
>>       >The change which added this:
>>       >
>>       >       if (set.num_engines > I915_EXEC_RING_MASK + 1)
>>       >               return -EINVAL;
>>       >
>>       >to context creation needs to be undone, so let users create 
>> engine
>>       >maps with all hardware engines, and let execbuf3 access them all.
>>       >
>>
>>       The earlier plan was to carry I915_EXEC_RING_MASK (0x3f) to 
>> execbuff3 also.
>>       Hence, I was using the same limit for VM_BIND queues (64, or 65 
>> if we
>>       make it N+1).
>>       But, as discussed in another thread of this RFC series, we are 
>> planning
>>       to drop I915_EXEC_RING_MASK in execbuff3. So, there won't be
>>       any uapi that limits the number of engines (and hence the number
>>       of vm_bind queues that need to be supported).
>>
>>       If we leave the number of vm_bind queues to be arbitrarily large
>>       (__u32 queue_idx), then we need a hashmap for queue (a wq,
>>       work_item and a linked list) lookup from the user-specified queue
>>       index.
>>       The other option is to put some hard limit (say 64 or 65) and use
>>       an array of queues in the VM (each created upon first use). I 
>> prefer this.
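
[A minimal kernel-side sketch of that preferred option, an array of
per-VM queues each created on first use. All names here are hypothetical
and only illustrate the bookkeeping, not the actual i915 implementation:

    #include <linux/err.h>
    #include <linux/mutex.h>
    #include <linux/slab.h>
    #include <linux/types.h>
    #include <linux/workqueue.h>

    #define VM_BIND_MAX_QUEUE 64

    struct vm_bind_queue {
            /* ordered workqueue: binds on one queue complete in
             * submission order */
            struct workqueue_struct *wq;
    };

    struct vm_bind_queues {
            struct mutex lock;
            struct vm_bind_queue *q[VM_BIND_MAX_QUEUE];
    };

    /* Look up a bind queue by the user-specified index, creating it on
     * first use so that unused queues cost nothing. */
    static struct vm_bind_queue *
    vm_bind_queue_get(struct vm_bind_queues *vm, u32 idx)
    {
            struct vm_bind_queue *q;

            if (idx >= VM_BIND_MAX_QUEUE)
                    return ERR_PTR(-EINVAL);

            mutex_lock(&vm->lock);
            q = vm->q[idx];
            if (!q) {
                    q = kzalloc(sizeof(*q), GFP_KERNEL);
                    if (q) {
                            q->wq = alloc_ordered_workqueue("vm-bind-%u",
                                                            0, idx);
                            if (!q->wq) {
                                    kfree(q);
                                    q = NULL;
                            } else {
                                    vm->q[idx] = q;
                            }
                    }
            }
            mutex_unlock(&vm->lock);

            return q ?: ERR_PTR(-ENOMEM);
    }
]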
>>
>>     I don't get why a VM_BIND queue is any different from any other 
>> queue or
>>     userspace-visible kernel object.  But I'll leave those details up to
>>     danvet or whoever else might be reviewing the implementation.
>>     --Jason
>>
>>   I kind of agree here. Wouldn't it be simpler to have the bind queue 
>> created
>>   like the others when we build the engine map?
>>
>>   For userspace it's then just a matter of selecting the right queue ID 
>> when
>>   submitting.
>>
>>   If there is ever a possibility to have this work on the GPU, it 
>> would be
>>   all ready.
>>
>
> I did sync offline with Matt Brost on this.
> We can add a VM_BIND engine class and let the user create VM_BIND engines 
> (queues).
> The problem is, in i915 the engine creation interface is bound to 
> gem_context.
> So, in the vm_bind ioctl, we would need both context_id and queue_idx for 
> proper
> lookup of the user created engine. This is a bit awkward as vm_bind is an
> interface to the VM (address space) and has nothing to do with gem_context.


A gem_context has a single vm object right?

Set through I915_CONTEXT_PARAM_VM at creation or given a default one if not.

So it's just like picking up the vm like it's done at execbuffer time 
right now: eb->context->vm


> Another problem is, if two VMs are binding with the same defined engine,
> binding on VM1 can get unnecessarily blocked by a binding on VM2 (which 
> may be
> waiting on its in_fence).


Maybe I'm missing something, but how can you have 2 vm objects with a 
single gem_context right now?


>
> So, my preference here is to just add a 'u32 queue' index in 
> vm_bind/unbind
> ioctl, and the queues are per VM.
>
> Niranjana
>
>>   Thanks,
>>
>>   -Lionel
>>
>>
>>       Niranjana
>>
>>       >Regards,
>>       >
>>       >Tvrtko
>>       >
>>       >>
>>       >>Niranjana
>>       >>
>>       >>>
>>       >>>>   I am trying to see how many queues we need and don't 
>> want it to
>>       be
>>       >>>>   arbitrarily
>>       >>>>   large and unduly blow up memory usage and complexity in 
>> i915
>>       driver.
>>       >>>>
>>       >>>> I expect a Vulkan driver to use at most 2 in the vast 
>> majority
>>       >>>>of cases. I
>>       >>>> could imagine a client wanting to create more than 1 sparse
>>       >>>>queue in which
>>       >>>> case, it'll be N+1 but that's unlikely. As far as complexity
>>       >>>>goes, once
>>       >>>> you allow two, I don't think the complexity is going up by
>>       >>>>allowing N.  As
>>       >>>> for memory usage, creating more queues means more memory.  
>> That's
>>       a
>>       >>>> trade-off that userspace can make. Again, the expected number
>>       >>>>here is 1
>>       >>>> or 2 in the vast majority of cases so I don't think you 
>> need to
>>       worry.
>>       >>>
>>       >>>Ok, will start with n=3 meaning 8 queues.
>>       >>>That would require us to create 8 workqueues.
>>       >>>We can change 'n' later if required.
>>       >>>
>>       >>>Niranjana
>>       >>>
>>       >>>>
>>       >>>>   >     Why?  Because Vulkan has two basic kinds of bind
>>       >>>>operations and we
>>       >>>>   don't
>>       >>>>   >     want any dependencies between them:
>>       >>>>   >      1. Immediate.  These happen right after BO 
>> creation or
>>       >>>>maybe as
>>       >>>>   part of
>>       >>>>   >     vkBindImageMemory() or VkBindBufferMemory().  These
>>       >>>>don't happen
>>       >>>>   on a
>>       >>>>   >     queue and we don't want them serialized with 
>> anything.       To
>>       >>>>   synchronize
>>       >>>>   >     with submit, we'll have a syncobj in the VkDevice 
>> which
>>       is
>>       >>>>   signaled by
>>       >>>>   >     all immediate bind operations and make submits 
>> wait on
>>       it.
>>       >>>>   >      2. Queued (sparse): These happen on a VkQueue 
>> which may
>>       be the
>>       >>>>   same as
>>       >>>>   >     a render/compute queue or may be its own queue.  
>> It's up
>>       to us
>>       >>>>   what we
>>       >>>>   >     want to advertise.  From the Vulkan API PoV, this 
>> is like
>>       any
>>       >>>>   other
>>       >>>>   >     queue.  Operations on it wait on and signal 
>> semaphores.       If we
>>       >>>>   have a
>>       >>>>   >     VM_BIND engine, we'd provide syncobjs to wait and
>>       >>>>signal just like
>>       >>>>   we do
>>       >>>>   >     in execbuf().
>>       >>>>   >     The important thing is that we don't want one type of
>>       >>>>operation to
>>       >>>>   block
>>       >>>>   >     on the other.  If immediate binds are blocking on 
>> sparse
>>       binds,
>>       >>>>   it's
>>       >>>>   >     going to cause over-synchronization issues.
>>       >>>>   >     In terms of the internal implementation, I know that
>>       >>>>there's going
>>       >>>>   to be
>>       >>>>   >     a lock on the VM and that we can't actually do these
>>       things in
>>       >>>>   >     parallel.  That's fine.  Once the dma_fences have
>>       signaled and
>>       >>>>   we're
>>       >>>>
>>       >>>>   That's correct. It is like a single VM_BIND engine with
>>       >>>>multiple queues
>>       >>>>   feeding to it.
>>       >>>>
>>       >>>> Right.  As long as the queues themselves are independent and
>>       >>>>can block on
>>       >>>> dma_fences without holding up other queues, I think we're 
>> fine.
>>       >>>>
>>       >>>>   >     unblocked to do the bind operation, I don't care if
>>       >>>>there's a bit
>>       >>>>   of
>>       >>>>   >     synchronization due to locking.  That's expected.  
>> What
>>       >>>>we can't
>>       >>>>   afford
>>       >>>>   >     to have is an immediate bind operation suddenly 
>> blocking
>>       on a
>>       >>>>   sparse
>>       >>>>   >     operation which is blocked on a compute job that's 
>> going
>>       to run
>>       >>>>   for
>>       >>>>   >     another 5ms.
>>       >>>>
>>       >>>>   As the VM_BIND queue is per VM, VM_BIND on one VM 
>> doesn't block
>>       the
>>       >>>>   VM_BIND
>>       >>>>   on other VMs. I am not sure about usecases here, but just
>>       wanted to
>>       >>>>   clarify.
>>       >>>>
>>       >>>> Yes, that's what I would expect.
>>       >>>> --Jason
>>       >>>>
>>       >>>>   Niranjana
>>       >>>>
>>       >>>>   >     For reference, Windows solves this by allowing
>>       arbitrarily many
>>       >>>>   paging
>>       >>>>   >     queues (what they call a VM_BIND engine/queue).  That
>>       >>>>design works
>>       >>>>   >     pretty well and solves the problems in question. 
>>       >>>>Again, we could
>>       >>>>   just
>>       >>>>   >     make everything out-of-order and require using 
>> syncobjs
>>       >>>>to order
>>       >>>>   things
>>       >>>>   >     as userspace wants. That'd be fine too.
>>       >>>>   >     One more note while I'm here: danvet said 
>> something on
>>       >>>>IRC about
>>       >>>>   VM_BIND
>>       >>>>   >     queues waiting for syncobjs to materialize.  We don't
>>       really
>>       >>>>   want/need
>>       >>>>   >     this.  We already have all the machinery in 
>> userspace to
>>       handle
>>       >>>>   >     wait-before-signal and waiting for syncobj fences to
>>       >>>>materialize
>>       >>>>   and
>>       >>>>   >     that machinery is on by default.  It would actually
>>       >>>>take MORE work
>>       >>>>   in
>>       >>>>   >     Mesa to turn it off and take advantage of the kernel
>>       >>>>being able to
>>       >>>>   wait
>>       >>>>   >     for syncobjs to materialize. Also, getting that 
>> right is
>>       >>>>   ridiculously
>>       >>>>   >     hard and I really don't want to get it wrong in 
>> kernel
>>       >>>>space.     When we
>>       >>>>   >     do memory fences, wait-before-signal will be a 
>> thing.  We
>>       don't
>>       >>>>   need to
>>       >>>>   >     try and make it a thing for syncobj.
>>       >>>>   >     --Jason
>>       >>>>   >
>>       >>>>   >   Thanks Jason,
>>       >>>>   >
>>       >>>>   >   I missed the bit in the Vulkan spec that we're 
>> allowed to
>>       have a
>>       >>>>   sparse
>>       >>>>   >   queue that does not implement either graphics or 
>> compute
>>       >>>>operations
>>       >>>>   :
>>       >>>>   >
>>       >>>>   >     "While some implementations may include
>>       >>>>   VK_QUEUE_SPARSE_BINDING_BIT
>>       >>>>   >     support in queue families that also include
>>       >>>>   >
>>       >>>>   >      graphics and compute support, other 
>> implementations may
>>       only
>>       >>>>   expose a
>>       >>>>   > VK_QUEUE_SPARSE_BINDING_BIT-only queue
>>       >>>>   >
>>       >>>>   >      family."
>>       >>>>   >
>>       >>>>   >   So it can all be a vm_bind engine that just does
>>       bind/unbind
>>       >>>>   >   operations.
>>       >>>>   >
>>       >>>>   >   But yes we need another engine for the 
>> immediate/non-sparse
>>       >>>>   operations.
>>       >>>>   >
>>       >>>>   >   -Lionel
>>       >>>>   >
>>       >>>>   >         >
>>       >>>>   >       Daniel, any thoughts?
>>       >>>>   >
>>       >>>>   >       Niranjana
>>       >>>>   >
>>       >>>>   >       >Matt
>>       >>>>   >       >
>>       >>>>   >       >>
>>       >>>>   >       >> Sorry I noticed this late.
>>       >>>>   >       >>
>>       >>>>   >       >>
>>       >>>>   >       >> -Lionel
>>       >>>>   >       >>
>>       >>>>   >       >>



^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [Intel-gfx] [RFC v3 1/3] drm/doc/rfc: VM_BIND feature design document
  2022-06-10  6:53                                 ` Lionel Landwerlin
@ 2022-06-10  7:54                                   ` Niranjana Vishwanathapura
  -1 siblings, 0 replies; 121+ messages in thread
From: Niranjana Vishwanathapura @ 2022-06-10  7:54 UTC (permalink / raw)
  To: Lionel Landwerlin
  Cc: Intel GFX, Maling list - DRI developers, Thomas Hellstrom,
	Chris Wilson, Jason Ekstrand, Daniel Vetter,
	Christian König

On Fri, Jun 10, 2022 at 09:53:24AM +0300, Lionel Landwerlin wrote:
>On 09/06/2022 22:31, Niranjana Vishwanathapura wrote:
>>On Thu, Jun 09, 2022 at 05:49:09PM +0300, Lionel Landwerlin wrote:
>>>  On 09/06/2022 00:55, Jason Ekstrand wrote:
>>>
>>>    On Wed, Jun 8, 2022 at 4:44 PM Niranjana Vishwanathapura
>>>    <niranjana.vishwanathapura@intel.com> wrote:
>>>
>>>      On Wed, Jun 08, 2022 at 08:33:25AM +0100, Tvrtko Ursulin wrote:
>>>      >
>>>      >
>>>      >On 07/06/2022 22:32, Niranjana Vishwanathapura wrote:
>>>      >>On Tue, Jun 07, 2022 at 11:18:11AM -0700, Niranjana 
>>>Vishwanathapura
>>>      wrote:
>>>      >>>On Tue, Jun 07, 2022 at 12:12:03PM -0500, Jason Ekstrand wrote:
>>>      >>>> On Fri, Jun 3, 2022 at 6:52 PM Niranjana Vishwanathapura
>>>      >>>> <niranjana.vishwanathapura@intel.com> wrote:
>>>      >>>>
>>>      >>>>   On Fri, Jun 03, 2022 at 10:20:25AM +0300, Lionel Landwerlin
>>>      wrote:
>>>      >>>>   >   On 02/06/2022 23:35, Jason Ekstrand wrote:
>>>      >>>>   >
>>>      >>>>   >     On Thu, Jun 2, 2022 at 3:11 PM Niranjana 
>>>Vishwanathapura
>>>      >>>>   > <niranjana.vishwanathapura@intel.com> wrote:
>>>      >>>>   >
>>>      >>>>   >       On Wed, Jun 01, 2022 at 01:28:36PM -0700, Matthew
>>>      >>>>Brost wrote:
>>>      >>>>   >       >On Wed, Jun 01, 2022 at 05:25:49PM +0300, Lionel
>>>      Landwerlin
>>>      >>>>   wrote:
>>>      >>>>   >       >> On 17/05/2022 21:32, Niranjana Vishwanathapura
>>>      wrote:
>>>      >>>>   >       >> > +VM_BIND/UNBIND ioctl will immediately start
>>>      >>>>   binding/unbinding
>>>      >>>>   >       the mapping in an
>>>      >>>>   >       >> > +async worker. The binding and unbinding will
>>>      >>>>work like a
>>>      >>>>   special
>>>      >>>>   >       GPU engine.
>>>      >>>>   >       >> > +The binding and unbinding operations are
>>>      serialized and
>>>      >>>>   will
>>>      >>>>   >       wait on specified
>>>      >>>>   >       >> > +input fences before the operation and 
>>>will signal
>>>      the
>>>      >>>>   output
>>>      >>>>   >       fences upon the
>>>      >>>>   >       >> > +completion of the operation. Due to
>>>      serialization,
>>>      >>>>   completion of
>>>      >>>>   >       an operation
>>>      >>>>   >       >> > +will also indicate that all previous 
>>>operations
>>>      >>>>are also
>>>      >>>>   >       complete.
>>>      >>>>   >       >>
>>>      >>>>   >       >> I guess we should avoid saying "will immediately
>>>      start
>>>      >>>>   >       binding/unbinding" if
>>>      >>>>   >       >> there are fences involved.
>>>      >>>>   >       >>
>>>      >>>>   >       >> And the fact that it's happening in an async
>>>      >>>>worker seem to
>>>      >>>>   imply
>>>      >>>>   >       it's not
>>>      >>>>   >       >> immediate.
>>>      >>>>   >       >>
>>>      >>>>   >
>>>      >>>>   >       Ok, will fix.
>>>      >>>>   >       This was added because in earlier design 
>>>binding was
>>>      deferred
>>>      >>>>   until
>>>      >>>>   >       next execbuff.
>>>      >>>>   >       But now it is non-deferred (immediate in that 
>>>sense).
>>>      >>>>But yah,
>>>      >>>>   this is
>>>      >>>>   >       confusing
>>>      >>>>   >       and will fix it.
>>>      >>>>   >
>>>      >>>>   >       >>
>>>      >>>>   >       >> I have a question on the behavior of the bind
>>>      >>>>operation when
>>>      >>>>   no
>>>      >>>>   >       input fence
>>>      >>>>   >       >> is provided. Let say I do :
>>>      >>>>   >       >>
>>>      >>>>   >       >> VM_BIND (out_fence=fence1)
>>>      >>>>   >       >>
>>>      >>>>   >       >> VM_BIND (out_fence=fence2)
>>>      >>>>   >       >>
>>>      >>>>   >       >> VM_BIND (out_fence=fence3)
>>>      >>>>   >       >>
>>>      >>>>   >       >>
>>>      >>>>   >       >> In what order are the fences going to be 
>>>signaled?
>>>      >>>>   >       >>
>>>      >>>>   >       >> In the order of VM_BIND ioctls? Or out of order?
>>>      >>>>   >       >>
>>>      >>>>   >       >> Because you wrote "serialized" I assume it's: in
>>>      order
>>>      >>>>   >       >>
>>>      >>>>   >
>>>      >>>>   >       Yes, in the order of VM_BIND/UNBIND ioctls. 
>>>Note that
>>>      >>>>bind and
>>>      >>>>   unbind
>>>      >>>>   >       will use
>>>      >>>>   >       the same queue and hence are ordered.
>>>      >>>>   >
>>>      >>>>   >       >>
>>>      >>>>   >       >> One thing I didn't realize is that because 
>>>we only
>>>      get one
>>>      >>>>   >       "VM_BIND" engine,
>>>      >>>>   >       >> there is a disconnect from the Vulkan 
>>>specification.
>>>      >>>>   >       >>
>>>      >>>>   >       >> In Vulkan VM_BIND operations are serialized but
>>>      >>>>per engine.
>>>      >>>>   >       >>
>>>      >>>>   >       >> So you could have something like this :
>>>      >>>>   >       >>
>>>      >>>>   >       >> VM_BIND (engine=rcs0, in_fence=fence1,
>>>      out_fence=fence2)
>>>      >>>>   >       >>
>>>      >>>>   >       >> VM_BIND (engine=ccs0, in_fence=fence3,
>>>      out_fence=fence4)
>>>      >>>>   >       >>
>>>      >>>>   >       >>
>>>      >>>>   >       >> fence1 is not signaled
>>>      >>>>   >       >>
>>>      >>>>   >       >> fence3 is signaled
>>>      >>>>   >       >>
>>>      >>>>   >       >> So the second VM_BIND will proceed before the
>>>      >>>>first VM_BIND.
>>>      >>>>   >       >>
>>>      >>>>   >       >>
>>>      >>>>   >       >> I guess we can deal with that scenario in
>>>      >>>>userspace by doing
>>>      >>>>   the
>>>      >>>>   >       wait
>>>      >>>>   >       >> ourselves in one thread per engine.
>>>      >>>>   >       >>
>>>      >>>>   >       >> But then it makes the VM_BIND input fences 
>>>useless.
>>>      >>>>   >       >>
>>>      >>>>   >       >>
>>>      >>>>   >       >> Daniel: what do you think? Should we 
>>>rework this or
>>>      just
>>>      >>>>   deal with
>>>      >>>>   >       wait
>>>      >>>>   >       >> fences in userspace?
>>>      >>>>   >       >>
>>>      >>>>   >       >
>>>      >>>>   >       >My opinion is to rework this, but make the 
>>>ordering via
>>>      >>>>an engine
>>>      >>>>   param
>>>      >>>>   >       optional.
>>>      >>>>   >       >
>>>      >>>>   >       >e.g. A VM can be configured so all binds are 
>>>ordered
>>>      >>>>within the
>>>      >>>>   VM
>>>      >>>>   >       >
>>>      >>>>   >       >e.g. A VM can be configured so all binds accept an
>>>      engine
>>>      >>>>   argument
>>>      >>>>   >       (in
>>>      >>>>   >       >the case of the i915 likely this is a gem context
>>>      >>>>handle) and
>>>      >>>>   binds
>>>      >>>>   >       >ordered with respect to that engine.
>>>      >>>>   >       >
>>>      >>>>   >       >This gives UMDs options as the latter likely 
>>>consumes
>>>      >>>>more KMD
>>>      >>>>   >       resources
>>>      >>>>   >       >so if a different UMD can live with binds being
>>>      >>>>ordered within
>>>      >>>>   the VM
>>>      >>>>   >       >they can use a mode consuming less resources.
>>>      >>>>   >       >
>>>      >>>>   >
>>>      >>>>   >       I think we need to be careful here if we are 
>>>looking
>>>      for some
>>>      >>>>   out of
>>>      >>>>   >       (submission) order completion of vm_bind/unbind.
>>>      >>>>   >       In-order completion means, in a batch of binds and
>>>      >>>>unbinds to be
>>>      >>>>   >       completed in-order, the user only needs to specify
>>>      >>>>in-fence for the
>>>      >>>>   >       first bind/unbind call and the out-fence for 
>>>the last
>>>      >>>>   bind/unbind
>>>      >>>>   >       call. Also, the VA released by an unbind call 
>>>can be
>>>      >>>>re-used by
>>>      >>>>   >       any subsequent bind call in that in-order batch.
>>>      >>>>   >
>>>      >>>>   >       These things will break if binding/unbinding 
>>>were to
>>>      >>>>be allowed
>>>      >>>>   to
>>>      >>>>   >       go out of order (of submission) and the user needs to be
>>>      extra
>>>      >>>>   careful
>>>      >>>>   >       not to run into premature triggering of 
>>>out-fence and
>>>      bind
>>>      >>>>   failing
>>>      >>>>   >       as VA is still in use etc.
>>>      >>>>   >
>>>      >>>>   >       Also, VM_BIND binds the provided mapping on the
>>>      specified
>>>      >>>>   address
>>>      >>>>   >       space
>>>      >>>>   >       (VM). So, the uapi is not engine/context specific.
>>>      >>>>   >
>>>      >>>>   >       We can however add a 'queue' to the uapi 
>>>which can be
>>>      >>>>one from
>>>      >>>>   the
>>>      >>>>   >       pre-defined queues,
>>>      >>>>   >       I915_VM_BIND_QUEUE_0
>>>      >>>>   >       I915_VM_BIND_QUEUE_1
>>>      >>>>   >       ...
>>>      >>>>   >       I915_VM_BIND_QUEUE_(N-1)
>>>      >>>>   >
>>>      >>>>   >       KMD will spawn an async work queue for each 
>>>queue which
>>>      will
>>>      >>>>   only
>>>      >>>>   >       bind the mappings on that queue in the order of
>>>      submission.
>>>      >>>>   >       User can assign a queue per engine or anything
>>>      >>>>like that.
>>>      >>>>   >
>>>      >>>>   >       But again here, the user needs to be careful not to
>>>      >>>>deadlock these
>>>      >>>>   >       queues with circular dependency of fences.
>>>      >>>>   >
>>>      >>>>   >       I prefer adding this later as an extension based on
>>>      >>>>whether it
>>>      >>>>   >       is really helping with the implementation.
>>>      >>>>   >
>>>      >>>>   >     I can tell you right now that having everything on a
>>>      single
>>>      >>>>   in-order
>>>      >>>>   >     queue will not get us the perf we want.  What vulkan
>>>      >>>>really wants
>>>      >>>>   is one
>>>      >>>>   >     of two things:
>>>      >>>>   >      1. No implicit ordering of VM_BIND ops.  They just
>>>      happen in
>>>      >>>>   whatever order
>>>      >>>>   >     their dependencies are resolved and we ensure 
>>>ordering
>>>      >>>>ourselves
>>>      >>>>   by
>>>      >>>>   >     having a syncobj in the VkQueue.
>>>      >>>>   >      2. The ability to create multiple VM_BIND 
>>>queues.  We
>>>      need at
>>>      >>>>   least 2
>>>      >>>>   >     but I don't see why there needs to be a limit besides
>>>      >>>>the limits
>>>      >>>>   the
>>>      >>>>   >     i915 API already has on the number of engines.  
>>>Vulkan
>>>      could
>>>      >>>>   expose
>>>      >>>>   >     multiple sparse binding queues to the client if 
>>>it's not
>>>      >>>>   arbitrarily
>>>      >>>>   >     limited.
>>>      >>>>
>>>      >>>>   Thanks Jason, Lionel.
>>>      >>>>
>>>      >>>>   Jason, what are you referring to when you say "limits 
>>>the i915
>>>      API
>>>      >>>>   already
>>>      >>>>   has on the number of engines"? I am not sure if there 
>>>is such
>>>      an uapi
>>>      >>>>   today.
>>>      >>>>
>>>      >>>> There's a limit of something like 64 total engines 
>>>today based on
>>>      the
>>>      >>>> number of bits we can cram into the exec flags in 
>>>execbuffer2.  I
>>>      think
>>>      >>>> someone had an extended version that allowed more but I 
>>>ripped it
>>>      out
>>>      >>>> because no one was using it.  Of course, execbuffer3 
>>>might not
>>>      >>>>have that
>>>      >>>> problem at all.
>>>      >>>>
>>>      >>>
>>>      >>>Thanks Jason.
>>>      >>>Ok, I am not sure which exec flag is that, but yah, execbuffer3
>>>      probably
>>>      >>>will not have this limitation. So, we need to define a
>>>      VM_BIND_MAX_QUEUE
>>>      >>>and somehow export it to the user (I am thinking of embedding it in
>>>      >>>I915_PARAM_HAS_VM_BIND. bits[0]->HAS_VM_BIND, bits[1-3]->'n'
>>>      meaning 2^n
>>>      >>>queues).
>>>      >>
>>>      >>Ah, I think you are talking about I915_EXEC_RING_MASK 
>>>(0x3f) which
>>>      execbuf3
>>>
>>>    Yup!  That's exactly the limit I was talking about.
>>>
>>>      >>will also have. So, we can simply define in vm_bind/unbind
>>>      structures,
>>>      >>
>>>      >>#define I915_VM_BIND_MAX_QUEUE   64
>>>      >>        __u32 queue;
>>>      >>
>>>      >>I think that will keep things simple.
>>>      >
>>>      >Hmmm? What does the execbuf2 limit have to do with how many engines
>>>      >the hardware can have? I suggest not to do that.
>>>      >
>>>      >The change which added this:
>>>      >
>>>      >       if (set.num_engines > I915_EXEC_RING_MASK + 1)
>>>      >               return -EINVAL;
>>>      >
>>>      >to context creation needs to be undone, so let users 
>>>create engine
>>>      >maps with all hardware engines, and let execbuf3 access them all.
>>>      >
>>>
>>>      The earlier plan was to carry I915_EXEC_RING_MASK (0x3f) to 
>>>execbuff3 also.
>>>      Hence, I was using the same limit for VM_BIND queues (64, or 
>>>65 if we
>>>      make it N+1).
>>>      But, as discussed in another thread of this RFC series, we are 
>>>planning
>>>      to drop I915_EXEC_RING_MASK in execbuff3. So, there won't be
>>>      any uapi that limits the number of engines (and hence the number
>>>      of vm_bind queues that need to be supported).
>>>
>>>      If we leave the number of vm_bind queues to be arbitrarily large
>>>      (__u32 queue_idx), then we need a hashmap for queue (a wq,
>>>      work_item and a linked list) lookup from the user-specified queue
>>>      index.
>>>      The other option is to put some hard limit (say 64 or 65) and use
>>>      an array of queues in the VM (each created upon first use). I 
>>>prefer this.
>>>
>>>    I don't get why a VM_BIND queue is any different from any 
>>>other queue or
>>>    userspace-visible kernel object.  But I'll leave those details up to
>>>    danvet or whoever else might be reviewing the implementation.
>>>    --Jason
>>>
>>>  I kind of agree here. Wouldn't it be simpler to have the bind queue 
>>>created
>>>  like the others when we build the engine map?
>>>
>>>  For userspace it's then just a matter of selecting the right queue 
>>>ID when
>>>  submitting.
>>>
>>>  If there is ever a possibility to have this work on the GPU, it 
>>>would be
>>>  all ready.
>>>
>>
>>I did sync offline with Matt Brost on this.
>>We can add a VM_BIND engine class and let the user create VM_BIND 
>>engines (queues).
>>The problem is, in i915 the engine creation interface is bound to 
>>gem_context.
>>So, in the vm_bind ioctl, we would need both context_id and queue_idx 
>>for proper
>>lookup of the user created engine. This is a bit awkward as vm_bind is an
>>interface to the VM (address space) and has nothing to do with gem_context.
>
>
>A gem_context has a single vm object right?
>
>Set through I915_CONTEXT_PARAM_VM at creation or given a default one if not.
>
>So it's just like picking up the vm like it's done at execbuffer time 
>right now: eb->context->vm
>

Are you suggesting replacing 'vm_id' with 'context_id' in the VM_BIND/UNBIND
ioctl, and probably calling it CONTEXT_BIND/UNBIND, because the VM can be
obtained from the context?
I think the interface is clean as an interface to the VM. It is only that we
don't have a clean way to create a raw VM_BIND engine (not associated with
any context) with the i915 uapi.
Maybe we can add such an interface, but I don't think it is worth it
(we might as well just use a queue_idx in the VM_BIND/UNBIND ioctl, as I
mentioned above).
Does anyone have any thoughts?
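
[To make the comparison concrete, the two shapes being weighed read
roughly as follows; both structs are sketches with made-up names, not a
proposed final uapi:

    #include <linux/types.h>

    /* Option A (preferred above): address the VM directly; the queue
     * index names a bind queue owned by that VM. */
    struct vm_bind_by_vm {
            __u32 vm_id;            /* address space being modified */
            __u32 queue_idx;        /* per-VM bind queue */
            /* ... address range, object handle, fences ... */
    };

    /* Option B: go through a gem_context so a user-created VM_BIND
     * engine can be looked up, even though the operation itself only
     * touches the context's VM. */
    struct vm_bind_by_context {
            __u32 ctx_id;           /* gem_context owning the engine map */
            __u32 queue_idx;        /* VM_BIND-class engine in that map */
            /* ... address range, object handle, fences ... */
    };
]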

>
>>Another problem is, if two VMs are binding with the same defined engine,
>>binding on VM1 can get unnecessarily blocked by a binding on VM2 (which 
>>may be
>>waiting on its in_fence).
>
>
>Maybe I'm missing something, but how can you have 2 vm objects with a 
>single gem_context right now?
>

No, we don't have 2 VMs for a gem_context.
Say ctx1 has vm1 and ctx2 has vm2.
The first vm_bind call is for vm1 with q_idx 1 in the ctx1 engine map.
The second vm_bind call is for vm2 with q_idx 2 in the ctx2 engine map.
If those two queue indices point to the same underlying vm_bind engine,
then the second vm_bind call gets blocked until the first vm_bind call's
'in' fence is triggered and its bind completes.

With per-VM queues, this is not a problem, as two VMs will never end up
sharing the same queue.
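
[Spelled out as a call sequence, using a hypothetical vm_bind() wrapper
whose signature is made up for illustration:

    #include <linux/types.h>

    int vm_bind(int fd, __u32 vm_id, __u32 queue, __u64 start, __u64 length,
                __u32 handle, int in_fence, int *out_fence);

    void bind_two_vms(int fd, __u32 vm1, __u32 vm2, __u32 bo1, __u32 bo2,
                      int fence_a)
    {
            int out1, out2;

            /* Queue 0 of vm1 and queue 0 of vm2 are distinct objects, so
             * vm2's bind proceeds even while vm1's queue sits waiting on
             * the unsignaled fence_a; with a shared engine, the second
             * call would have queued up behind the first. */
            vm_bind(fd, vm1, 0, 0x100000, 0x10000, bo1, fence_a, &out1);
            vm_bind(fd, vm2, 0, 0x100000, 0x10000, bo2, -1, &out2);
    }
]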

BTW, I just posted an updated patch series.
https://www.spinics.net/lists/dri-devel/msg350483.html

Niranjana

>
>>
>>So, my preference here is to just add a 'u32 queue' index in 
>>vm_bind/unbind
>>ioctl, and the queues are per VM.
>>
>>Niranjana
>>
>>>  Thanks,
>>>
>>>  -Lionel
>>>
>>>
>>>      Niranjana
>>>
>>>      >Regards,
>>>      >
>>>      >Tvrtko
>>>      >
>>>      >>
>>>      >>Niranjana
>>>      >>
>>>      >>>
>>>      >>>>   I am trying to see how many queues we need and don't 
>>>want it to
>>>      be
>>>      >>>>   arbitrarily
>>>      >>>>   large and unduly blow up memory usage and complexity 
>>>in i915
>>>      driver.
>>>      >>>>
>>>      >>>> I expect a Vulkan driver to use at most 2 in the vast 
>>>majority
>>>      >>>>of cases. I
>>>      >>>> could imagine a client wanting to create more than 1 sparse
>>>      >>>>queue in which
>>>      >>>> case, it'll be N+1 but that's unlikely. As far as complexity
>>>      >>>>goes, once
>>>      >>>> you allow two, I don't think the complexity is going up by
>>>      >>>>allowing N.  As
>>>      >>>> for memory usage, creating more queues means more 
>>>memory.  That's
>>>      a
>>>      >>>> trade-off that userspace can make. Again, the expected number
>>>      >>>>here is 1
>>>      >>>> or 2 in the vast majority of cases so I don't think you 
>>>need to
>>>      worry.
>>>      >>>
>>>      >>>Ok, will start with n=3 meaning 8 queues.
>>>      >>>That would require us to create 8 workqueues.
>>>      >>>We can change 'n' later if required.
>>>      >>>
>>>      >>>Niranjana
>>>      >>>
>>>      >>>>
>>>      >>>>   >     Why?  Because Vulkan has two basic kinds of bind
>>>      >>>>operations and we
>>>      >>>>   don't
>>>      >>>>   >     want any dependencies between them:
>>>      >>>>   >      1. Immediate.  These happen right after BO 
>>>creation or
>>>      >>>>maybe as
>>>      >>>>   part of
>>>      >>>>   >     vkBindImageMemory() or VkBindBufferMemory().  These
>>>      >>>>don't happen
>>>      >>>>   on a
>>>      >>>>   >     queue and we don't want them serialized with 
>>>anything.       To
>>>      >>>>   synchronize
>>>      >>>>   >     with submit, we'll have a syncobj in the 
>>>VkDevice which
>>>      is
>>>      >>>>   signaled by
>>>      >>>>   >     all immediate bind operations and make submits 
>>>wait on
>>>      it.
>>>      >>>>   >      2. Queued (sparse): These happen on a VkQueue 
>>>which may
>>>      be the
>>>      >>>>   same as
>>>      >>>>   >     a render/compute queue or may be its own 
>>>queue.  It's up
>>>      to us
>>>      >>>>   what we
>>>      >>>>   >     want to advertise.  From the Vulkan API PoV, 
>>>this is like
>>>      any
>>>      >>>>   other
>>>      >>>>   >     queue.  Operations on it wait on and signal 
>>>semaphores.       If we
>>>      >>>>   have a
>>>      >>>>   >     VM_BIND engine, we'd provide syncobjs to wait and
>>>      >>>>signal just like
>>>      >>>>   we do
>>>      >>>>   >     in execbuf().
>>>      >>>>   >     The important thing is that we don't want one type of
>>>      >>>>operation to
>>>      >>>>   block
>>>      >>>>   >     on the other.  If immediate binds are blocking 
>>>on sparse
>>>      binds,
>>>      >>>>   it's
>>>      >>>>   >     going to cause over-synchronization issues.
>>>      >>>>   >     In terms of the internal implementation, I know that
>>>      >>>>there's going
>>>      >>>>   to be
>>>      >>>>   >     a lock on the VM and that we can't actually do these
>>>      things in
>>>      >>>>   >     parallel.  That's fine.  Once the dma_fences have
>>>      signaled and
>>>      >>>>   we're
>>>      >>>>
>>>      >>>>   That's correct. It is like a single VM_BIND engine with
>>>      >>>>multiple queues
>>>      >>>>   feeding to it.
>>>      >>>>
>>>      >>>> Right.  As long as the queues themselves are independent and
>>>      >>>>can block on
>>>      >>>> dma_fences without holding up other queues, I think 
>>>we're fine.
>>>      >>>>
>>>      >>>>   >     unblocked to do the bind operation, I don't care if
>>>      >>>>there's a bit
>>>      >>>>   of
>>>      >>>>   >     synchronization due to locking.  That's 
>>>expected.  What
>>>      >>>>we can't
>>>      >>>>   afford
>>>      >>>>   >     to have is an immediate bind operation suddenly 
>>>blocking
>>>      on a
>>>      >>>>   sparse
>>>      >>>>   >     operation which is blocked on a compute job 
>>>that's going
>>>      to run
>>>      >>>>   for
>>>      >>>>   >     another 5ms.
>>>      >>>>
>>>      >>>>   As the VM_BIND queue is per VM, VM_BIND on one VM 
>>>doesn't block
>>>      the
>>>      >>>>   VM_BIND
>>>      >>>>   on other VMs. I am not sure about usecases here, but just
>>>      wanted to
>>>      >>>>   clarify.
>>>      >>>>
>>>      >>>> Yes, that's what I would expect.
>>>      >>>> --Jason
>>>      >>>>
>>>      >>>>   Niranjana
>>>      >>>>
>>>      >>>>   >     For reference, Windows solves this by allowing
>>>      arbitrarily many
>>>      >>>>   paging
>>>      >>>>   >     queues (what they call a VM_BIND engine/queue).  That
>>>      >>>>design works
>>>      >>>>   >     pretty well and solves the problems in 
>>>question.  Again, we could
>>>      >>>>   just
>>>      >>>>   >     make everything out-of-order and require using 
>>>syncobjs
>>>      >>>>to order
>>>      >>>>   things
>>>      >>>>   >     as userspace wants. That'd be fine too.
>>>      >>>>   >     One more note while I'm here: danvet said 
>>>something on
>>>      >>>>IRC about
>>>      >>>>   VM_BIND
>>>      >>>>   >     queues waiting for syncobjs to materialize.  We don't
>>>      really
>>>      >>>>   want/need
>>>      >>>>   >     this.  We already have all the machinery in 
>>>userspace to
>>>      handle
>>>      >>>>   >     wait-before-signal and waiting for syncobj fences to
>>>      >>>>materialize
>>>      >>>>   and
>>>      >>>>   >     that machinery is on by default.  It would actually
>>>      >>>>take MORE work
>>>      >>>>   in
>>>      >>>>   >     Mesa to turn it off and take advantage of the kernel
>>>      >>>>being able to
>>>      >>>>   wait
>>>      >>>>   >     for syncobjs to materialize. Also, getting that 
>>>right is
>>>      >>>>   ridiculously
>>>      >>>>   >     hard and I really don't want to get it wrong in 
>>>kernel
>>>      >>>>space.     When we
>>>      >>>>   >     do memory fences, wait-before-signal will be a 
>>>thing.  We
>>>      don't
>>>      >>>>   need to
>>>      >>>>   >     try and make it a thing for syncobj.
>>>      >>>>   >     --Jason
>>>      >>>>   >
>>>      >>>>   >   Thanks Jason,
>>>      >>>>   >
>>>      >>>>   >   I missed the bit in the Vulkan spec that we're 
>>>allowed to
>>>      have a
>>>      >>>>   sparse
>>>      >>>>   >   queue that does not implement either graphics or 
>>>compute
>>>      >>>>operations
>>>      >>>>   :
>>>      >>>>   >
>>>      >>>>   >     "While some implementations may include
>>>      >>>>   VK_QUEUE_SPARSE_BINDING_BIT
>>>      >>>>   >     support in queue families that also include
>>>      >>>>   >
>>>      >>>>   >      graphics and compute support, other 
>>>implementations may
>>>      only
>>>      >>>>   expose a
>>>      >>>>   > VK_QUEUE_SPARSE_BINDING_BIT-only queue
>>>      >>>>   >
>>>      >>>>   >      family."
>>>      >>>>   >
>>>      >>>>   >   So it can all be a vm_bind engine that just does
>>>      bind/unbind
>>>      >>>>   >   operations.
>>>      >>>>   >
>>>      >>>>   >   But yes we need another engine for the 
>>>immediate/non-sparse
>>>      >>>>   operations.
>>>      >>>>   >
>>>      >>>>   >   -Lionel
>>>      >>>>   >
>>>      >>>>   >         >
>>>      >>>>   >       Daniel, any thoughts?
>>>      >>>>   >
>>>      >>>>   >       Niranjana
>>>      >>>>   >
>>>      >>>>   >       >Matt
>>>      >>>>   >       >
>>>      >>>>   >       >>
>>>      >>>>   >       >> Sorry I noticed this late.
>>>      >>>>   >       >>
>>>      >>>>   >       >>
>>>      >>>>   >       >> -Lionel
>>>      >>>>   >       >>
>>>      >>>>   >       >>
>
>

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [Intel-gfx] [RFC v3 1/3] drm/doc/rfc: VM_BIND feature design document
@ 2022-06-10  7:54                                   ` Niranjana Vishwanathapura
  0 siblings, 0 replies; 121+ messages in thread
From: Niranjana Vishwanathapura @ 2022-06-10  7:54 UTC (permalink / raw)
  To: Lionel Landwerlin
  Cc: Intel GFX, Maling list - DRI developers, Thomas Hellstrom,
	Chris Wilson, Daniel Vetter, Christian König

On Fri, Jun 10, 2022 at 09:53:24AM +0300, Lionel Landwerlin wrote:
>On 09/06/2022 22:31, Niranjana Vishwanathapura wrote:
>>On Thu, Jun 09, 2022 at 05:49:09PM +0300, Lionel Landwerlin wrote:
>>>  On 09/06/2022 00:55, Jason Ekstrand wrote:
>>>
>>>    On Wed, Jun 8, 2022 at 4:44 PM Niranjana Vishwanathapura
>>>    <niranjana.vishwanathapura@intel.com> wrote:
>>>
>>>      On Wed, Jun 08, 2022 at 08:33:25AM +0100, Tvrtko Ursulin wrote:
>>>      >
>>>      >
>>>      >On 07/06/2022 22:32, Niranjana Vishwanathapura wrote:
>>>      >>On Tue, Jun 07, 2022 at 11:18:11AM -0700, Niranjana 
>>>Vishwanathapura
>>>      wrote:
>>>      >>>On Tue, Jun 07, 2022 at 12:12:03PM -0500, Jason Ekstrand wrote:
>>>      >>>> On Fri, Jun 3, 2022 at 6:52 PM Niranjana Vishwanathapura
>>>      >>>> <niranjana.vishwanathapura@intel.com> wrote:
>>>      >>>>
>>>      >>>>   On Fri, Jun 03, 2022 at 10:20:25AM +0300, Lionel Landwerlin
>>>      wrote:
>>>      >>>>   >   On 02/06/2022 23:35, Jason Ekstrand wrote:
>>>      >>>>   >
>>>      >>>>   >     On Thu, Jun 2, 2022 at 3:11 PM Niranjana 
>>>Vishwanathapura
>>>      >>>>   > <niranjana.vishwanathapura@intel.com> wrote:
>>>      >>>>   >
>>>      >>>>   >       On Wed, Jun 01, 2022 at 01:28:36PM -0700, Matthew
>>>      >>>>Brost wrote:
>>>      >>>>   >       >On Wed, Jun 01, 2022 at 05:25:49PM +0300, Lionel
>>>      Landwerlin
>>>      >>>>   wrote:
>>>      >>>>   >       >> On 17/05/2022 21:32, Niranjana Vishwanathapura
>>>      wrote:
>>>      >>>>   >       >> > +VM_BIND/UNBIND ioctl will immediately start
>>>      >>>>   binding/unbinding
>>>      >>>>   >       the mapping in an
>>>      >>>>   >       >> > +async worker. The binding and unbinding will
>>>      >>>>work like a
>>>      >>>>   special
>>>      >>>>   >       GPU engine.
>>>      >>>>   >       >> > +The binding and unbinding operations are
>>>      serialized and
>>>      >>>>   will
>>>      >>>>   >       wait on specified
>>>      >>>>   >       >> > +input fences before the operation and 
>>>will signal
>>>      the
>>>      >>>>   output
>>>      >>>>   >       fences upon the
>>>      >>>>   >       >> > +completion of the operation. Due to
>>>      serialization,
>>>      >>>>   completion of
>>>      >>>>   >       an operation
>>>      >>>>   >       >> > +will also indicate that all previous 
>>>operations
>>>      >>>>are also
>>>      >>>>   >       complete.
>>>      >>>>   >       >>
>>>      >>>>   >       >> I guess we should avoid saying "will immediately
>>>      start
>>>      >>>>   >       binding/unbinding" if
>>>      >>>>   >       >> there are fences involved.
>>>      >>>>   >       >>
>>>      >>>>   >       >> And the fact that it's happening in an async
>>>      >>>>worker seem to
>>>      >>>>   imply
>>>      >>>>   >       it's not
>>>      >>>>   >       >> immediate.
>>>      >>>>   >       >>
>>>      >>>>   >
>>>      >>>>   >       Ok, will fix.
>>>      >>>>   >       This was added because in earlier design 
>>>binding was
>>>      deferred
>>>      >>>>   until
>>>      >>>>   >       next execbuff.
>>>      >>>>   >       But now it is non-deferred (immediate in that 
>>>sense).
>>>      >>>>But yah,
>>>      >>>>   this is
>>>      >>>>   >       confusing
>>>      >>>>   >       and will fix it.
>>>      >>>>   >
>>>      >>>>   >       >>
>>>      >>>>   >       >> I have a question on the behavior of the bind
>>>      >>>>operation when
>>>      >>>>   no
>>>      >>>>   >       input fence
>>>      >>>>   >       >> is provided. Let say I do :
>>>      >>>>   >       >>
>>>      >>>>   >       >> VM_BIND (out_fence=fence1)
>>>      >>>>   >       >>
>>>      >>>>   >       >> VM_BIND (out_fence=fence2)
>>>      >>>>   >       >>
>>>      >>>>   >       >> VM_BIND (out_fence=fence3)
>>>      >>>>   >       >>
>>>      >>>>   >       >>
>>>      >>>>   >       >> In what order are the fences going to be 
>>>signaled?
>>>      >>>>   >       >>
>>>      >>>>   >       >> In the order of VM_BIND ioctls? Or out of order?
>>>      >>>>   >       >>
>>>      >>>>   >       >> Because you wrote "serialized I assume it's : in
>>>      order
>>>      >>>>   >       >>
>>>      >>>>   >
>>>      >>>>   >       Yes, in the order of VM_BIND/UNBIND ioctls. 
>>>Note that
>>>      >>>>bind and
>>>      >>>>   unbind
>>>      >>>>   >       will use
>>>      >>>>   >       the same queue and hence are ordered.
>>>      >>>>   >
>>>      >>>>   >       >>
>>>      >>>>   >       >> One thing I didn't realize is that because 
>>>we only
>>>      get one
>>>      >>>>   >       "VM_BIND" engine,
>>>      >>>>   >       >> there is a disconnect from the Vulkan 
>>>specification.
>>>      >>>>   >       >>
>>>      >>>>   >       >> In Vulkan VM_BIND operations are serialized but
>>>      >>>>per engine.
>>>      >>>>   >       >>
>>>      >>>>   >       >> So you could have something like this :
>>>      >>>>   >       >>
>>>      >>>>   >       >> VM_BIND (engine=rcs0, in_fence=fence1,
>>>      out_fence=fence2)
>>>      >>>>   >       >>
>>>      >>>>   >       >> VM_BIND (engine=ccs0, in_fence=fence3,
>>>      out_fence=fence4)
>>>      >>>>   >       >>
>>>      >>>>   >       >>
>>>      >>>>   >       >> fence1 is not signaled
>>>      >>>>   >       >>
>>>      >>>>   >       >> fence3 is signaled
>>>      >>>>   >       >>
>>>      >>>>   >       >> So the second VM_BIND will proceed before the
>>>      >>>>first VM_BIND.
>>>      >>>>   >       >>
>>>      >>>>   >       >>
>>>      >>>>   >       >> I guess we can deal with that scenario in
>>>      >>>>userspace by doing
>>>      >>>>   the
>>>      >>>>   >       wait
>>>      >>>>   >       >> ourselves in one thread per engines.
>>>      >>>>   >       >>
>>>      >>>>   >       >> But then it makes the VM_BIND input fences 
>>>useless.
>>>      >>>>   >       >>
>>>      >>>>   >       >>
>>>      >>>>   >       >> Daniel : what do you think? Should be 
>>>rework this or
>>>      just
>>>      >>>>   deal with
>>>      >>>>   >       wait
>>>      >>>>   >       >> fences in userspace?
>>>      >>>>   >       >>
>>>      >>>>   >       >
>>>      >>>>   >       >My opinion is rework this but make the 
>>>ordering via
>>>      >>>>an engine
>>>      >>>>   param
>>>      >>>>   >       optional.
>>>      >>>>   >       >
>>>      >>>>   >       >e.g. A VM can be configured so all binds are 
>>>ordered
>>>      >>>>within the
>>>      >>>>   VM
>>>      >>>>   >       >
>>>      >>>>   >       >e.g. A VM can be configured so all binds accept an
>>>      engine
>>>      >>>>   argument
>>>      >>>>   >       (in
>>>      >>>>   >       >the case of the i915 likely this is a gem context
>>>      >>>>handle) and
>>>      >>>>   binds
>>>      >>>>   >       >ordered with respect to that engine.
>>>      >>>>   >       >
>>>      >>>>   >       >This gives UMDs options as the later likely 
>>>consumes
>>>      >>>>more KMD
>>>      >>>>   >       resources
>>>      >>>>   >       >so if a different UMD can live with binds being
>>>      >>>>ordered within
>>>      >>>>   the VM
>>>      >>>>   >       >they can use a mode consuming less resources.
>>>      >>>>   >       >
>>>      >>>>   >
>>>      >>>>   >       I think we need to be careful here if we are 
>>>looking
>>>      for some
>>>      >>>>   out of
>>>      >>>>   >       (submission) order completion of vm_bind/unbind.
>>>      >>>>   >       In-order completion means, in a batch of binds and
>>>      >>>>unbinds to be
>>>      >>>>   >       completed in-order, user only needs to specify
>>>      >>>>in-fence for the
>>>      >>>>   >       first bind/unbind call and the our-fence for 
>>>the last
>>>      >>>>   bind/unbind
>>>      >>>>   >       call. Also, the VA released by an unbind call 
>>>can be
>>>      >>>>re-used by
>>>      >>>>   >       any subsequent bind call in that in-order batch.
>>>      >>>>   >
>>>      >>>>   >       These things will break if binding/unbinding 
>>>were to
>>>      >>>>be allowed
>>>      >>>>   to
>>>      >>>>   >       go out of order (of submission) and user need to be
>>>      extra
>>>      >>>>   careful
>>>      >>>>   >       not to run into pre-mature triggereing of 
>>>out-fence and
>>>      bind
>>>      >>>>   failing
>>>      >>>>   >       as VA is still in use etc.
>>>      >>>>   >
>>>      >>>>   >       Also, VM_BIND binds the provided mapping on the
>>>      specified
>>>      >>>>   address
>>>      >>>>   >       space
>>>      >>>>   >       (VM). So, the uapi is not engine/context specific.
>>>      >>>>   >
>>>      >>>>   >       We can however add a 'queue' to the uapi 
>>>which can be
>>>      >>>>one from
>>>      >>>>   the
>>>      >>>>   >       pre-defined queues,
>>>      >>>>   >       I915_VM_BIND_QUEUE_0
>>>      >>>>   >       I915_VM_BIND_QUEUE_1
>>>      >>>>   >       ...
>>>      >>>>   >       I915_VM_BIND_QUEUE_(N-1)
>>>      >>>>   >
>>>      >>>>   >       KMD will spawn an async work queue for each 
>>>queue which
>>>      will
>>>      >>>>   only
>>>      >>>>   >       bind the mappings on that queue in the order of
>>>      submission.
>>>      >>>>   >       User can assign the queue to per engine or anything
>>>      >>>>like that.
>>>      >>>>   >
>>>      >>>>   >       But again here, user need to be careful and not
>>>      >>>>deadlock these
>>>      >>>>   >       queues with circular dependency of fences.
>>>      >>>>   >
>>>      >>>>   >       I prefer adding this later an as extension based on
>>>      >>>>whether it
>>>      >>>>   >       is really helping with the implementation.
>>>      >>>>   >
>>>      >>>>   >     I can tell you right now that having everything on a
>>>      single
>>>      >>>>   in-order
>>>      >>>>   >     queue will not get us the perf we want.  What vulkan
>>>      >>>>really wants
>>>      >>>>   is one
>>>      >>>>   >     of two things:
>>>      >>>>   >      1. No implicit ordering of VM_BIND ops.  They just
>>>      happen in
>>>      >>>>   whatever
>>>      >>>>   >     their dependencies are resolved and we ensure 
>>>ordering
>>>      >>>>ourselves
>>>      >>>>   by
>>>      >>>>   >     having a syncobj in the VkQueue.
>>>      >>>>   >      2. The ability to create multiple VM_BIND 
>>>queues.  We
>>>      need at
>>>      >>>>   least 2
>>>      >>>>   >     but I don't see why there needs to be a limit besides
>>>      >>>>the limits
>>>      >>>>   the
>>>      >>>>   >     i915 API already has on the number of engines.  
>>>Vulkan
>>>      could
>>>      >>>>   expose
>>>      >>>>   >     multiple sparse binding queues to the client if 
>>>it's not
>>>      >>>>   arbitrarily
>>>      >>>>   >     limited.
>>>      >>>>
>>>      >>>>   Thanks Jason, Lionel.
>>>      >>>>
>>>      >>>>   Jason, what are you referring to when you say "limits 
>>>the i915
>>>      API
>>>      >>>>   already
>>>      >>>>   has on the number of engines"? I am not sure if there 
>>>is such
>>>      an uapi
>>>      >>>>   today.
>>>      >>>>
>>>      >>>> There's a limit of something like 64 total engines 
>>>today based on
>>>      the
>>>      >>>> number of bits we can cram into the exec flags in 
>>>execbuffer2.  I
>>>      think
>>>      >>>> someone had an extended version that allowed more but I 
>>>ripped it
>>>      out
>>>      >>>> because no one was using it.  Of course, execbuffer3 
>>>might not
>>>      >>>>have that
>>>      >>>> problem at all.
>>>      >>>>
>>>      >>>
>>>      >>>Thanks Jason.
>>>      >>>Ok, I am not sure which exec flag is that, but yah, execbuffer3
>>>      probably
>>>      >>>will not have this limiation. So, we need to define a
>>>      VM_BIND_MAX_QUEUE
>>>      >>>and somehow export it to user (I am thinking of embedding it in
>>>      >>>I915_PARAM_HAS_VM_BIND. bits[0]->HAS_VM_BIND, bits[1-3]->'n'
>>>      meaning 2^n
>>>      >>>queues.
>>>      >>
>>>      >>Ah, I think you are waking about I915_EXEC_RING_MASK 
>>>(0x3f) which
>>>      execbuf3
>>>
>>>    Yup!  That's exactly the limit I was talking about.
>>>
>>>      >>will also have. So, we can simply define in vm_bind/unbind
>>>      structures,
>>>      >>
>>>      >>#define I915_VM_BIND_MAX_QUEUE   64
>>>      >>        __u32 queue;
>>>      >>
>>>      >>I think that will keep things simple.
>>>      >
>>>      >Hmmm? What does execbuf2 limit has to do with how many engines
>>>      >hardware can have? I suggest not to do that.
>>>      >
>>>      >Change with added this:
>>>      >
>>>      >       if (set.num_engines > I915_EXEC_RING_MASK + 1)
>>>      >               return -EINVAL;
>>>      >
>>>      >To context creation needs to be undone and so let users 
>>>create engine
>>>      >maps with all hardware engines, and let execbuf3 access them all.
>>>      >
>>>
>>>      Earlier plan was to carry I915_EXEC_RING_MAP (0x3f) to 
>>>execbuff3 also.
>>>      Hence, I was using the same limit for VM_BIND queues (64, or 
>>>65 if we
>>>      make it N+1).
>>>      But, as discussed in other thread of this RFC series, we are 
>>>planning
>>>      to drop this I915_EXEC_RING_MAP in execbuff3. So, there won't be
>>>      any uapi that limits the number of engines (and hence the vm_bind
>>>      queues
>>>      need to be supported).
>>>
>>>      If we leave the number of vm_bind queues to be arbitrarily large
>>>      (__u32 queue_idx) then, we need to have a hashmap for queue (a wq,
>>>      work_item and a linked list) lookup from the user specified queue
>>>      index.
>>>      Other option is to just put some hard limit (say 64 or 65) and use
>>>      an array of queues in VM (each created upon first use). I 
>>>prefer this.
>>>
>>>    I don't get why a VM_BIND queue is any different from any 
>>>other queue or
>>>    userspace-visible kernel object.  But I'll leave those details up to
>>>    danvet or whoever else might be reviewing the implementation.
>>>    --Jason
>>>
>>>  I kind of agree here. Wouldn't be simpler to have the bind queue 
>>>created
>>>  like the others when we build the engine map?
>>>
>>>  For userspace it's then just matter of selecting the right queue 
>>>ID when
>>>  submitting.
>>>
>>>  If there is ever a possibility to have this work on the GPU, it 
>>>would be
>>>  all ready.
>>>
>>
>>I did sync offline with Matt Brost on this.
>>We can add a VM_BIND engine class and let user create VM_BIND 
>>engines (queues).
>>The problem is, in i915 engine creating interface is bound to 
>>gem_context.
>>So, in vm_bind ioctl, we would need both context_id and queue_idx 
>>for proper
>>lookup of the user created engine. This is bit ackward as vm_bind is an
>>interface to VM (address space) and has nothing to do with gem_context.
>
>
>A gem_context has a single vm object right?
>
>Set through I915_CONTEXT_PARAM_VM at creation or given a default one if not.
>
>So it's just like picking up the vm like it's done at execbuffer time 
>right now : eb->context->vm
>

Are you suggesting replacing 'vm_id' with 'context_id' in the VM_BIND/UNBIND
ioctl and probably call it CONTEXT_BIND/UNBIND, because VM can be obtained
from the context?
I think the interface is clean as an interface to the VM. It is only that we
don't have a clean way to create a raw VM_BIND engine (not associated with
any context) with the i915 uapi.
Maybe we can add such an interface, but I don't think that is worth it
(we might as well just use a queue_idx in the VM_BIND/UNBIND ioctl as I
mentioned above).
Anyone has any thoughts?
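
To make the options concrete, the VM-centric variant argued for above would
look roughly like the sketch below. Only the 'queue' field is new relative
to the vm_bind struct in this RFC, the field list is abbreviated, and none
of this is final uapi:

        struct drm_i915_gem_vm_bind {
                __u32 vm_id;    /* the address space; no gem_context needed */
                __u32 queue;    /* per-VM bind queue index (proposed) */
                __u32 handle;   /* BO to map */
                __u64 start;    /* VA start of the mapping */
                __u64 offset;   /* offset into the BO */
                __u64 length;   /* length of the mapping */
                __u64 flags;
                /* in/out fences as described elsewhere in this series */
        };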

>
>>Another problem is, if two VMs are binding with the same defined engine,
>>>      >>binding on VM1 can get unnecessarily blocked by binding on VM2 (which 
>>may be
>>waiting on its in_fence).
>
>
>Maybe I'm missing something, but how can you have 2 vm objects with a 
>single gem_context right now?
>

No, we don't have 2 VMs for a gem_context.
Say ctx1 has vm1 and ctx2 has vm2.
First vm_bind call was for vm1 with q_idx 1 in ctx1 engine map.
Second vm_bind call was for vm2 with q_idx 2 in ctx2 engine map.
If those two queue indices point to the same underlying vm_bind engine,
then the second vm_bind call gets blocked until the first vm_bind call's
'in' fence is triggered and the bind completes.

With per-VM queues, this is not a problem, as two VMs will not end up
sharing the same queue.
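
To illustrate the per-VM option: each VM owns a small fixed array of bind
queues, created on first use, so a bind on vm1 can never queue behind a bind
on vm2. A rough kernel-side sketch, with hypothetical names and the locking
against concurrent first use omitted:

        #define I915_VM_BIND_MAX_QUEUE 64

        static struct i915_vm_bind_queue *
        vm_bind_get_queue(struct i915_address_space *vm, u32 queue)
        {
                if (queue >= I915_VM_BIND_MAX_QUEUE)
                        return ERR_PTR(-EINVAL);

                /* created upon first use, so unused queues cost nothing */
                if (!vm->bind_queues[queue])
                        vm->bind_queues[queue] = vm_bind_create_queue(vm, queue);

                return vm->bind_queues[queue];
        }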

BTW, I just posted an updated PATCH series.
https://www.spinics.net/lists/dri-devel/msg350483.html

Niranjana

>
>>
>>So, my preference here is to just add a 'u32 queue' index in 
>>vm_bind/unbind
>>ioctl, and the queues are per VM.
>>
>>Niranjana
>>
>>>  Thanks,
>>>
>>>  -Lionel
>>>
>>>
>>>      Niranjana
>>>
>>>      >Regards,
>>>      >
>>>      >Tvrtko
>>>      >
>>>      >>
>>>      >>Niranjana
>>>      >>
>>>      >>>
>>>      >>>>   I am trying to see how many queues we need and don't 
>>>want it to
>>>      be
>>>      >>>>   arbitrarily
>>>      >>>>   large and unduly blow up memory usage and complexity 
>>>in i915
>>>      driver.
>>>      >>>>
>>>      >>>> I expect a Vulkan driver to use at most 2 in the vast 
>>>majority
>>>      >>>>of cases. I
>>>      >>>> could imagine a client wanting to create more than 1 sparse
>>>      >>>>queue in which
>>>      >>>> case, it'll be N+1 but that's unlikely. As far as complexity
>>>      >>>>goes, once
>>>      >>>> you allow two, I don't think the complexity is going up by
>>>      >>>>allowing N.  As
>>>      >>>> for memory usage, creating more queues means more 
>>>memory.  That's
>>>      a
>>>      >>>> trade-off that userspace can make. Again, the expected number
>>>      >>>>here is 1
>>>      >>>> or 2 in the vast majority of cases so I don't think you 
>>>need to
>>>      worry.
>>>      >>>
>>>      >>>Ok, will start with n=3 meaning 8 queues.
>>>      >>>That would require us to create 8 workqueues.
>>>      >>>We can change 'n' later if required.
>>>      >>>
>>>      >>>Niranjana
>>>      >>>
>>>      >>>>
>>>      >>>>   >     Why?  Because Vulkan has two basic kind of bind
>>>      >>>>operations and we
>>>      >>>>   don't
>>>      >>>>   >     want any dependencies between them:
>>>      >>>>   >      1. Immediate.  These happen right after BO 
>>>creation or
>>>      >>>>maybe as
>>>      >>>>   part of
>>>      >>>>   >     vkBindImageMemory() or VkBindBufferMemory().  These
>>>      >>>>don't happen
>>>      >>>>   on a
>>>      >>>>   >     queue and we don't want them serialized with 
>>>anything.       To
>>>      >>>>   synchronize
>>>      >>>>   >     with submit, we'll have a syncobj in the 
>>>VkDevice which
>>>      is
>>>      >>>>   signaled by
>>>      >>>>   >     all immediate bind operations and make submits 
>>>wait on
>>>      it.
>>>      >>>>   >      2. Queued (sparse): These happen on a VkQueue 
>>>which may
>>>      be the
>>>      >>>>   same as
>>>      >>>>   >     a render/compute queue or may be its own 
>>>queue.  It's up
>>>      to us
>>>      >>>>   what we
>>>      >>>>   >     want to advertise.  From the Vulkan API PoV, 
>>>this is like
>>>      any
>>>      >>>>   other
>>>      >>>>   >     queue.  Operations on it wait on and signal 
>>>semaphores.       If we
>>>      >>>>   have a
>>>      >>>>   >     VM_BIND engine, we'd provide syncobjs to wait and
>>>      >>>>signal just like
>>>      >>>>   we do
>>>      >>>>   >     in execbuf().
>>>      >>>>   >     The important thing is that we don't want one type of
>>>      >>>>operation to
>>>      >>>>   block
>>>      >>>>   >     on the other.  If immediate binds are blocking 
>>>on sparse
>>>      binds,
>>>      >>>>   it's
>>>      >>>>   >     going to cause over-synchronization issues.
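
For illustration, the two kinds of bind could map onto this uapi as sketched
below; vm_bind() is a hypothetical userspace wrapper around the proposed
ioctl, taking a queue index and in/out syncobj handles:

        /* 1. Immediate bind, e.g. from vkBindBufferMemory(): no in-fence;
         *    the out-fence signals a device-wide syncobj that all later
         *    submits wait on. */
        vm_bind(fd, vm_id, IMMEDIATE_QUEUE, handle, va, len,
                /*in*/ 0, /*out*/ device_bind_syncobj);

        /* 2. Queued (sparse) bind from vkQueueBindSparse(): waits on and
         *    signals the queue's semaphores like any other queue op. */
        vm_bind(fd, vm_id, SPARSE_QUEUE, handle, va, len,
                /*in*/ wait_syncobj, /*out*/ signal_syncobj);
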
>>>      >>>>   >     In terms of the internal implementation, I know that
>>>      >>>>there's going
>>>      >>>>   to be
>>>      >>>>   >     a lock on the VM and that we can't actually do these
>>>      things in
>>>      >>>>   >     parallel.  That's fine.  Once the dma_fences have
>>>      signaled and
>>>      >>>>   we're
>>>      >>>>
>>>      >>>>   That's correct. It is like a single VM_BIND engine with
>>>      >>>>multiple queues
>>>      >>>>   feeding to it.
>>>      >>>>
>>>      >>>> Right.  As long as the queues themselves are independent and
>>>      >>>>can block on
>>>      >>>> dma_fences without holding up other queues, I think 
>>>we're fine.
>>>      >>>>
>>>      >>>>   >     unblocked to do the bind operation, I don't care if
>>>      >>>>there's a bit
>>>      >>>>   of
>>>      >>>>   >     synchronization due to locking.  That's 
>>>expected.  What
>>>      >>>>we can't
>>>      >>>>   afford
>>>      >>>>   >     to have is an immediate bind operation suddenly 
>>>blocking
>>>      on a
>>>      >>>>   sparse
>>>      >>>>   >     operation which is blocked on a compute job 
>>>that's going
>>>      to run
>>>      >>>>   for
>>>      >>>>   >     another 5ms.
>>>      >>>>
>>>      >>>>   As the VM_BIND queue is per VM, VM_BIND on one VM 
>>>doesn't block
>>>      the
>>>      >>>>   VM_BIND
>>>      >>>>   on other VMs. I am not sure about usecases here, but just
>>>      wanted to
>>>      >>>>   clarify.
>>>      >>>>
>>>      >>>> Yes, that's what I would expect.
>>>      >>>> --Jason
>>>      >>>>
>>>      >>>>   Niranjana
>>>      >>>>
>>>      >>>>   >     For reference, Windows solves this by allowing
>>>      arbitrarily many
>>>      >>>>   paging
>>>      >>>>   >     queues (what they call a VM_BIND engine/queue).  That
>>>      >>>>design works
>>>      >>>>   >     pretty well and solves the problems in 
>>>question.
>>>      >>>>Again, we could
>>>      >>>>   just
>>>      >>>>   >     make everything out-of-order and require using 
>>>syncobjs
>>>      >>>>to order
>>>      >>>>   things
>>>      >>>>   >     as userspace wants. That'd be fine too.
>>>      >>>>   >     One more note while I'm here: danvet said 
>>>something on
>>>      >>>>IRC about
>>>      >>>>   VM_BIND
>>>      >>>>   >     queues waiting for syncobjs to materialize.  We don't
>>>      really
>>>      >>>>   want/need
>>>      >>>>   >     this.  We already have all the machinery in 
>>>userspace to
>>>      handle
>>>      >>>>   >     wait-before-signal and waiting for syncobj fences to
>>>      >>>>materialize
>>>      >>>>   and
>>>      >>>>   >     that machinery is on by default.  It would actually
>>>      >>>>take MORE work
>>>      >>>>   in
>>>      >>>>   >     Mesa to turn it off and take advantage of the kernel
>>>      >>>>being able to
>>>      >>>>   wait
>>>      >>>>   >     for syncobjs to materialize. Also, getting that 
>>>right is
>>>      >>>>   ridiculously
>>>      >>>>   >     hard and I really don't want to get it wrong in 
>>>kernel
>>>      >>>>space.     When we
>>>      >>>>   >     do memory fences, wait-before-signal will be a 
>>>thing.  We
>>>      don't
>>>      >>>>   need to
>>>      >>>>   >     try and make it a thing for syncobj.
>>>      >>>>   >     --Jason
>>>      >>>>   >
>>>      >>>>   >   Thanks Jason,
>>>      >>>>   >
>>>      >>>>   >   I missed the bit in the Vulkan spec that we're 
>>>allowed to
>>>      have a
>>>      >>>>   sparse
>>>      >>>>   >   queue that does not implement either graphics or 
>>>compute
>>>      >>>>operations
>>>      >>>>   :
>>>      >>>>   >
>>>      >>>>   >     "While some implementations may include
>>>      >>>>   VK_QUEUE_SPARSE_BINDING_BIT
>>>      >>>>   >     support in queue families that also include
>>>      >>>>   >
>>>      >>>>   >      graphics and compute support, other 
>>>implementations may
>>>      only
>>>      >>>>   expose a
>>>      >>>>   > VK_QUEUE_SPARSE_BINDING_BIT-only queue
>>>      >>>>   >
>>>      >>>>   >      family."
>>>      >>>>   >
>>>      >>>>   >   So it can all be all a vm_bind engine that just does
>>>      bind/unbind
>>>      >>>>   >   operations.
>>>      >>>>   >
>>>      >>>>   >   But yes we need another engine for the 
>>>immediate/non-sparse
>>>      >>>>   operations.
>>>      >>>>   >
>>>      >>>>   >   -Lionel
>>>      >>>>   >
>>>      >>>>   >         >
>>>      >>>>   >       Daniel, any thoughts?
>>>      >>>>   >
>>>      >>>>   >       Niranjana
>>>      >>>>   >
>>>      >>>>   >       >Matt
>>>      >>>>   >       >
>>>      >>>>   >       >>
>>>      >>>>   >       >> Sorry I noticed this late.
>>>      >>>>   >       >>
>>>      >>>>   >       >>
>>>      >>>>   >       >> -Lionel
>>>      >>>>   >       >>
>>>      >>>>   >       >>
>
>

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [Intel-gfx] [RFC v3 1/3] drm/doc/rfc: VM_BIND feature design document
  2022-06-10  7:54                                   ` Niranjana Vishwanathapura
@ 2022-06-10  8:18                                     ` Lionel Landwerlin
  -1 siblings, 0 replies; 121+ messages in thread
From: Lionel Landwerlin @ 2022-06-10  8:18 UTC (permalink / raw)
  To: Niranjana Vishwanathapura
  Cc: Intel GFX, Maling list - DRI developers, Thomas Hellstrom,
	Chris Wilson, Jason Ekstrand, Daniel Vetter,
	Christian König

On 10/06/2022 10:54, Niranjana Vishwanathapura wrote:
> On Fri, Jun 10, 2022 at 09:53:24AM +0300, Lionel Landwerlin wrote:
>> On 09/06/2022 22:31, Niranjana Vishwanathapura wrote:
>>> On Thu, Jun 09, 2022 at 05:49:09PM +0300, Lionel Landwerlin wrote:
>>>>   On 09/06/2022 00:55, Jason Ekstrand wrote:
>>>>
>>>>     On Wed, Jun 8, 2022 at 4:44 PM Niranjana Vishwanathapura
>>>>     <niranjana.vishwanathapura@intel.com> wrote:
>>>>
>>>>       On Wed, Jun 08, 2022 at 08:33:25AM +0100, Tvrtko Ursulin wrote:
>>>>       >
>>>>       >
>>>>       >On 07/06/2022 22:32, Niranjana Vishwanathapura wrote:
>>>>       >>On Tue, Jun 07, 2022 at 11:18:11AM -0700, Niranjana 
>>>> Vishwanathapura
>>>>       wrote:
>>>>       >>>On Tue, Jun 07, 2022 at 12:12:03PM -0500, Jason Ekstrand 
>>>> wrote:
>>>>       >>>> On Fri, Jun 3, 2022 at 6:52 PM Niranjana Vishwanathapura
>>>>       >>>> <niranjana.vishwanathapura@intel.com> wrote:
>>>>       >>>>
>>>>       >>>>   On Fri, Jun 03, 2022 at 10:20:25AM +0300, Lionel 
>>>> Landwerlin
>>>>       wrote:
>>>>       >>>>   >   On 02/06/2022 23:35, Jason Ekstrand wrote:
>>>>       >>>>   >
>>>>       >>>>   >     On Thu, Jun 2, 2022 at 3:11 PM Niranjana 
>>>> Vishwanathapura
>>>>       >>>>   > <niranjana.vishwanathapura@intel.com> wrote:
>>>>       >>>>   >
>>>>       >>>>   >       On Wed, Jun 01, 2022 at 01:28:36PM -0700, Matthew
>>>>       >>>>Brost wrote:
>>>>       >>>>   >       >On Wed, Jun 01, 2022 at 05:25:49PM +0300, Lionel
>>>>       Landwerlin
>>>>       >>>>   wrote:
>>>>       >>>>   >       >> On 17/05/2022 21:32, Niranjana Vishwanathapura
>>>>       wrote:
>>>>       >>>>   >       >> > +VM_BIND/UNBIND ioctl will immediately start
>>>>       >>>>   binding/unbinding
>>>>       >>>>   >       the mapping in an
>>>>       >>>>   >       >> > +async worker. The binding and unbinding 
>>>> will
>>>>       >>>>work like a
>>>>       >>>>   special
>>>>       >>>>   >       GPU engine.
>>>>       >>>>   >       >> > +The binding and unbinding operations are
>>>>       serialized and
>>>>       >>>>   will
>>>>       >>>>   >       wait on specified
>>>>       >>>>   >       >> > +input fences before the operation and 
>>>> will signal
>>>>       the
>>>>       >>>>   output
>>>>       >>>>   >       fences upon the
>>>>       >>>>   >       >> > +completion of the operation. Due to
>>>>       serialization,
>>>>       >>>>   completion of
>>>>       >>>>   >       an operation
>>>>       >>>>   >       >> > +will also indicate that all previous 
>>>> operations
>>>>       >>>>are also
>>>>       >>>>   >       complete.
>>>>       >>>>   >       >>
>>>>       >>>>   >       >> I guess we should avoid saying "will 
>>>> immediately
>>>>       start
>>>>       >>>>   >       binding/unbinding" if
>>>>       >>>>   >       >> there are fences involved.
>>>>       >>>>   >       >>
>>>>       >>>>   >       >> And the fact that it's happening in an async
>>>>       >>>>worker seem to
>>>>       >>>>   imply
>>>>       >>>>   >       it's not
>>>>       >>>>   >       >> immediate.
>>>>       >>>>   >       >>
>>>>       >>>>   >
>>>>       >>>>   >       Ok, will fix.
>>>>       >>>>   >       This was added because in earlier design 
>>>> binding was
>>>>       deferred
>>>>       >>>>   until
>>>>       >>>>   >       next execbuff.
>>>>       >>>>   >       But now it is non-deferred (immediate in that 
>>>> sense).
>>>>       >>>>But yah,
>>>>       >>>>   this is
>>>>       >>>>   >       confusing
>>>>       >>>>   >       and will fix it.
>>>>       >>>>   >
>>>>       >>>>   >       >>
>>>>       >>>>   >       >> I have a question on the behavior of the bind
>>>>       >>>>operation when
>>>>       >>>>   no
>>>>       >>>>   >       input fence
>>>>       >>>>   >       >> is provided. Let say I do :
>>>>       >>>>   >       >>
>>>>       >>>>   >       >> VM_BIND (out_fence=fence1)
>>>>       >>>>   >       >>
>>>>       >>>>   >       >> VM_BIND (out_fence=fence2)
>>>>       >>>>   >       >>
>>>>       >>>>   >       >> VM_BIND (out_fence=fence3)
>>>>       >>>>   >       >>
>>>>       >>>>   >       >>
>>>>       >>>>   >       >> In what order are the fences going to be 
>>>> signaled?
>>>>       >>>>   >       >>
>>>>       >>>>   >       >> In the order of VM_BIND ioctls? Or out of 
>>>> order?
>>>>       >>>>   >       >>
>>>>       >>>>   >       >> Because you wrote "serialized" I assume 
>>>> it's: in
>>>>       order
>>>>       >>>>   >       >>
>>>>       >>>>   >
>>>>       >>>>   >       Yes, in the order of VM_BIND/UNBIND ioctls. 
>>>> Note that
>>>>       >>>>bind and
>>>>       >>>>   unbind
>>>>       >>>>   >       will use
>>>>       >>>>   >       the same queue and hence are ordered.
>>>>       >>>>   >
>>>>       >>>>   >       >>
>>>>       >>>>   >       >> One thing I didn't realize is that because 
>>>> we only
>>>>       get one
>>>>       >>>>   >       "VM_BIND" engine,
>>>>       >>>>   >       >> there is a disconnect from the Vulkan 
>>>> specification.
>>>>       >>>>   >       >>
>>>>       >>>>   >       >> In Vulkan VM_BIND operations are serialized 
>>>> but
>>>>       >>>>per engine.
>>>>       >>>>   >       >>
>>>>       >>>>   >       >> So you could have something like this :
>>>>       >>>>   >       >>
>>>>       >>>>   >       >> VM_BIND (engine=rcs0, in_fence=fence1,
>>>>       out_fence=fence2)
>>>>       >>>>   >       >>
>>>>       >>>>   >       >> VM_BIND (engine=ccs0, in_fence=fence3,
>>>>       out_fence=fence4)
>>>>       >>>>   >       >>
>>>>       >>>>   >       >>
>>>>       >>>>   >       >> fence1 is not signaled
>>>>       >>>>   >       >>
>>>>       >>>>   >       >> fence3 is signaled
>>>>       >>>>   >       >>
>>>>       >>>>   >       >> So the second VM_BIND will proceed before the
>>>>       >>>>first VM_BIND.
>>>>       >>>>   >       >>
>>>>       >>>>   >       >>
>>>>       >>>>   >       >> I guess we can deal with that scenario in
>>>>       >>>>userspace by doing
>>>>       >>>>   the
>>>>       >>>>   >       wait
>>>>       >>>>   >       >> ourselves in one thread per engines.
>>>>       >>>>   >       >>
>>>>       >>>>   >       >> But then it makes the VM_BIND input fences 
>>>> useless.
>>>>       >>>>   >       >>
>>>>       >>>>   >       >>
>>>>       >>>>   >       >> Daniel : what do you think? Should be 
>>>> rework this or
>>>>       just
>>>>       >>>>   deal with
>>>>       >>>>   >       wait
>>>>       >>>>   >       >> fences in userspace?
>>>>       >>>>   >       >>
>>>>       >>>>   >       >
>>>>       >>>>   >       >My opinion is rework this but make the 
>>>> ordering via
>>>>       >>>>an engine
>>>>       >>>>   param
>>>>       >>>>   >       optional.
>>>>       >>>>   >       >
>>>>       >>>>   >       >e.g. A VM can be configured so all binds are 
>>>> ordered
>>>>       >>>>within the
>>>>       >>>>   VM
>>>>       >>>>   >       >
>>>>       >>>>   >       >e.g. A VM can be configured so all binds 
>>>> accept an
>>>>       engine
>>>>       >>>>   argument
>>>>       >>>>   >       (in
>>>>       >>>>   >       >the case of the i915 likely this is a gem 
>>>> context
>>>>       >>>>handle) and
>>>>       >>>>   binds
>>>>       >>>>   >       >ordered with respect to that engine.
>>>>       >>>>   >       >
>>>>       >>>>   >       >This gives UMDs options as the later likely 
>>>> consumes
>>>>       >>>>more KMD
>>>>       >>>>   >       resources
>>>>       >>>>   >       >so if a different UMD can live with binds being
>>>>       >>>>ordered within
>>>>       >>>>   the VM
>>>>       >>>>   >       >they can use a mode consuming less resources.
>>>>       >>>>   >       >
>>>>       >>>>   >
>>>>       >>>>   >       I think we need to be careful here if we are 
>>>> looking
>>>>       for some
>>>>       >>>>   out of
>>>>       >>>>   >       (submission) order completion of vm_bind/unbind.
>>>>       >>>>   >       In-order completion means, in a batch of binds 
>>>> and
>>>>       >>>>unbinds to be
>>>>       >>>>   >       completed in-order, user only needs to specify
>>>>       >>>>in-fence for the
>>>>       >>>>   >       first bind/unbind call and the our-fence for 
>>>> the last
>>>>       >>>>   bind/unbind
>>>>       >>>>   >       call. Also, the VA released by an unbind call 
>>>> can be
>>>>       >>>>re-used by
>>>>       >>>>   >       any subsequent bind call in that in-order batch.
>>>>       >>>>   >
>>>>       >>>>   >       These things will break if binding/unbinding 
>>>> were to
>>>>       >>>>be allowed
>>>>       >>>>   to
>>>>       >>>>   >       go out of order (of submission) and user need 
>>>> to be
>>>>       extra
>>>>       >>>>   careful
>>>>       >>>>   >       not to run into premature triggering of 
>>>> out-fence and
>>>>       bind
>>>>       >>>>   failing
>>>>       >>>>   >       as VA is still in use etc.
>>>>       >>>>   >
>>>>       >>>>   >       Also, VM_BIND binds the provided mapping on the
>>>>       specified
>>>>       >>>>   address
>>>>       >>>>   >       space
>>>>       >>>>   >       (VM). So, the uapi is not engine/context 
>>>> specific.
>>>>       >>>>   >
>>>>       >>>>   >       We can however add a 'queue' to the uapi which 
>>>> can be
>>>>       >>>>one from
>>>>       >>>>   the
>>>>       >>>>   >       pre-defined queues,
>>>>       >>>>   >       I915_VM_BIND_QUEUE_0
>>>>       >>>>   >       I915_VM_BIND_QUEUE_1
>>>>       >>>>   >       ...
>>>>       >>>>   >       I915_VM_BIND_QUEUE_(N-1)
>>>>       >>>>   >
>>>>       >>>>   >       KMD will spawn an async work queue for each 
>>>> queue which
>>>>       will
>>>>       >>>>   only
>>>>       >>>>   >       bind the mappings on that queue in the order of
>>>>       submission.
>>>>       >>>>   >       User can assign the queue to per engine or 
>>>> anything
>>>>       >>>>like that.
>>>>       >>>>   >
>>>>       >>>>   >       But again here, user need to be careful and not
>>>>       >>>>deadlock these
>>>>       >>>>   >       queues with circular dependency of fences.
>>>>       >>>>   >
>>>>       >>>>   >       I prefer adding this later an as extension 
>>>> based on
>>>>       >>>>whether it
>>>>       >>>>   >       is really helping with the implementation.
>>>>       >>>>   >
>>>>       >>>>   >     I can tell you right now that having everything 
>>>> on a
>>>>       single
>>>>       >>>>   in-order
>>>>       >>>>   >     queue will not get us the perf we want.  What 
>>>> vulkan
>>>>       >>>>really wants
>>>>       >>>>   is one
>>>>       >>>>   >     of two things:
>>>>       >>>>   >      1. No implicit ordering of VM_BIND ops.  They just
>>>>       happen in
>>>>       >>>>   whatever
>>>>       >>>>   >     their dependencies are resolved and we ensure 
>>>> ordering
>>>>       >>>>ourselves
>>>>       >>>>   by
>>>>       >>>>   >     having a syncobj in the VkQueue.
>>>>       >>>>   >      2. The ability to create multiple VM_BIND 
>>>> queues.  We
>>>>       need at
>>>>       >>>>   least 2
>>>>       >>>>   >     but I don't see why there needs to be a limit 
>>>> besides
>>>>       >>>>the limits
>>>>       >>>>   the
>>>>       >>>>   >     i915 API already has on the number of engines.  
>>>> Vulkan
>>>>       could
>>>>       >>>>   expose
>>>>       >>>>   >     multiple sparse binding queues to the client if 
>>>> it's not
>>>>       >>>>   arbitrarily
>>>>       >>>>   >     limited.
>>>>       >>>>
>>>>       >>>>   Thanks Jason, Lionel.
>>>>       >>>>
>>>>       >>>>   Jason, what are you referring to when you say "limits 
>>>> the i915
>>>>       API
>>>>       >>>>   already
>>>>       >>>>   has on the number of engines"? I am not sure if there 
>>>> is such
>>>>       an uapi
>>>>       >>>>   today.
>>>>       >>>>
>>>>       >>>> There's a limit of something like 64 total engines today 
>>>> based on
>>>>       the
>>>>       >>>> number of bits we can cram into the exec flags in 
>>>> execbuffer2.  I
>>>>       think
>>>>       >>>> someone had an extended version that allowed more but I 
>>>> ripped it
>>>>       out
>>>>       >>>> because no one was using it.  Of course, execbuffer3 
>>>> might not
>>>>       >>>>have that
>>>>       >>>> problem at all.
>>>>       >>>>
>>>>       >>>
>>>>       >>>Thanks Jason.
>>>>       >>>Ok, I am not sure which exec flag is that, but yah, 
>>>> execbuffer3
>>>>       probably
>>>>       >>>will not have this limitation. So, we need to define a
>>>>       VM_BIND_MAX_QUEUE
>>>>       >>>and somehow export it to user (I am thinking of embedding 
>>>> it in
>>>>       >>>I915_PARAM_HAS_VM_BIND. bits[0]->HAS_VM_BIND, bits[1-3]->'n'
>>>>       meaning 2^n
>>>>       >>>queues).
>>>>       >>
>>>>       >>Ah, I think you are talking about I915_EXEC_RING_MASK (0x3f) 
>>>> which
>>>>       execbuf3
>>>>
>>>>     Yup!  That's exactly the limit I was talking about.
>>>>
>>>>       >>will also have. So, we can simply define in vm_bind/unbind
>>>>       structures,
>>>>       >>
>>>>       >>#define I915_VM_BIND_MAX_QUEUE   64
>>>>       >>        __u32 queue;
>>>>       >>
>>>>       >>I think that will keep things simple.
>>>>       >
>>>>       >Hmmm? What does the execbuf2 limit have to do with how many engines
>>>>       >hardware can have? I suggest not to do that.
>>>>       >
>>>>       >Change which added this:
>>>>       >
>>>>       >       if (set.num_engines > I915_EXEC_RING_MASK + 1)
>>>>       >               return -EINVAL;
>>>>       >
>>>>       >To context creation needs to be undone, so that users can 
>>>> create engine
>>>>       >maps with all hardware engines, and execbuf3 can access them 
>>>> all.
>>>>       >
>>>>
>>>>       The earlier plan was to carry I915_EXEC_RING_MASK (0x3f) to 
>>>> execbuff3 also.
>>>>       Hence, I was using the same limit for VM_BIND queues (64, or 
>>>> 65 if we
>>>>       make it N+1).
>>>>       But, as discussed in other thread of this RFC series, we are 
>>>> planning
>>>>       to drop this I915_EXEC_RING_MASK in execbuff3. So, there won't be
>>>>       any uapi that limits the number of engines (and hence the number
>>>>       of vm_bind queues that need to be supported).
>>>>
>>>>       If we leave the number of vm_bind queues to be arbitrarily large
>>>>       (__u32 queue_idx) then, we need to have a hashmap for queue 
>>>> (a wq,
>>>>       work_item and a linked list) lookup from the user specified 
>>>> queue
>>>>       index.
>>>>       The other option is to just put some hard limit (say 64 or 65) 
>>>> and use
>>>>       an array of queues in VM (each created upon first use). I 
>>>> prefer this.
>>>>
>>>>     I don't get why a VM_BIND queue is any different from any other 
>>>> queue or
>>>>     userspace-visible kernel object.  But I'll leave those details 
>>>> up to
>>>>     danvet or whoever else might be reviewing the implementation.
>>>>     --Jason
>>>>
>>>>   I kind of agree here. Wouldn't be simpler to have the bind queue 
>>>> created
>>>>   like the others when we build the engine map?
>>>>
>>>>   For userspace it's then just matter of selecting the right queue 
>>>> ID when
>>>>   submitting.
>>>>
>>>>   If there is ever a possibility to have this work on the GPU, it 
>>>> would be
>>>>   all ready.
>>>>
>>>
>>> I did sync offline with Matt Brost on this.
>>> We can add a VM_BIND engine class and let the user create VM_BIND 
>>> engines (queues).
>>> The problem is that in i915 the engine creation interface is bound to 
>>> gem_context.
>>> So, in the vm_bind ioctl, we would need both context_id and queue_idx 
>>> for a proper
>>> lookup of the user-created engine. This is a bit awkward, as vm_bind is an
>>> interface to the VM (address space) and has nothing to do with gem_context.
>>
>>
>> A gem_context has a single vm object right?
>>
>> Set through I915_CONTEXT_PARAM_VM at creation or given a default one 
>> if not.
>>
>> So it's just like picking up the vm like it's done at execbuffer time 
>> right now : eb->context->vm
>>
>
> Are you suggesting replacing 'vm_id' with 'context_id' in the 
> VM_BIND/UNBIND
> ioctl and probably call it CONTEXT_BIND/UNBIND, because VM can be 
> obtained
> from the context?


Yes, because if we go for engines, they're associated with a context and 
so also associated with the VM bound to the context.
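
For illustration, a context-based variant would derive the VM kernel-side
much like execbuf does today; a sketch with error handling trimmed and
names approximate:

        struct i915_gem_context *ctx;
        struct i915_address_space *vm;

        ctx = i915_gem_context_lookup(file_priv, args->ctx_id);
        if (IS_ERR(ctx))
                return PTR_ERR(ctx);

        /* same VM that execbuf would pick up via eb->context->vm */
        vm = i915_gem_context_get_eb_vm(ctx);
        i915_gem_context_put(ctx);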


> I think the interface is clean as an interface to the VM. It is only that we
> don't have a clean way to create a raw VM_BIND engine (not associated 
> with
> any context) with the i915 uapi.
> Maybe we can add such an interface, but I don't think that is worth it
> (we might as well just use a queue_idx in the VM_BIND/UNBIND ioctl as I 
> mentioned
> above).
> Anyone has any thoughts?
>
>>
>>> Another problem is, if two VMs are binding with the same defined 
>>> engine,
>>> binding on VM1 can get unnecessarily blocked by binding on VM2 (which 
>>> may be
>>> waiting on its in_fence).
>>
>>
>> Maybe I'm missing something, but how can you have 2 vm objects with a 
>> single gem_context right now?
>>
>
> No, we don't have 2 VMs for a gem_context.
> Say ctx1 has vm1 and ctx2 has vm2.
> First vm_bind call was for vm1 with q_idx 1 in ctx1 engine map.
> Second vm_bind call was for vm2 with q_idx 2 in ctx2 engine map. If 
> those two queue indices point to the same underlying vm_bind engine,
> then the second vm_bind call gets blocked until the first vm_bind call's
> 'in' fence is triggered and the bind completes.
>
> With per-VM queues, this is not a problem, as two VMs will not end up
> sharing the same queue.
>
> BTW, I just posted an updated PATCH series.
> https://www.spinics.net/lists/dri-devel/msg350483.html
>
> Niranjana
>
>>
>>>
>>> So, my preference here is to just add a 'u32 queue' index in 
>>> vm_bind/unbind
>>> ioctl, and the queues are per VM.
>>>
>>> Niranjana
>>>
>>>>   Thanks,
>>>>
>>>>   -Lionel
>>>>
>>>>
>>>>       Niranjana
>>>>
>>>>       >Regards,
>>>>       >
>>>>       >Tvrtko
>>>>       >
>>>>       >>
>>>>       >>Niranjana
>>>>       >>
>>>>       >>>
>>>>       >>>>   I am trying to see how many queues we need and don't 
>>>> want it to
>>>>       be
>>>>       >>>>   arbitrarily
>>>>       >>>>   large and unduly blow up memory usage and complexity 
>>>> in i915
>>>>       driver.
>>>>       >>>>
>>>>       >>>> I expect a Vulkan driver to use at most 2 in the vast 
>>>> majority
>>>>       >>>>of cases. I
>>>>       >>>> could imagine a client wanting to create more than 1 sparse
>>>>       >>>>queue in which
>>>>       >>>> case, it'll be N+1 but that's unlikely. As far as 
>>>> complexity
>>>>       >>>>goes, once
>>>>       >>>> you allow two, I don't think the complexity is going up by
>>>>       >>>>allowing N.  As
>>>>       >>>> for memory usage, creating more queues means more 
>>>> memory.  That's
>>>>       a
>>>>       >>>> trade-off that userspace can make. Again, the expected 
>>>> number
>>>>       >>>>here is 1
>>>>       >>>> or 2 in the vast majority of cases so I don't think you 
>>>> need to
>>>>       worry.
>>>>       >>>
>>>>       >>>Ok, will start with n=3 meaning 8 queues.
>>>>       >>>That would require us to create 8 workqueues.
>>>>       >>>We can change 'n' later if required.
>>>>       >>>
>>>>       >>>Niranjana
>>>>       >>>
>>>>       >>>>
>>>>       >>>>   >     Why?  Because Vulkan has two basic kind of bind
>>>>       >>>>operations and we
>>>>       >>>>   don't
>>>>       >>>>   >     want any dependencies between them:
>>>>       >>>>   >      1. Immediate.  These happen right after BO 
>>>> creation or
>>>>       >>>>maybe as
>>>>       >>>>   part of
>>>>       >>>>   >     vkBindImageMemory() or VkBindBufferMemory().  These
>>>>       >>>>don't happen
>>>>       >>>>   on a
>>>>       >>>>   >     queue and we don't want them serialized with 
>>>> anything.       To
>>>>       >>>>   synchronize
>>>>       >>>>   >     with submit, we'll have a syncobj in the 
>>>> VkDevice which
>>>>       is
>>>>       >>>>   signaled by
>>>>       >>>>   >     all immediate bind operations and make submits 
>>>> wait on
>>>>       it.
>>>>       >>>>   >      2. Queued (sparse): These happen on a VkQueue 
>>>> which may
>>>>       be the
>>>>       >>>>   same as
>>>>       >>>>   >     a render/compute queue or may be its own queue.  
>>>> It's up
>>>>       to us
>>>>       >>>>   what we
>>>>       >>>>   >     want to advertise.  From the Vulkan API PoV, 
>>>> this is like
>>>>       any
>>>>       >>>>   other
>>>>       >>>>   >     queue.  Operations on it wait on and signal 
>>>> semaphores.       If we
>>>>       >>>>   have a
>>>>       >>>>   >     VM_BIND engine, we'd provide syncobjs to wait and
>>>>       >>>>signal just like
>>>>       >>>>   we do
>>>>       >>>>   >     in execbuf().
>>>>       >>>>   >     The important thing is that we don't want one 
>>>> type of
>>>>       >>>>operation to
>>>>       >>>>   block
>>>>       >>>>   >     on the other.  If immediate binds are blocking 
>>>> on sparse
>>>>       binds,
>>>>       >>>>   it's
>>>>       >>>>   >     going to cause over-synchronization issues.
>>>>       >>>>   >     In terms of the internal implementation, I know 
>>>> that
>>>>       >>>>there's going
>>>>       >>>>   to be
>>>>       >>>>   >     a lock on the VM and that we can't actually do 
>>>> these
>>>>       things in
>>>>       >>>>   >     parallel.  That's fine. Once the dma_fences have
>>>>       signaled and
>>>>       >>>>   we're
>>>>       >>>>
>>>>       >>>>   That's correct. It is like a single VM_BIND engine with
>>>>       >>>>multiple queues
>>>>       >>>>   feeding to it.
>>>>       >>>>
>>>>       >>>> Right.  As long as the queues themselves are independent 
>>>> and
>>>>       >>>>can block on
>>>>       >>>> dma_fences without holding up other queues, I think 
>>>> we're fine.
>>>>       >>>>
>>>>       >>>>   >     unblocked to do the bind operation, I don't care if
>>>>       >>>>there's a bit
>>>>       >>>>   of
>>>>       >>>>   >     synchronization due to locking.  That's 
>>>> expected.  What
>>>>       >>>>we can't
>>>>       >>>>   afford
>>>>       >>>>   >     to have is an immediate bind operation suddenly 
>>>> blocking
>>>>       on a
>>>>       >>>>   sparse
>>>>       >>>>   >     operation which is blocked on a compute job 
>>>> that's going
>>>>       to run
>>>>       >>>>   for
>>>>       >>>>   >     another 5ms.
>>>>       >>>>
>>>>       >>>>   As the VM_BIND queue is per VM, VM_BIND on one VM 
>>>> doesn't block
>>>>       the
>>>>       >>>>   VM_BIND
>>>>       >>>>   on other VMs. I am not sure about usecases here, but just
>>>>       wanted to
>>>>       >>>>   clarify.
>>>>       >>>>
>>>>       >>>> Yes, that's what I would expect.
>>>>       >>>> --Jason
>>>>       >>>>
>>>>       >>>>   Niranjana
>>>>       >>>>
>>>>       >>>>   >     For reference, Windows solves this by allowing
>>>>       arbitrarily many
>>>>       >>>>   paging
>>>>       >>>>   >     queues (what they call a VM_BIND engine/queue).  
>>>> That
>>>>       >>>>design works
>>>>       >>>>   >     pretty well and solves the problems in question. 
>>>>       >>>>Again, we could
>>>>       >>>>   just
>>>>       >>>>   >     make everything out-of-order and require using 
>>>> syncobjs
>>>>       >>>>to order
>>>>       >>>>   things
>>>>       >>>>   >     as userspace wants. That'd be fine too.
>>>>       >>>>   >     One more note while I'm here: danvet said 
>>>> something on
>>>>       >>>>IRC about
>>>>       >>>>   VM_BIND
>>>>       >>>>   >     queues waiting for syncobjs to materialize.  We 
>>>> don't
>>>>       really
>>>>       >>>>   want/need
>>>>       >>>>   >     this.  We already have all the machinery in 
>>>> userspace to
>>>>       handle
>>>>       >>>>   >     wait-before-signal and waiting for syncobj 
>>>> fences to
>>>>       >>>>materialize
>>>>       >>>>   and
>>>>       >>>>   >     that machinery is on by default.  It would actually
>>>>       >>>>take MORE work
>>>>       >>>>   in
>>>>       >>>>   >     Mesa to turn it off and take advantage of the 
>>>> kernel
>>>>       >>>>being able to
>>>>       >>>>   wait
>>>>       >>>>   >     for syncobjs to materialize. Also, getting that 
>>>> right is
>>>>       >>>>   ridiculously
>>>>       >>>>   >     hard and I really don't want to get it wrong in 
>>>> kernel
>>>>       >>>>space.     When we
>>>>       >>>>   >     do memory fences, wait-before-signal will be a 
>>>> thing.  We
>>>>       don't
>>>>       >>>>   need to
>>>>       >>>>   >     try and make it a thing for syncobj.
>>>>       >>>>   >     --Jason
>>>>       >>>>   >
>>>>       >>>>   >   Thanks Jason,
>>>>       >>>>   >
>>>>       >>>>   >   I missed the bit in the Vulkan spec that we're 
>>>> allowed to
>>>>       have a
>>>>       >>>>   sparse
>>>>       >>>>   >   queue that does not implement either graphics or 
>>>> compute
>>>>       >>>>operations
>>>>       >>>>   :
>>>>       >>>>   >
>>>>       >>>>   >     "While some implementations may include
>>>>       >>>>   VK_QUEUE_SPARSE_BINDING_BIT
>>>>       >>>>   >     support in queue families that also include
>>>>       >>>>   >
>>>>       >>>>   >      graphics and compute support, other 
>>>> implementations may
>>>>       only
>>>>       >>>>   expose a
>>>>       >>>>   > VK_QUEUE_SPARSE_BINDING_BIT-only queue
>>>>       >>>>   >
>>>>       >>>>   >      family."
>>>>       >>>>   >
>>>>       >>>>   >   So it can all be all a vm_bind engine that just does
>>>>       bind/unbind
>>>>       >>>>   >   operations.
>>>>       >>>>   >
>>>>       >>>>   >   But yes we need another engine for the 
>>>> immediate/non-sparse
>>>>       >>>>   operations.
>>>>       >>>>   >
>>>>       >>>>   >   -Lionel
>>>>       >>>>   >
>>>>       >>>>   >         >
>>>>       >>>>   >       Daniel, any thoughts?
>>>>       >>>>   >
>>>>       >>>>   >       Niranjana
>>>>       >>>>   >
>>>>       >>>>   >       >Matt
>>>>       >>>>   >       >
>>>>       >>>>   >       >>
>>>>       >>>>   >       >> Sorry I noticed this late.
>>>>       >>>>   >       >>
>>>>       >>>>   >       >>
>>>>       >>>>   >       >> -Lionel
>>>>       >>>>   >       >>
>>>>       >>>>   >       >>
>>
>>


^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [Intel-gfx] [RFC v3 1/3] drm/doc/rfc: VM_BIND feature design document
@ 2022-06-10  8:18                                     ` Lionel Landwerlin
  0 siblings, 0 replies; 121+ messages in thread
From: Lionel Landwerlin @ 2022-06-10  8:18 UTC (permalink / raw)
  To: Niranjana Vishwanathapura
  Cc: Intel GFX, Maling list - DRI developers, Thomas Hellstrom,
	Chris Wilson, Daniel Vetter, Christian König

On 10/06/2022 10:54, Niranjana Vishwanathapura wrote:
> On Fri, Jun 10, 2022 at 09:53:24AM +0300, Lionel Landwerlin wrote:
>> On 09/06/2022 22:31, Niranjana Vishwanathapura wrote:
>>> On Thu, Jun 09, 2022 at 05:49:09PM +0300, Lionel Landwerlin wrote:
>>>>   On 09/06/2022 00:55, Jason Ekstrand wrote:
>>>>
>>>>     On Wed, Jun 8, 2022 at 4:44 PM Niranjana Vishwanathapura
>>>>     <niranjana.vishwanathapura@intel.com> wrote:
>>>>
>>>>       On Wed, Jun 08, 2022 at 08:33:25AM +0100, Tvrtko Ursulin wrote:
>>>>       >
>>>>       >
>>>>       >On 07/06/2022 22:32, Niranjana Vishwanathapura wrote:
>>>>       >>On Tue, Jun 07, 2022 at 11:18:11AM -0700, Niranjana 
>>>> Vishwanathapura
>>>>       wrote:
>>>>       >>>On Tue, Jun 07, 2022 at 12:12:03PM -0500, Jason Ekstrand 
>>>> wrote:
>>>>       >>>> On Fri, Jun 3, 2022 at 6:52 PM Niranjana Vishwanathapura
>>>>       >>>> <niranjana.vishwanathapura@intel.com> wrote:
>>>>       >>>>
>>>>       >>>>   On Fri, Jun 03, 2022 at 10:20:25AM +0300, Lionel 
>>>> Landwerlin
>>>>       wrote:
>>>>       >>>>   >   On 02/06/2022 23:35, Jason Ekstrand wrote:
>>>>       >>>>   >
>>>>       >>>>   >     On Thu, Jun 2, 2022 at 3:11 PM Niranjana 
>>>> Vishwanathapura
>>>>       >>>>   > <niranjana.vishwanathapura@intel.com> wrote:
>>>>       >>>>   >
>>>>       >>>>   >       On Wed, Jun 01, 2022 at 01:28:36PM -0700, Matthew
>>>>       >>>>Brost wrote:
>>>>       >>>>   >       >On Wed, Jun 01, 2022 at 05:25:49PM +0300, Lionel
>>>>       Landwerlin
>>>>       >>>>   wrote:
>>>>       >>>>   >       >> On 17/05/2022 21:32, Niranjana Vishwanathapura
>>>>       wrote:
>>>>       >>>>   >       >> > +VM_BIND/UNBIND ioctl will immediately start
>>>>       >>>>   binding/unbinding
>>>>       >>>>   >       the mapping in an
>>>>       >>>>   >       >> > +async worker. The binding and unbinding 
>>>> will
>>>>       >>>>work like a
>>>>       >>>>   special
>>>>       >>>>   >       GPU engine.
>>>>       >>>>   >       >> > +The binding and unbinding operations are
>>>>       serialized and
>>>>       >>>>   will
>>>>       >>>>   >       wait on specified
>>>>       >>>>   >       >> > +input fences before the operation and 
>>>> will signal
>>>>       the
>>>>       >>>>   output
>>>>       >>>>   >       fences upon the
>>>>       >>>>   >       >> > +completion of the operation. Due to
>>>>       serialization,
>>>>       >>>>   completion of
>>>>       >>>>   >       an operation
>>>>       >>>>   >       >> > +will also indicate that all previous 
>>>> operations
>>>>       >>>>are also
>>>>       >>>>   >       complete.
>>>>       >>>>   >       >>
>>>>       >>>>   >       >> I guess we should avoid saying "will 
>>>> immediately
>>>>       start
>>>>       >>>>   >       binding/unbinding" if
>>>>       >>>>   >       >> there are fences involved.
>>>>       >>>>   >       >>
>>>>       >>>>   >       >> And the fact that it's happening in an async
>>>>       >>>>worker seem to
>>>>       >>>>   imply
>>>>       >>>>   >       it's not
>>>>       >>>>   >       >> immediate.
>>>>       >>>>   >       >>
>>>>       >>>>   >
>>>>       >>>>   >       Ok, will fix.
>>>>       >>>>   >       This was added because in earlier design 
>>>> binding was
>>>>       deferred
>>>>       >>>>   until
>>>>       >>>>   >       next execbuff.
>>>>       >>>>   >       But now it is non-deferred (immediate in that 
>>>> sense).
>>>>       >>>>But yah,
>>>>       >>>>   this is
>>>>       >>>>   >       confusing
>>>>       >>>>   >       and will fix it.
>>>>       >>>>   >
>>>>       >>>>   >       >>
>>>>       >>>>   >       >> I have a question on the behavior of the bind
>>>>       >>>>operation when
>>>>       >>>>   no
>>>>       >>>>   >       input fence
>>>>       >>>>   >       >> is provided. Let say I do :
>>>>       >>>>   >       >>
>>>>       >>>>   >       >> VM_BIND (out_fence=fence1)
>>>>       >>>>   >       >>
>>>>       >>>>   >       >> VM_BIND (out_fence=fence2)
>>>>       >>>>   >       >>
>>>>       >>>>   >       >> VM_BIND (out_fence=fence3)
>>>>       >>>>   >       >>
>>>>       >>>>   >       >>
>>>>       >>>>   >       >> In what order are the fences going to be 
>>>> signaled?
>>>>       >>>>   >       >>
>>>>       >>>>   >       >> In the order of VM_BIND ioctls? Or out of 
>>>> order?
>>>>       >>>>   >       >>
>>>>       >>>>   >       >> Because you wrote "serialized I assume it's 
>>>> : in
>>>>       order
>>>>       >>>>   >       >>
>>>>       >>>>   >
>>>>       >>>>   >       Yes, in the order of VM_BIND/UNBIND ioctls. 
>>>> Note that
>>>>       >>>>bind and
>>>>       >>>>   unbind
>>>>       >>>>   >       will use
>>>>       >>>>   >       the same queue and hence are ordered.
>>>>       >>>>   >
>>>>       >>>>   >       >>
>>>>       >>>>   >       >> One thing I didn't realize is that because 
>>>> we only
>>>>       get one
>>>>       >>>>   >       "VM_BIND" engine,
>>>>       >>>>   >       >> there is a disconnect from the Vulkan 
>>>> specification.
>>>>       >>>>   >       >>
>>>>       >>>>   >       >> In Vulkan VM_BIND operations are serialized 
>>>> but
>>>>       >>>>per engine.
>>>>       >>>>   >       >>
>>>>       >>>>   >       >> So you could have something like this :
>>>>       >>>>   >       >>
>>>>       >>>>   >       >> VM_BIND (engine=rcs0, in_fence=fence1,
>>>>       out_fence=fence2)
>>>>       >>>>   >       >>
>>>>       >>>>   >       >> VM_BIND (engine=ccs0, in_fence=fence3,
>>>>       out_fence=fence4)
>>>>       >>>>   >       >>
>>>>       >>>>   >       >>
>>>>       >>>>   >       >> fence1 is not signaled
>>>>       >>>>   >       >>
>>>>       >>>>   >       >> fence3 is signaled
>>>>       >>>>   >       >>
>>>>       >>>>   >       >> So the second VM_BIND will proceed before the
>>>>       >>>>first VM_BIND.
>>>>       >>>>   >       >>
>>>>       >>>>   >       >>
>>>>       >>>>   >       >> I guess we can deal with that scenario in
>>>>       >>>>userspace by doing
>>>>       >>>>   the
>>>>       >>>>   >       wait
>>>>       >>>>   >       >> ourselves in one thread per engines.
>>>>       >>>>   >       >>
>>>>       >>>>   >       >> But then it makes the VM_BIND input fences 
>>>> useless.
>>>>       >>>>   >       >>
>>>>       >>>>   >       >>
>>>>       >>>>   >       >> Daniel : what do you think? Should be 
>>>> rework this or
>>>>       just
>>>>       >>>>   deal with
>>>>       >>>>   >       wait
>>>>       >>>>   >       >> fences in userspace?
>>>>       >>>>   >       >>
>>>>       >>>>   >       >
>>>>       >>>>   >       >My opinion is rework this but make the 
>>>> ordering via
>>>>       >>>>an engine
>>>>       >>>>   param
>>>>       >>>>   >       optional.
>>>>       >>>>   >       >
>>>>       >>>>   >       >e.g. A VM can be configured so all binds are 
>>>> ordered
>>>>       >>>>within the
>>>>       >>>>   VM
>>>>       >>>>   >       >
>>>>       >>>>   >       >e.g. A VM can be configured so all binds 
>>>> accept an
>>>>       engine
>>>>       >>>>   argument
>>>>       >>>>   >       (in
>>>>       >>>>   >       >the case of the i915 likely this is a gem 
>>>> context
>>>>       >>>>handle) and
>>>>       >>>>   binds
>>>>       >>>>   >       >ordered with respect to that engine.
>>>>       >>>>   >       >
>>>>       >>>>   >       >This gives UMDs options as the later likely 
>>>> consumes
>>>>       >>>>more KMD
>>>>       >>>>   >       resources
>>>>       >>>>   >       >so if a different UMD can live with binds being
>>>>       >>>>ordered within
>>>>       >>>>   the VM
>>>>       >>>>   >       >they can use a mode consuming less resources.
>>>>       >>>>   >       >
>>>>       >>>>   >
>>>>       >>>>   >       I think we need to be careful here if we are 
>>>> looking
>>>>       for some
>>>>       >>>>   out of
>>>>       >>>>   >       (submission) order completion of vm_bind/unbind.
>>>>       >>>>   >       In-order completion means, in a batch of binds 
>>>> and
>>>>       >>>>unbinds to be
>>>>       >>>>   >       completed in-order, user only needs to specify
>>>>       >>>>in-fence for the
>>>>       >>>>   >       first bind/unbind call and the our-fence for 
>>>> the last
>>>>       >>>>   bind/unbind
>>>>       >>>>   >       call. Also, the VA released by an unbind call 
>>>> can be
>>>>       >>>>re-used by
>>>>       >>>>   >       any subsequent bind call in that in-order batch.
>>>>       >>>>   >
>>>>       >>>>   >       These things will break if binding/unbinding 
>>>> were to
>>>>       >>>>be allowed
>>>>       >>>>   to
>>>>       >>>>   >       go out of order (of submission) and user need 
>>>> to be
>>>>       extra
>>>>       >>>>   careful
>>>>       >>>>   >       not to run into pre-mature triggereing of 
>>>> out-fence and
>>>>       bind
>>>>       >>>>   failing
>>>>       >>>>   >       as VA is still in use etc.
>>>>       >>>>   >
>>>>       >>>>   >       Also, VM_BIND binds the provided mapping on the
>>>>       specified
>>>>       >>>>   address
>>>>       >>>>   >       space
>>>>       >>>>   >       (VM). So, the uapi is not engine/context 
>>>> specific.
>>>>       >>>>   >
>>>>       >>>>   >       We can however add a 'queue' to the uapi which 
>>>> can be
>>>>       >>>>one from
>>>>       >>>>   the
>>>>       >>>>   >       pre-defined queues,
>>>>       >>>>   >       I915_VM_BIND_QUEUE_0
>>>>       >>>>   >       I915_VM_BIND_QUEUE_1
>>>>       >>>>   >       ...
>>>>       >>>>   >       I915_VM_BIND_QUEUE_(N-1)
>>>>       >>>>   >
>>>>       >>>>   >       KMD will spawn an async work queue for each 
>>>> queue which
>>>>       will
>>>>       >>>>   only
>>>>       >>>>   >       bind the mappings on that queue in the order of
>>>>       submission.
>>>>       >>>>   >       User can assign the queue to per engine or 
>>>> anything
>>>>       >>>>like that.
>>>>       >>>>   >
>>>>       >>>>   >       But again here, user need to be careful and not
>>>>       >>>>deadlock these
>>>>       >>>>   >       queues with circular dependency of fences.
>>>>       >>>>   >
>>>>       >>>>   >       I prefer adding this later an as extension 
>>>> based on
>>>>       >>>>whether it
>>>>       >>>>   >       is really helping with the implementation.
>>>>       >>>>   >
>>>>       >>>>   >     I can tell you right now that having everything 
>>>> on a
>>>>       single
>>>>       >>>>   in-order
>>>>       >>>>   >     queue will not get us the perf we want.  What 
>>>> vulkan
>>>>       >>>>really wants
>>>>       >>>>   is one
>>>>       >>>>   >     of two things:
>>>>       >>>>   >      1. No implicit ordering of VM_BIND ops.  They just
>>>>       happen in
>>>>       >>>>   whatever
>>>>       >>>>   >     their dependencies are resolved and we ensure 
>>>> ordering
>>>>       >>>>ourselves
>>>>       >>>>   by
>>>>       >>>>   >     having a syncobj in the VkQueue.
>>>>       >>>>   >      2. The ability to create multiple VM_BIND 
>>>> queues.  We
>>>>       need at
>>>>       >>>>   least 2
>>>>       >>>>   >     but I don't see why there needs to be a limit 
>>>> besides
>>>>       >>>>the limits
>>>>       >>>>   the
>>>>       >>>>   >     i915 API already has on the number of engines.  
>>>> Vulkan
>>>>       could
>>>>       >>>>   expose
>>>>       >>>>   >     multiple sparse binding queues to the client if 
>>>> it's not
>>>>       >>>>   arbitrarily
>>>>       >>>>   >     limited.
>>>>       >>>>
>>>>       >>>>   Thanks Jason, Lionel.
>>>>       >>>>
>>>>       >>>>   Jason, what are you referring to when you say "limits 
>>>> the i915
>>>>       API
>>>>       >>>>   already
>>>>       >>>>   has on the number of engines"? I am not sure if there 
>>>> is such
>>>>       an uapi
>>>>       >>>>   today.
>>>>       >>>>
>>>>       >>>> There's a limit of something like 64 total engines today 
>>>> based on
>>>>       the
>>>>       >>>> number of bits we can cram into the exec flags in 
>>>> execbuffer2.  I
>>>>       think
>>>>       >>>> someone had an extended version that allowed more but I 
>>>> ripped it
>>>>       out
>>>>       >>>> because no one was using it.  Of course, execbuffer3 
>>>> might not
>>>>       >>>>have that
>>>>       >>>> problem at all.
>>>>       >>>>
>>>>       >>>
>>>>       >>>Thanks Jason.
>>>>       >>>Ok, I am not sure which exec flag is that, but yah, 
>>>> execbuffer3
>>>>       probably
>>>>       >>>will not have this limiation. So, we need to define a
>>>>       VM_BIND_MAX_QUEUE
>>>>       >>>and somehow export it to user (I am thinking of embedding 
>>>> it in
>>>>       >>>I915_PARAM_HAS_VM_BIND. bits[0]->HAS_VM_BIND, bits[1-3]->'n'
>>>>       meaning 2^n
>>>>       >>>queues.
>>>>       >>
>>>>       >>Ah, I think you are waking about I915_EXEC_RING_MASK (0x3f) 
>>>> which
>>>>       execbuf3
>>>>
>>>>     Yup!  That's exactly the limit I was talking about.
>>>>
>>>>       >>will also have. So, we can simply define in vm_bind/unbind
>>>>       structures,
>>>>       >>
>>>>       >>#define I915_VM_BIND_MAX_QUEUE   64
>>>>       >>        __u32 queue;
>>>>       >>
>>>>       >>I think that will keep things simple.
>>>>       >
>>>>       >Hmmm? What does execbuf2 limit has to do with how many engines
>>>>       >hardware can have? I suggest not to do that.
>>>>       >
>>>>       >Change with added this:
>>>>       >
>>>>       >       if (set.num_engines > I915_EXEC_RING_MASK + 1)
>>>>       >               return -EINVAL;
>>>>       >
>>>>       >To context creation needs to be undone and so let users 
>>>> create engine
>>>>       >maps with all hardware engines, and let execbuf3 access them 
>>>> all.
>>>>       >
>>>>
>>>>       Earlier plan was to carry I915_EXEC_RING_MAP (0x3f) to 
>>>> execbuff3 also.
>>>>       Hence, I was using the same limit for VM_BIND queues (64, or 
>>>> 65 if we
>>>>       make it N+1).
>>>>       But, as discussed in other thread of this RFC series, we are 
>>>> planning
>>>>       to drop this I915_EXEC_RING_MAP in execbuff3. So, there won't be
>>>>       any uapi that limits the number of engines (and hence the 
>>>> vm_bind
>>>>       queues
>>>>       need to be supported).
>>>>
>>>>       If we leave the number of vm_bind queues to be arbitrarily large
>>>>       (__u32 queue_idx) then, we need to have a hashmap for queue 
>>>> (a wq,
>>>>       work_item and a linked list) lookup from the user specified 
>>>> queue
>>>>       index.
>>>>       Other option is to just put some hard limit (say 64 or 65) 
>>>> and use
>>>>       an array of queues in VM (each created upon first use). I 
>>>> prefer this.
>>>>
>>>>     I don't get why a VM_BIND queue is any different from any other 
>>>> queue or
>>>>     userspace-visible kernel object.  But I'll leave those details 
>>>> up to
>>>>     danvet or whoever else might be reviewing the implementation.
>>>>     --Jason
>>>>
>>>>   I kind of agree here. Wouldn't be simpler to have the bind queue 
>>>> created
>>>>   like the others when we build the engine map?
>>>>
>>>>   For userspace it's then just matter of selecting the right queue 
>>>> ID when
>>>>   submitting.
>>>>
>>>>   If there is ever a possibility to have this work on the GPU, it 
>>>> would be
>>>>   all ready.
>>>>
>>>
>>> I did sync offline with Matt Brost on this.
>>> We can add a VM_BIND engine class and let user create VM_BIND 
>>> engines (queues).
>>> The problem is, in i915 engine creating interface is bound to 
>>> gem_context.
>>> So, in vm_bind ioctl, we would need both context_id and queue_idx 
>>> for proper
>>> lookup of the user created engine. This is bit ackward as vm_bind is an
>>> interface to VM (address space) and has nothing to do with gem_context.
>>
>>
>> A gem_context has a single vm object right?
>>
>> Set through I915_CONTEXT_PARAM_VM at creation or given a default one 
>> if not.
>>
>> So it's just like picking up the vm like it's done at execbuffer time 
>> right now : eb->context->vm
>>
>
> Are you suggesting replacing 'vm_id' with 'context_id' in the 
> VM_BIND/UNBIND
> ioctl and probably call it CONTEXT_BIND/UNBIND, because VM can be 
> obtained
> from the context?


Yes, because if we go for engines, they're associated with a context and 
so also associated with the VM bound to the context.


> I think the interface is clean as a interface to VM. It is only that we
> don't have a clean way to create a raw VM_BIND engine (not associated 
> with
> any context) with i915 uapi.
> May be we can add such an interface, but I don't think that is worth it
> (we might as well just use a queue_idx in VM_BIND/UNBIND ioctl as I 
> mentioned
> above).
> Anyone has any thoughts?
>
>>
>>> Another problem is, if two VMs are binding with the same defined 
>>> engine,
>>> binding on VM1 can get unnecessary blocked by binding on VM2 (which 
>>> may be
>>> waiting on its in_fence).
>>
>>
>> Maybe I'm missing something, but how can you have 2 vm objects with a 
>> single gem_context right now?
>>
>
> No, we don't have 2 VMs for a gem_context.
> Say if ctx1 with vm1 and ctx2 with vm2.
> First vm_bind call was for vm1 with q_idx 1 in ctx1 engine map.
> Second vm_bind call was for vm2 with q_idx 2 in ctx2 engine map. If 
> those two queue indicies points to same underlying vm_bind engine,
> then the second vm_bind call gets blocked until the first vm_bind call's
> 'in' fence is triggered and bind completes.
>
> With per VM queues, this is not a problem as two VMs will not endup
> sharing same queue.
>
> BTW, I just posted a updated PATCH series.
> https://www.spinics.net/lists/dri-devel/msg350483.html
>
> Niranjana
>
>>
>>>
>>> So, my preference here is to just add a 'u32 queue' index in 
>>> vm_bind/unbind
>>> ioctl, and the queues are per VM.
>>>
>>> Niranjana
>>>
>>>>   Thanks,
>>>>
>>>>   -Lionel
>>>>
>>>>
>>>>       Niranjana
>>>>
>>>>       >Regards,
>>>>       >
>>>>       >Tvrtko
>>>>       >
>>>>       >>
>>>>       >>Niranjana
>>>>       >>
>>>>       >>>
>>>>       >>>>   I am trying to see how many queues we need and don't 
>>>> want it to
>>>>       be
>>>>       >>>>   arbitrarily
>>>>       >>>>   large and unduely blow up memory usage and complexity 
>>>> in i915
>>>>       driver.
>>>>       >>>>
>>>>       >>>> I expect a Vulkan driver to use at most 2 in the vast 
>>>> majority
>>>>       >>>>of cases. I
>>>>       >>>> could imagine a client wanting to create more than 1 sparse
>>>>       >>>>queue in which
>>>>       >>>> case, it'll be N+1 but that's unlikely. As far as 
>>>> complexity
>>>>       >>>>goes, once
>>>>       >>>> you allow two, I don't think the complexity is going up by
>>>>       >>>>allowing N.  As
>>>>       >>>> for memory usage, creating more queues means more 
>>>> memory.  That's
>>>>       a
>>>>       >>>> trade-off that userspace can make. Again, the expected 
>>>> number
>>>>       >>>>here is 1
>>>>       >>>> or 2 in the vast majority of cases so I don't think you 
>>>> need to
>>>>       worry.
>>>>       >>>
>>>>       >>>Ok, will start with n=3 meaning 8 queues.
>>>>       >>>That would require us create 8 workqueues.
>>>>       >>>We can change 'n' later if required.
>>>>       >>>
>>>>       >>>Niranjana
>>>>       >>>
>>>>       >>>>
>>>>       >>>>   >     Why?  Because Vulkan has two basic kind of bind
>>>>       >>>>operations and we
>>>>       >>>>   don't
>>>>       >>>>   >     want any dependencies between them:
>>>>       >>>>   >      1. Immediate.  These happen right after BO 
>>>> creation or
>>>>       >>>>maybe as
>>>>       >>>>   part of
>>>>       >>>>   >     vkBindImageMemory() or VkBindBufferMemory().  These
>>>>       >>>>don't happen
>>>>       >>>>   on a
>>>>       >>>>   >     queue and we don't want them serialized with 
>>>> anything.       To
>>>>       >>>>   synchronize
>>>>       >>>>   >     with submit, we'll have a syncobj in the 
>>>> VkDevice which
>>>>       is
>>>>       >>>>   signaled by
>>>>       >>>>   >     all immediate bind operations and make submits 
>>>> wait on
>>>>       it.
>>>>       >>>>   >      2. Queued (sparse): These happen on a VkQueue 
>>>> which may
>>>>       be the
>>>>       >>>>   same as
>>>>       >>>>   >     a render/compute queue or may be its own queue.  
>>>> It's up
>>>>       to us
>>>>       >>>>   what we
>>>>       >>>>   >     want to advertise.  From the Vulkan API PoV, 
>>>> this is like
>>>>       any
>>>>       >>>>   other
>>>>       >>>>   >     queue.  Operations on it wait on and signal 
>>>> semaphores.       If we
>>>>       >>>>   have a
>>>>       >>>>   >     VM_BIND engine, we'd provide syncobjs to wait and
>>>>       >>>>signal just like
>>>>       >>>>   we do
>>>>       >>>>   >     in execbuf().
>>>>       >>>>   >     The important thing is that we don't want one 
>>>> type of
>>>>       >>>>operation to
>>>>       >>>>   block
>>>>       >>>>   >     on the other.  If immediate binds are blocking 
>>>> on sparse
>>>>       binds,
>>>>       >>>>   it's
>>>>       >>>>   >     going to cause over-synchronization issues.
>>>>       >>>>   >     In terms of the internal implementation, I know 
>>>> that
>>>>       >>>>there's going
>>>>       >>>>   to be
>>>>       >>>>   >     a lock on the VM and that we can't actually do 
>>>> these
>>>>       things in
>>>>       >>>>   >     parallel.  That's fine. Once the dma_fences have
>>>>       signaled and
>>>>       >>>>   we're
>>>>       >>>>
>>>>       >>>>   That's correct. It is like a single VM_BIND engine with
>>>>       >>>>multiple queues
>>>>       >>>>   feeding to it.
>>>>       >>>>
>>>>       >>>> Right.  As long as the queues themselves are independent 
>>>> and
>>>>       >>>>can block on
>>>>       >>>> dma_fences without holding up other queues, I think 
>>>> we're fine.
>>>>       >>>>
>>>>       >>>>   >     unblocked to do the bind operation, I don't care if
>>>>       >>>>there's a bit
>>>>       >>>>   of
>>>>       >>>>   >     synchronization due to locking.  That's 
>>>> expected.  What
>>>>       >>>>we can't
>>>>       >>>>   afford
>>>>       >>>>   >     to have is an immediate bind operation suddenly 
>>>> blocking
>>>>       on a
>>>>       >>>>   sparse
>>>>       >>>>   >     operation which is blocked on a compute job 
>>>> that's going
>>>>       to run
>>>>       >>>>   for
>>>>       >>>>   >     another 5ms.
>>>>       >>>>
>>>>       >>>>   As the VM_BIND queue is per VM, VM_BIND on one VM 
>>>> doesn't block
>>>>       the
>>>>       >>>>   VM_BIND
>>>>       >>>>   on other VMs. I am not sure about use cases here, but just
>>>>       wanted to
>>>>       >>>>   clarify.
>>>>       >>>>
>>>>       >>>> Yes, that's what I would expect.
>>>>       >>>> --Jason
>>>>       >>>>
>>>>       >>>>   Niranjana
>>>>       >>>>
>>>>       >>>>   >     For reference, Windows solves this by allowing
>>>>       arbitrarily many
>>>>       >>>>   paging
>>>>       >>>>   >     queues (what they call a VM_BIND engine/queue).  
>>>> That
>>>>       >>>>design works
>>>>       >>>>   >     pretty well and solves the problems in question. 
>>>>       >>>>Again, we could
>>>>       >>>>   just
>>>>       >>>>   >     make everything out-of-order and require using 
>>>> syncobjs
>>>>       >>>>to order
>>>>       >>>>   things
>>>>       >>>>   >     as userspace wants. That'd be fine too.
>>>>       >>>>   >     One more note while I'm here: danvet said 
>>>> something on
>>>>       >>>>IRC about
>>>>       >>>>   VM_BIND
>>>>       >>>>   >     queues waiting for syncobjs to materialize.  We 
>>>> don't
>>>>       really
>>>>       >>>>   want/need
>>>>       >>>>   >     this.  We already have all the machinery in 
>>>> userspace to
>>>>       handle
>>>>       >>>>   >     wait-before-signal and waiting for syncobj 
>>>> fences to
>>>>       >>>>materialize
>>>>       >>>>   and
>>>>       >>>>   >     that machinery is on by default.  It would actually
>>>>       >>>>take MORE work
>>>>       >>>>   in
>>>>       >>>>   >     Mesa to turn it off and take advantage of the 
>>>> kernel
>>>>       >>>>being able to
>>>>       >>>>   wait
>>>>       >>>>   >     for syncobjs to materialize. Also, getting that 
>>>> right is
>>>>       >>>>   ridiculously
>>>>       >>>>   >     hard and I really don't want to get it wrong in 
>>>> kernel
>>>>       >>>>space. When we
>>>>       >>>>   >     do memory fences, wait-before-signal will be a 
>>>> thing.  We
>>>>       don't
>>>>       >>>>   need to
>>>>       >>>>   >     try and make it a thing for syncobj.
>>>>       >>>>   >     --Jason
>>>>       >>>>   >
>>>>       >>>>   >   Thanks Jason,
>>>>       >>>>   >
>>>>       >>>>   >   I missed the bit in the Vulkan spec that we're 
>>>> allowed to
>>>>       have a
>>>>       >>>>   sparse
>>>>       >>>>   >   queue that does not implement either graphics or 
>>>> compute
>>>>       >>>>operations
>>>>       >>>>   :
>>>>       >>>>   >
>>>>       >>>>   >     "While some implementations may include
>>>>       >>>>   VK_QUEUE_SPARSE_BINDING_BIT
>>>>       >>>>   >     support in queue families that also include
>>>>       >>>>   >
>>>>       >>>>   >      graphics and compute support, other 
>>>> implementations may
>>>>       only
>>>>       >>>>   expose a
>>>>       >>>>   > VK_QUEUE_SPARSE_BINDING_BIT-only queue
>>>>       >>>>   >
>>>>       >>>>   >      family."
>>>>       >>>>   >
>>>>       >>>>   >   So it can all be a vm_bind engine that just does
>>>>       bind/unbind
>>>>       >>>>   >   operations.
>>>>       >>>>   >
>>>>       >>>>   >   But yes we need another engine for the 
>>>> immediate/non-sparse
>>>>       >>>>   operations.
>>>>       >>>>   >
>>>>       >>>>   >   -Lionel
>>>>       >>>>   >
>>>>       >>>>   >         >
>>>>       >>>>   >       Daniel, any thoughts?
>>>>       >>>>   >
>>>>       >>>>   >       Niranjana
>>>>       >>>>   >
>>>>       >>>>   >       >Matt
>>>>       >>>>   >       >
>>>>       >>>>   >       >>
>>>>       >>>>   >       >> Sorry I noticed this late.
>>>>       >>>>   >       >>
>>>>       >>>>   >       >>
>>>>       >>>>   >       >> -Lionel
>>>>       >>>>   >       >>
>>>>       >>>>   >       >>
>>
>>


^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [RFC v3 3/3] drm/doc/rfc: VM_BIND uapi definition
  2022-05-17 18:32   ` Niranjana Vishwanathapura
@ 2022-06-10  8:34     ` Matthew Brost
  -1 siblings, 0 replies; 121+ messages in thread
From: Matthew Brost @ 2022-06-10  8:34 UTC (permalink / raw)
  To: Niranjana Vishwanathapura
  Cc: intel-gfx, chris.p.wilson, thomas.hellstrom, dri-devel, jason,
	daniel.vetter, christian.koenig

On Tue, May 17, 2022 at 11:32:12AM -0700, Niranjana Vishwanathapura wrote:
> VM_BIND and related uapi definitions
> 
> v2: Ensure proper kernel-doc formatting with cross references.
>     Also add new uapi and documentation as per review comments
>     from Daniel.
> 
> Signed-off-by: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>
> ---
>  Documentation/gpu/rfc/i915_vm_bind.h | 399 +++++++++++++++++++++++++++
>  1 file changed, 399 insertions(+)
>  create mode 100644 Documentation/gpu/rfc/i915_vm_bind.h
> 
> diff --git a/Documentation/gpu/rfc/i915_vm_bind.h b/Documentation/gpu/rfc/i915_vm_bind.h
> new file mode 100644
> index 000000000000..589c0a009107
> --- /dev/null
> +++ b/Documentation/gpu/rfc/i915_vm_bind.h
> @@ -0,0 +1,399 @@
> +/* SPDX-License-Identifier: MIT */
> +/*
> + * Copyright © 2022 Intel Corporation
> + */
> +
> +/**
> + * DOC: I915_PARAM_HAS_VM_BIND
> + *
> + * VM_BIND feature availability.
> + * See typedef drm_i915_getparam_t param.
> + */
> +#define I915_PARAM_HAS_VM_BIND		57
> +
> +/**
> + * DOC: I915_VM_CREATE_FLAGS_USE_VM_BIND
> + *
> + * Flag to opt-in for VM_BIND mode of binding during VM creation.
> + * See struct drm_i915_gem_vm_control flags.
> + *
> + * A VM in VM_BIND mode will not support the older execbuff mode of binding.
> + * In VM_BIND mode, execbuff ioctl will not accept any execlist (i.e., the
> + * &drm_i915_gem_execbuffer2.buffer_count must be 0).
> + * Also, &drm_i915_gem_execbuffer2.batch_start_offset and
> + * &drm_i915_gem_execbuffer2.batch_len must be 0.
> + * DRM_I915_GEM_EXECBUFFER_EXT_BATCH_ADDRESSES extension must be provided
> + * to pass in the batch buffer addresses.
> + *
> + * Additionally, I915_EXEC_NO_RELOC, I915_EXEC_HANDLE_LUT and
> + * I915_EXEC_BATCH_FIRST of &drm_i915_gem_execbuffer2.flags must be 0
> + * (not used) in VM_BIND mode. I915_EXEC_USE_EXTENSIONS flag must always be
> + * set (See struct drm_i915_gem_execbuffer_ext_batch_addresses).
> + * The buffers_ptr, buffer_count, batch_start_offset and batch_len fields
> + * of struct drm_i915_gem_execbuffer2 are also not used and must be 0.
> + */
> +#define I915_VM_CREATE_FLAGS_USE_VM_BIND	(1 << 0)
> +
> +/**
> + * DOC: I915_CONTEXT_CREATE_FLAGS_LONG_RUNNING
> + *
> + * Flag to declare context as long running.
> + * See struct drm_i915_gem_context_create_ext flags.
> + *
> + * Usage of dma-fence expects that they complete in a reasonable amount of time.
> + * Compute on the other hand can be long running. Hence it is not appropriate
> + * for compute contexts to export request completion dma-fence to user.
> + * The dma-fence usage will be limited to in-kernel consumption only.
> + * Compute contexts need to use user/memory fence.
> + *
> + * So, long running contexts do not support output fences. Hence,
> + * I915_EXEC_FENCE_OUT (See &drm_i915_gem_execbuffer2.flags) and
> + * I915_EXEC_FENCE_SIGNAL (See &drm_i915_gem_exec_fence.flags) are expected
> + * not to be used.
> + *
> + * DRM_I915_GEM_WAIT ioctl call is also not supported for objects mapped
> + * to long running contexts.
> + */
> +#define I915_CONTEXT_CREATE_FLAGS_LONG_RUNNING   (1u << 2)
> +
> +/* VM_BIND related ioctls */
> +#define DRM_I915_GEM_VM_BIND		0x3d
> +#define DRM_I915_GEM_VM_UNBIND		0x3e
> +#define DRM_I915_GEM_WAIT_USER_FENCE	0x3f
> +
> +#define DRM_IOCTL_I915_GEM_VM_BIND		DRM_IOWR(DRM_COMMAND_BASE + DRM_I915_GEM_VM_BIND, struct drm_i915_gem_vm_bind)
> +#define DRM_IOCTL_I915_GEM_VM_UNBIND		DRM_IOWR(DRM_COMMAND_BASE + DRM_I915_GEM_VM_UNBIND, struct drm_i915_gem_vm_bind)
> +#define DRM_IOCTL_I915_GEM_WAIT_USER_FENCE	DRM_IOWR(DRM_COMMAND_BASE + DRM_I915_GEM_WAIT_USER_FENCE, struct drm_i915_gem_wait_user_fence)
> +
> +/**
> + * struct drm_i915_gem_vm_bind - VA to object mapping to bind.
> + *
> + * This structure is passed to VM_BIND ioctl and specifies the mapping of GPU
> + * virtual address (VA) range to the section of an object that should be bound
> + * in the device page table of the specified address space (VM).
> + * The VA range specified must be unique (i.e., not currently bound) and can
> + * be mapped to the whole object or a section of the object (partial binding).
> + * Multiple VA mappings can be created to the same section of the object
> + * (aliasing).
> + */
> +struct drm_i915_gem_vm_bind {
> +	/** @vm_id: VM (address space) id to bind */
> +	__u32 vm_id;
> +
> +	/** @handle: Object handle */
> +	__u32 handle;
> +
> +	/** @start: Virtual Address start to bind */
> +	__u64 start;
> +
> +	/** @offset: Offset in object to bind */
> +	__u64 offset;
> +
> +	/** @length: Length of mapping to bind */
> +	__u64 length;
> +
> +	/**
> +	 * @flags: Supported flags are,
> +	 *
> +	 * I915_GEM_VM_BIND_READONLY:
> +	 * Mapping is read-only.
> +	 *
> +	 * I915_GEM_VM_BIND_CAPTURE:
> +	 * Capture this mapping in the dump upon GPU error.
> +	 */
> +	__u64 flags;
> +#define I915_GEM_VM_BIND_READONLY    (1 << 0)
> +#define I915_GEM_VM_BIND_CAPTURE     (1 << 1)
> +
> +	/** @extensions: 0-terminated chain of extensions for this mapping. */
> +	__u64 extensions;
> +};
> +
> +/**
> + * struct drm_i915_gem_vm_unbind - VA to object mapping to unbind.
> + *
> + * This structure is passed to VM_UNBIND ioctl and specifies the GPU virtual
> + * address (VA) range that should be unbound from the device page table of the
> + * specified address space (VM). The specified VA range must match one of the
> + * mappings created with the VM_BIND ioctl. TLB is flushed upon unbind
> + * completion.
> + */
> +struct drm_i915_gem_vm_unbind {
> +	/** @vm_id: VM (address space) id to bind */
> +	__u32 vm_id;
> +
> +	/** @rsvd: Reserved for future use; must be zero. */
> +	__u32 rsvd;
> +
> +	/** @start: Virtual Address start to unbind */
> +	__u64 start;
> +
> +	/** @length: Length of mapping to unbind */
> +	__u64 length;

This probably isn't needed. We are never going to unbind a subset of a
VMA, are we? That being said, it can't hurt as a sanity check (e.g.
internal vma->length == user unbind length).
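Something like this in the unbind path (sketch; the lookup helper and
field names here are made up, not actual i915 internals):

	/* Hypothetical sanity check while handling VM_UNBIND */
	vma = vm_bind_lookup_vma(vm, args->start);	/* assumed helper */
	if (!vma || vma->node.start != args->start ||
	    vma->node.size != args->length)
		return -EINVAL;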

> +
> +	/** @flags: reserved for future usage, currently MBZ */
> +	__u64 flags;
> +
> +	/** @extensions: 0-terminated chain of extensions for this mapping. */
> +	__u64 extensions;
> +};
> +
> +/**
> + * struct drm_i915_vm_bind_fence - An input or output fence for the vm_bind
> + * or the vm_unbind work.
> + *
> + * The vm_bind or vm_unbind async worker will wait for the input fence to signal
> + * before starting the binding or unbinding.
> + *
> + * The vm_bind or vm_unbind async worker will signal the returned output fence
> + * after the completion of binding or unbinding.
> + */
> +struct drm_i915_vm_bind_fence {
> +	/** @handle: User's handle for a drm_syncobj to wait on or signal. */
> +	__u32 handle;
> +
> +	/**
> +	 * @flags: Supported flags are,
> +	 *
> +	 * I915_VM_BIND_FENCE_WAIT:
> +	 * Wait for the input fence before binding/unbinding
> +	 *
> +	 * I915_VM_BIND_FENCE_SIGNAL:
> +	 * Return bind/unbind completion fence as output
> +	 */
> +	__u32 flags;
> +#define I915_VM_BIND_FENCE_WAIT            (1<<0)
> +#define I915_VM_BIND_FENCE_SIGNAL          (1<<1)
> +#define __I915_VM_BIND_FENCE_UNKNOWN_FLAGS (-(I915_VM_BIND_FENCE_SIGNAL << 1))
> +};
> +
> +/**
> + * struct drm_i915_vm_bind_ext_timeline_fences - Timeline fences for vm_bind
> + * and vm_unbind.
> + *
> + * This structure describes an array of timeline drm_syncobj and associated
> + * points for timeline variants of drm_syncobj. These timeline 'drm_syncobj's
> + * can be input or output fences (See struct drm_i915_vm_bind_fence).
> + */
> +struct drm_i915_vm_bind_ext_timeline_fences {
> +#define I915_VM_BIND_EXT_TIMELINE_FENCES	0
> +	/** @base: Extension link. See struct i915_user_extension. */
> +	struct i915_user_extension base;
> +
> +	/**
> +	 * @fence_count: Number of elements in the @handles_ptr & @values_ptr
> +	 * arrays.
> +	 */
> +	__u64 fence_count;
> +
> +	/**
> +	 * @handles_ptr: Pointer to an array of struct drm_i915_vm_bind_fence
> +	 * of length @fence_count.
> +	 */
> +	__u64 handles_ptr;
> +
> +	/**
> +	 * @values_ptr: Pointer to an array of u64 values of length
> +	 * @fence_count.
> +	 * Values must be 0 for a binary drm_syncobj. A value of 0 for a
> +	 * timeline drm_syncobj is invalid as it turns a drm_syncobj into a
> +	 * binary one.
> +	 */
> +	__u64 values_ptr;
> +};
> +
> +/**
> + * struct drm_i915_vm_bind_user_fence - An input or output user fence for the
> + * vm_bind or the vm_unbind work.
> + *
> + * The vm_bind or vm_unbind async worker will wait for the input fence (value at
> + * @addr to become equal to @val) before starting the binding or unbinding.
> + *
> + * The vm_bind or vm_unbind async worker will signal the output fence after
> + * the completion of binding or unbinding by writing @val to the memory
> + * location at @addr.
> + */
> +struct drm_i915_vm_bind_user_fence {
> +	/** @addr: User/Memory fence qword aligned process virtual address */
> +	__u64 addr;
> +
> +	/** @val: User/Memory fence value to be written after bind completion */
> +	__u64 val;
> +
> +	/**
> +	 * @flags: Supported flags are,
> +	 *
> +	 * I915_VM_BIND_USER_FENCE_WAIT:
> +	 * Wait for the input fence before binding/unbinding
> +	 *
> +	 * I915_VM_BIND_USER_FENCE_SIGNAL:
> +	 * Return bind/unbind completion fence as output
> +	 */
> +	__u32 flags;
> +#define I915_VM_BIND_USER_FENCE_WAIT            (1<<0)
> +#define I915_VM_BIND_USER_FENCE_SIGNAL          (1<<1)
> +#define __I915_VM_BIND_USER_FENCE_UNKNOWN_FLAGS \
> +	(-(I915_VM_BIND_USER_FENCE_SIGNAL << 1))
> +};
> +
> +/**
> + * struct drm_i915_vm_bind_ext_user_fence - User/memory fences for vm_bind
> + * and vm_unbind.
> + *
> + * These user fences can be input or output fences
> + * (See struct drm_i915_vm_bind_user_fence).
> + */
> +struct drm_i915_vm_bind_ext_user_fence {
> +#define I915_VM_BIND_EXT_USER_FENCES	1
> +	/** @base: Extension link. See struct i915_user_extension. */
> +	struct i915_user_extension base;
> +
> +	/** @fence_count: Number of elements in the @user_fence_ptr array. */
> +	__u64 fence_count;
> +
> +	/**
> +	 * @user_fence_ptr: Pointer to an array of
> +	 * struct drm_i915_vm_bind_user_fence of length @fence_count.
> +	 */
> +	__u64 user_fence_ptr;
> +};
> +

IMO all of these fence structs should be a generic sync interface shared
between both vm bind and exec3 rather than unique extensions.

Both vm bind and exec3 should have something like this:

__u64 syncs;	/* userptr to an array of generic syncs */
__u64 n_syncs;	/* number of syncs in the array */

Having an array of syncs lets the kernel do one user copy for all the
syncs rather than reading them in a chain.

A generic sync object encapsulates all possible syncs (in / out -
syncobj, syncobj timeline, ufence, future sync concepts).

e.g.

struct {
	__u32 user_ext;
	__u32 flag;	/* in / out, type, whatever else info we need */
	union {
		__u32 handle; 	/* to syncobj */
		__u64 addr; 	/* ufence address */
	};
	__u64 seqno;	/* syncobj timeline, ufence write value */
	...reserve enough bits for future...
}

This unifies binds and execs by using the same sync interface,
instilling the concept that binds and execs are the same op (a queued
operation w/ in/out fences).
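
For example, both ioctls could then carry the array directly (sketch;
the field names are placeholders, not settled uapi):

struct drm_i915_gem_vm_bind {
	...
	__u64 syncs;	/* userptr to array of the generic sync struct above */
	__u64 n_syncs;	/* number of elements in that array */
};

Userspace would then fill a single array mixing sync types, e.g. one
syncobj 'in' fence plus one ufence 'out' fence, instead of chaining two
different extension structs.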

Matt

> +/**
> + * struct drm_i915_gem_execbuffer_ext_batch_addresses - Array of batch buffer
> + * gpu virtual addresses.
> + *
> + * In the execbuff ioctl (See struct drm_i915_gem_execbuffer2), this extension
> + * must always be appended in VM_BIND mode and it will be an error to
> + * append this extension in the older non-VM_BIND mode.
> + */
> +struct drm_i915_gem_execbuffer_ext_batch_addresses {
> +#define DRM_I915_GEM_EXECBUFFER_EXT_BATCH_ADDRESSES	1
> +	/** @base: Extension link. See struct i915_user_extension. */
> +	struct i915_user_extension base;
> +
> +	/** @count: Number of addresses in the addr array. */
> +	__u32 count;
> +
> +	/** @addr: An array of batch gpu virtual addresses. */
> +	__u64 addr[0];
> +};
> +
> +/**
> + * struct drm_i915_gem_execbuffer_ext_user_fence - First level batch completion
> + * signaling extension.
> + *
> + * This extension allows user to attach a user fence (@addr, @value pair) to an
> + * execbuf to be signaled by the command streamer after the completion of first
> + * level batch, by writing the @value at specified @addr and triggering an
> + * interrupt.
> + * User can either poll for this user fence to signal or wait on it
> + * with the i915_gem_wait_user_fence ioctl.
> + * This is very useful for long running contexts where waiting on dma-fence
> + * by user (like the i915_gem_wait ioctl) is not supported.
> + */
> +struct drm_i915_gem_execbuffer_ext_user_fence {
> +#define DRM_I915_GEM_EXECBUFFER_EXT_USER_FENCE		2
> +	/** @base: Extension link. See struct i915_user_extension. */
> +	struct i915_user_extension base;
> +
> +	/**
> +	 * @addr: User/Memory fence qword aligned GPU virtual address.
> +	 *
> +	 * Address has to be a valid GPU virtual address at the time of
> +	 * first level batch completion.
> +	 */
> +	__u64 addr;
> +
> +	/**
> +	 * @value: User/Memory fence value to be written to the above address
> +	 * after first level batch completes.
> +	 */
> +	__u64 value;
> +
> +	/** @rsvd: Reserved for future extensions, MBZ */
> +	__u64 rsvd;
> +};
> +
> +/**
> + * struct drm_i915_gem_create_ext_vm_private - Extension to make the object
> + * private to the specified VM.
> + *
> + * See struct drm_i915_gem_create_ext.
> + */
> +struct drm_i915_gem_create_ext_vm_private {
> +#define I915_GEM_CREATE_EXT_VM_PRIVATE		2
> +	/** @base: Extension link. See struct i915_user_extension. */
> +	struct i915_user_extension base;
> +
> +	/** @vm_id: Id of the VM to which the object is private */
> +	__u32 vm_id;
> +};
> +
> +/**
> + * struct drm_i915_gem_wait_user_fence - Wait on user/memory fence.
> + *
> + * User/Memory fence can be woken up either by:
> + *
> + * 1. GPU context indicated by @ctx_id, or,
> + * 2. Kernel driver async worker upon I915_UFENCE_WAIT_SOFT.
> + *    @ctx_id is ignored when this flag is set.
> + *
> + * Wakeup condition is,
> + * ``((*addr & mask) op (value & mask))``
> + *
> + * See :ref:`Documentation/driver-api/dma-buf.rst <indefinite_dma_fences>`
> + */
> +struct drm_i915_gem_wait_user_fence {
> +	/** @extensions: Zero-terminated chain of extensions. */
> +	__u64 extensions;
> +
> +	/** @addr: User/Memory fence address */
> +	__u64 addr;
> +
> +	/** @ctx_id: Id of the Context which will signal the fence. */
> +	__u32 ctx_id;
> +
> +	/** @op: Wakeup condition operator */
> +	__u16 op;
> +#define I915_UFENCE_WAIT_EQ      0
> +#define I915_UFENCE_WAIT_NEQ     1
> +#define I915_UFENCE_WAIT_GT      2
> +#define I915_UFENCE_WAIT_GTE     3
> +#define I915_UFENCE_WAIT_LT      4
> +#define I915_UFENCE_WAIT_LTE     5
> +#define I915_UFENCE_WAIT_BEFORE  6
> +#define I915_UFENCE_WAIT_AFTER   7
> +
> +	/**
> +	 * @flags: Supported flags are,
> +	 *
> +	 * I915_UFENCE_WAIT_SOFT:
> +	 *
> +	 * To be woken up by i915 driver async worker (not by GPU).
> +	 *
> +	 * I915_UFENCE_WAIT_ABSTIME:
> +	 *
> +	 * Wait timeout specified as absolute time.
> +	 */
> +	__u16 flags;
> +#define I915_UFENCE_WAIT_SOFT    0x1
> +#define I915_UFENCE_WAIT_ABSTIME 0x2
> +
> +	/** @value: Wakeup value */
> +	__u64 value;
> +
> +	/** @mask: Wakeup mask */
> +	__u64 mask;
> +#define I915_UFENCE_WAIT_U8     0xffu
> +#define I915_UFENCE_WAIT_U16    0xffffu
> +#define I915_UFENCE_WAIT_U32    0xfffffffful
> +#define I915_UFENCE_WAIT_U64    0xffffffffffffffffull
> +
> +	/**
> +	 * @timeout: Wait timeout in nanoseconds.
> +	 *
> +	 * If I915_UFENCE_WAIT_ABSTIME flag is set, then the timeout is the
> +	 * absolute time in nsec.
> +	 */
> +	__s64 timeout;
> +};
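
Side note: spelled out in C, the wakeup condition documented above would
be roughly the following, e.g. for the I915_UFENCE_WAIT_GT op (sketch
only; 'args' stands for the ioctl argument struct):

	/* signaled when ((*addr & mask) op (value & mask)) holds */
	__u64 cur = *(volatile __u64 *)(uintptr_t)args->addr;
	bool signaled = (cur & args->mask) > (args->value & args->mask);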
> -- 
> 2.21.0.rc0.32.g243a4c7e27
> 

^ permalink raw reply	[flat|nested] 121+ messages in thread


* Re: [Intel-gfx] [RFC v3 3/3] drm/doc/rfc: VM_BIND uapi definition
  2022-06-09 18:53                 ` Niranjana Vishwanathapura
  (?)
@ 2022-06-10 10:16                 ` Tvrtko Ursulin
  2022-06-10 10:32                   ` Matthew Auld
  -1 siblings, 1 reply; 121+ messages in thread
From: Tvrtko Ursulin @ 2022-06-10 10:16 UTC (permalink / raw)
  To: Niranjana Vishwanathapura, Matthew Auld
  Cc: intel-gfx, chris.p.wilson, thomas.hellstrom, dri-devel,
	daniel.vetter, christian.koenig


On 09/06/2022 19:53, Niranjana Vishwanathapura wrote:
> On Thu, Jun 09, 2022 at 09:36:48AM +0100, Matthew Auld wrote:
>> On 08/06/2022 22:32, Niranjana Vishwanathapura wrote:
>>> On Wed, Jun 08, 2022 at 10:12:05AM +0100, Matthew Auld wrote:
>>>> On 08/06/2022 08:17, Tvrtko Ursulin wrote:
>>>>>
>>>>> On 07/06/2022 20:37, Niranjana Vishwanathapura wrote:
>>>>>> On Tue, Jun 07, 2022 at 11:27:14AM +0100, Tvrtko Ursulin wrote:
>>>>>>>
>>>>>>> On 17/05/2022 19:32, Niranjana Vishwanathapura wrote:
>>>>>>>> VM_BIND and related uapi definitions
>>>>>>>>
>>>>>>>> v2: Ensure proper kernel-doc formatting with cross references.
>>>>>>>>     Also add new uapi and documentation as per review comments
>>>>>>>>     from Daniel.
>>>>>>>>
>>>>>>>> Signed-off-by: Niranjana Vishwanathapura 
>>>>>>>> <niranjana.vishwanathapura@intel.com>
>>>>>>>> ---
>>>>>>>>  Documentation/gpu/rfc/i915_vm_bind.h | 399 
>>>>>>>> +++++++++++++++++++++++++++
>>>>>>>>  1 file changed, 399 insertions(+)
>>>>>>>>  create mode 100644 Documentation/gpu/rfc/i915_vm_bind.h
>>>>>>>>
>>>>>>>> diff --git a/Documentation/gpu/rfc/i915_vm_bind.h 
>>>>>>>> b/Documentation/gpu/rfc/i915_vm_bind.h
>>>>>>>> new file mode 100644
>>>>>>>> index 000000000000..589c0a009107
>>>>>>>> --- /dev/null
>>>>>>>> +++ b/Documentation/gpu/rfc/i915_vm_bind.h
>>>>>>>> @@ -0,0 +1,399 @@
>>>>>>>> +/* SPDX-License-Identifier: MIT */
>>>>>>>> +/*
>>>>>>>> + * Copyright © 2022 Intel Corporation
>>>>>>>> + */
>>>>>>>> +
>>>>>>>> +/**
>>>>>>>> + * DOC: I915_PARAM_HAS_VM_BIND
>>>>>>>> + *
>>>>>>>> + * VM_BIND feature availability.
>>>>>>>> + * See typedef drm_i915_getparam_t param.
>>>>>>>> + */
>>>>>>>> +#define I915_PARAM_HAS_VM_BIND        57
>>>>>>>> +
>>>>>>>> +/**
>>>>>>>> + * DOC: I915_VM_CREATE_FLAGS_USE_VM_BIND
>>>>>>>> + *
>>>>>>>> + * Flag to opt-in for VM_BIND mode of binding during VM creation.
>>>>>>>> + * See struct drm_i915_gem_vm_control flags.
>>>>>>>> + *
>>>>>>>> + * A VM in VM_BIND mode will not support the older execbuff 
>>>>>>>> mode of binding.
>>>>>>>> + * In VM_BIND mode, execbuff ioctl will not accept any execlist 
>>>>>>>> (i.e., the
>>>>>>>> + * &drm_i915_gem_execbuffer2.buffer_count must be 0).
>>>>>>>> + * Also, &drm_i915_gem_execbuffer2.batch_start_offset and
>>>>>>>> + * &drm_i915_gem_execbuffer2.batch_len must be 0.
>>>>>>>> + * DRM_I915_GEM_EXECBUFFER_EXT_BATCH_ADDRESSES extension must 
>>>>>>>> be provided
>>>>>>>> + * to pass in the batch buffer addresses.
>>>>>>>> + *
>>>>>>>> + * Additionally, I915_EXEC_NO_RELOC, I915_EXEC_HANDLE_LUT and
>>>>>>>> + * I915_EXEC_BATCH_FIRST of &drm_i915_gem_execbuffer2.flags 
>>>>>>>> must be 0
>>>>>>>> + * (not used) in VM_BIND mode. I915_EXEC_USE_EXTENSIONS flag 
>>>>>>>> must always be
>>>>>>>> + * set (See struct drm_i915_gem_execbuffer_ext_batch_addresses).
>>>>>>>> + * The buffers_ptr, buffer_count, batch_start_offset and 
>>>>>>>> batch_len fields
>>>>>>>> + * of struct drm_i915_gem_execbuffer2 are also not used and 
>>>>>>>> must be 0.
>>>>>>>> + */
>>>>>>>> +#define I915_VM_CREATE_FLAGS_USE_VM_BIND    (1 << 0)
>>>>>>>> +
>>>>>>>> +/**
>>>>>>>> + * DOC: I915_CONTEXT_CREATE_FLAGS_LONG_RUNNING
>>>>>>>> + *
>>>>>>>> + * Flag to declare context as long running.
>>>>>>>> + * See struct drm_i915_gem_context_create_ext flags.
>>>>>>>> + *
>>>>>>>> + * Usage of dma-fence expects that they complete in a reasonable 
>>>>>>>> amount of time.
>>>>>>>> + * Compute on the other hand can be long running. Hence it is 
>>>>>>>> not appropriate
>>>>>>>> + * for compute contexts to export request completion dma-fence 
>>>>>>>> to user.
>>>>>>>> + * The dma-fence usage will be limited to in-kernel consumption 
>>>>>>>> only.
>>>>>>>> + * Compute contexts need to use user/memory fence.
>>>>>>>> + *
>>>>>>>> + * So, long running contexts do not support output fences. Hence,
>>>>>>>> + * I915_EXEC_FENCE_OUT (See &drm_i915_gem_execbuffer2.flags) and
>>>>>>>> + * I915_EXEC_FENCE_SIGNAL (See &drm_i915_gem_exec_fence.flags) 
>>>>>>>> are expected
>>>>>>>> + * not to be used.
>>>>>>>> + *
>>>>>>>> + * DRM_I915_GEM_WAIT ioctl call is also not supported for 
>>>>>>>> objects mapped
>>>>>>>> + * to long running contexts.
>>>>>>>> + */
>>>>>>>> +#define I915_CONTEXT_CREATE_FLAGS_LONG_RUNNING   (1u << 2)
>>>>>>>> +
>>>>>>>> +/* VM_BIND related ioctls */
>>>>>>>> +#define DRM_I915_GEM_VM_BIND        0x3d
>>>>>>>> +#define DRM_I915_GEM_VM_UNBIND        0x3e
>>>>>>>> +#define DRM_I915_GEM_WAIT_USER_FENCE    0x3f
>>>>>>>> +
>>>>>>>> +#define DRM_IOCTL_I915_GEM_VM_BIND DRM_IOWR(DRM_COMMAND_BASE + 
>>>>>>>> DRM_I915_GEM_VM_BIND, struct drm_i915_gem_vm_bind)
>>>>>>>> +#define DRM_IOCTL_I915_GEM_VM_UNBIND DRM_IOWR(DRM_COMMAND_BASE 
>>>>>>>> + DRM_I915_GEM_VM_UNBIND, struct drm_i915_gem_vm_bind)
>>>>>>>> +#define DRM_IOCTL_I915_GEM_WAIT_USER_FENCE 
>>>>>>>> DRM_IOWR(DRM_COMMAND_BASE + DRM_I915_GEM_WAIT_USER_FENCE, struct 
>>>>>>>> drm_i915_gem_wait_user_fence)
>>>>>>>> +
>>>>>>>> +/**
>>>>>>>> + * struct drm_i915_gem_vm_bind - VA to object mapping to bind.
>>>>>>>> + *
>>>>>>>> + * This structure is passed to VM_BIND ioctl and specifies the 
>>>>>>>> mapping of GPU
>>>>>>>> + * virtual address (VA) range to the section of an object that 
>>>>>>>> should be bound
>>>>>>>> + * in the device page table of the specified address space (VM).
>>>>>>>> + * The VA range specified must be unique (i.e., not currently 
>>>>>>>> bound) and can
>>>>>>>> + * be mapped to the whole object or a section of the object 
>>>>>>>> (partial binding).
>>>>>>>> + * Multiple VA mappings can be created to the same section of 
>>>>>>>> the object
>>>>>>>> + * (aliasing).
>>>>>>>> + */
>>>>>>>> +struct drm_i915_gem_vm_bind {
>>>>>>>> +    /** @vm_id: VM (address space) id to bind */
>>>>>>>> +    __u32 vm_id;
>>>>>>>> +
>>>>>>>> +    /** @handle: Object handle */
>>>>>>>> +    __u32 handle;
>>>>>>>> +
>>>>>>>> +    /** @start: Virtual Address start to bind */
>>>>>>>> +    __u64 start;
>>>>>>>> +
>>>>>>>> +    /** @offset: Offset in object to bind */
>>>>>>>> +    __u64 offset;
>>>>>>>> +
>>>>>>>> +    /** @length: Length of mapping to bind */
>>>>>>>> +    __u64 length;
>>>>>>>
>>>>>>> Does it support, or should it, an equivalent of 
>>>>>>> EXEC_OBJECT_PAD_TO_SIZE? Or, if not, is userspace expected to map 
>>>>>>> the remainder of the space to a dummy object? In which case would 
>>>>>>> there be any alignment/padding issues preventing the two bind to 
>>>>>>> be placed next to each other?
>>>>>>>
>>>>>>> I ask because someone from the compute side asked me about a 
>>>>>>> problem with their strategy of dealing with overfetch and I 
>>>>>>> suggested pad to size.
>>>>>>>
>>>>>>
>>>>>> Thanks Tvrtko,
>>>>>> I think we shouldn't be needing it. As with VM_BIND VA assignment
>>>>>> is completely pushed to userspace, no padding should be necessary
>>>>>> once the 'start' and 'size' alignment conditions are met.
>>>>>>
>>>>>> I will add some documentation on alignment requirement here.
>>>>>> Generally, 'start' and 'size' should be 4K aligned. But, I think
>>>>>> when we have 64K lmem page sizes (dg2 and xehpsdv), they need to
>>>>>> be 64K aligned.
>>>>>
>>>>> + Matt
>>>>>
>>>>> Align to 64k is enough for all overfetch issues?
>>>>>
>>>>> Apparently compute has a situation where a buffer is received by 
>>>>> one component and another has to apply more alignment to it, to 
>>>>> deal with overfetch. Since they cannot grow the actual BO, would they 
>>>>> instead VM_BIND a scratch area on top? Or perhaps none of this is 
>>>>> a problem on discrete and the original BO should be correctly allocated 
>>>>> to start with.
>>>>>
>>>>> Side question - what about the align to 2MiB mentioned in 
>>>>> i915_vma_insert to avoid mixing 4k and 64k PTEs? That does not 
>>>>> apply to discrete?
>>>>
>>>> Not sure about the overfetch thing, but yeah dg2 & xehpsdv both 
>>>> require a minimum of 64K pages underneath for local memory, and the 
>>>> BO size will also be rounded up accordingly. And yeah the 
>>>> complication arises due to not being able to mix 4K + 64K GTT pages 
>>>> within the same page-table (existed since even gen8). Note that 4K 
>>>> here is what we typically get for system memory.
>>>>
>>>> Originally we had a memory coloring scheme to track the "color" of 
>>>> each page-table, which basically ensures that userspace can't do 
>>>> something nasty like mixing page sizes. The advantage of that scheme 
>>>> is that we would only require 64K GTT alignment and no extra 
>>>> padding, but is perhaps a little complex.
>>>>
>>>> The merged solution is just to align and pad (i.e vma->node.size and 
>>>> not vma->size) out of the vma to 2M, which is dead simple 
>>>> implementation wise, but does potentially waste some GTT space and 
>>>> some of the local memory used for the actual page-table. For the 
>>>> alignment the kernel just validates that the GTT address is aligned 
>>>> to 2M in vma_insert(), and then for the padding it just inflates it 
>>>> to 2M, if userspace hasn't already.
>>>>
>>>> See the kernel-doc for @size: 
>>>> https://dri.freedesktop.org/docs/drm/gpu/driver-uapi.html?#c.drm_i915_gem_create_ext 
>>>>
>>>>
>>>>
>>>
>>> Ok, those requirements (2M VA alignment) will apply to VM_BIND also.
>>> This is unfortunate, but it is not something new enforced by VM_BIND.
>>> The other option is to go with 64K alignment and, in the VM_BIND case, user
>>> must ensure there is no mixing of 64K (lmem) and 4K (smem)
>>> mappings in the same 2M range. But this is not VM_BIND specific
>>> (will apply to soft-pinning in execbuf2 also).
>>>
>>> I don't think we need any VA padding here as with VM_BIND VA is
>>> managed fully by the user. If we enforce VA to be 2M aligned, it
>>> will leave holes (if BOs are smaller than 2M), but nobody is going
>>> to allocate anything from there.
>>
>> Note that we only apply the 2M alignment + padding for local memory 
>> pages, for system memory we don't have/need such restrictions. The VA 
>> padding then importantly prevents userspace from incorrectly (or 
>> maliciously) inserting 4K system memory object in some page-table 
>> operating in 64K GTT mode.
>>
> 
> Thanks Matt.
> I also synced offline with Matt a bit on this.
> We don't need an explicit 'pad_to_size'. The i915 driver is implicitly
> padding the size to a 2M boundary for LMEM BOs, which will apply to
> VM_BIND also.
> The remaining question is whether we enforce 2M VA alignment for
> lmem BOs (just like the legacy execbuff path) on dg2 & xehpsdv, or go with
> just 64K alignment but ensure there is no mixing of 4K and 64K

"Driver is implicitly padding the size to 2MB boundary" - this is the 
backing store?

> mappings in same 2M range. I think we can go with 2M alignment
> requirement for VM_BIND also. So, no new requirements here for VM_BIND.

Are there any considerations here of letting userspace know? 
Presumably the userspace allocator has to know, or it would try to ask for 
impossible addresses.
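
E.g. something like this on the allocator side (sketch, assuming the
2MiB requirement only applies to lmem objects on dg2/xehpsdv):

	/* Hypothetical userspace helper picking the VA alignment for a BO */
	static uint64_t vm_bind_va_alignment(bool is_lmem)
	{
		return is_lmem ? (2ull << 20)	/* 2M for lmem */
			       : (4ull << 10);	/* 4K for smem */
	}

If the kernel never reports this, the allocator has to hardcode it per
platform.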

Regards,

Tvrtko

> 
> I will update the documentation.
> 
> Niranjana
> 
>>>
>>> Niranjana
>>>
>>>>>
>>>>> Regards,
>>>>>
>>>>> Tvrtko
>>>>>
>>>>>>
>>>>>> Niranjana
>>>>>>
>>>>>>> Regards,
>>>>>>>
>>>>>>> Tvrtko
>>>>>>>
>>>>>>>> +
>>>>>>>> +    /**
>>>>>>>> +     * @flags: Supported flags are,
>>>>>>>> +     *
>>>>>>>> +     * I915_GEM_VM_BIND_READONLY:
>>>>>>>> +     * Mapping is read-only.
>>>>>>>> +     *
>>>>>>>> +     * I915_GEM_VM_BIND_CAPTURE:
>>>>>>>> +     * Capture this mapping in the dump upon GPU error.
>>>>>>>> +     */
>>>>>>>> +    __u64 flags;
>>>>>>>> +#define I915_GEM_VM_BIND_READONLY    (1 << 0)
>>>>>>>> +#define I915_GEM_VM_BIND_CAPTURE     (1 << 1)
>>>>>>>> +
>>>>>>>> +    /** @extensions: 0-terminated chain of extensions for this 
>>>>>>>> mapping. */
>>>>>>>> +    __u64 extensions;
>>>>>>>> +};
>>>>>>>> +
>>>>>>>> +/**
>>>>>>>> + * struct drm_i915_gem_vm_unbind - VA to object mapping to unbind.
>>>>>>>> + *
>>>>>>>> + * This structure is passed to VM_UNBIND ioctl and specifies 
>>>>>>>> the GPU virtual
>>>>>>>> + * address (VA) range that should be unbound from the device 
>>>>>>>> page table of the
>>>>>>>> + * specified address space (VM). The specified VA range must 
>>>>>>>> match one of the
>>>>>>>> + * mappings created with the VM_BIND ioctl. TLB is flushed upon 
>>>>>>>> unbind
>>>>>>>> + * completion.
>>>>>>>> + */
>>>>>>>> +struct drm_i915_gem_vm_unbind {
>>>>>>>> +    /** @vm_id: VM (address space) id to bind */
>>>>>>>> +    __u32 vm_id;
>>>>>>>> +
>>>>>>>> +    /** @rsvd: Reserved for future use; must be zero. */
>>>>>>>> +    __u32 rsvd;
>>>>>>>> +
>>>>>>>> +    /** @start: Virtual Address start to unbind */
>>>>>>>> +    __u64 start;
>>>>>>>> +
>>>>>>>> +    /** @length: Length of mapping to unbind */
>>>>>>>> +    __u64 length;
>>>>>>>> +
>>>>>>>> +    /** @flags: reserved for future usage, currently MBZ */
>>>>>>>> +    __u64 flags;
>>>>>>>> +
>>>>>>>> +    /** @extensions: 0-terminated chain of extensions for this 
>>>>>>>> mapping. */
>>>>>>>> +    __u64 extensions;
>>>>>>>> +};
>>>>>>>> +
>>>>>>>> +/**
>>>>>>>> + * struct drm_i915_vm_bind_fence - An input or output fence for 
>>>>>>>> the vm_bind
>>>>>>>> + * or the vm_unbind work.
>>>>>>>> + *
>>>>>>>> + * The vm_bind or vm_unbind async worker will wait for the input 
>>>>>>>> fence to signal
>>>>>>>> + * before starting the binding or unbinding.
>>>>>>>> + *
>>>>>>>> + * The vm_bind or vm_unbind async worker will signal the 
>>>>>>>> returned output fence
>>>>>>>> + * after the completion of binding or unbinding.
>>>>>>>> + */
>>>>>>>> +struct drm_i915_vm_bind_fence {
>>>>>>>> +    /** @handle: User's handle for a drm_syncobj to wait on or 
>>>>>>>> signal. */
>>>>>>>> +    __u32 handle;
>>>>>>>> +
>>>>>>>> +    /**
>>>>>>>> +     * @flags: Supported flags are,
>>>>>>>> +     *
>>>>>>>> +     * I915_VM_BIND_FENCE_WAIT:
>>>>>>>> +     * Wait for the input fence before binding/unbinding
>>>>>>>> +     *
>>>>>>>> +     * I915_VM_BIND_FENCE_SIGNAL:
>>>>>>>> +     * Return bind/unbind completion fence as output
>>>>>>>> +     */
>>>>>>>> +    __u32 flags;
>>>>>>>> +#define I915_VM_BIND_FENCE_WAIT            (1<<0)
>>>>>>>> +#define I915_VM_BIND_FENCE_SIGNAL          (1<<1)
>>>>>>>> +#define __I915_VM_BIND_FENCE_UNKNOWN_FLAGS 
>>>>>>>> (-(I915_VM_BIND_FENCE_SIGNAL << 1))
>>>>>>>> +};
>>>>>>>> +
>>>>>>>> +/**
>>>>>>>> + * struct drm_i915_vm_bind_ext_timeline_fences - Timeline 
>>>>>>>> fences for vm_bind
>>>>>>>> + * and vm_unbind.
>>>>>>>> + *
>>>>>>>> + * This structure describes an array of timeline drm_syncobj 
>>>>>>>> and associated
>>>>>>>> + * points for timeline variants of drm_syncobj. These timeline 
>>>>>>>> 'drm_syncobj's
>>>>>>>> + * can be input or output fences (See struct 
>>>>>>>> drm_i915_vm_bind_fence).
>>>>>>>> + */
>>>>>>>> +struct drm_i915_vm_bind_ext_timeline_fences {
>>>>>>>> +#define I915_VM_BIND_EXT_TIMELINE_FENCES    0
>>>>>>>> +    /** @base: Extension link. See struct i915_user_extension. */
>>>>>>>> +    struct i915_user_extension base;
>>>>>>>> +
>>>>>>>> +    /**
>>>>>>>> +     * @fence_count: Number of elements in the @handles_ptr & 
>>>>>>>> @values_ptr
>>>>>>>> +     * arrays.
>>>>>>>> +     */
>>>>>>>> +    __u64 fence_count;
>>>>>>>> +
>>>>>>>> +    /**
>>>>>>>> +     * @handles_ptr: Pointer to an array of struct 
>>>>>>>> drm_i915_vm_bind_fence
>>>>>>>> +     * of length @fence_count.
>>>>>>>> +     */
>>>>>>>> +    __u64 handles_ptr;
>>>>>>>> +
>>>>>>>> +    /**
>>>>>>>> +     * @values_ptr: Pointer to an array of u64 values of length
>>>>>>>> +     * @fence_count.
>>>>>>>> +     * Values must be 0 for a binary drm_syncobj. A value of 0 
>>>>>>>> for a
>>>>>>>> +     * timeline drm_syncobj is invalid as it turns a 
>>>>>>>> drm_syncobj into a
>>>>>>>> +     * binary one.
>>>>>>>> +     */
>>>>>>>> +    __u64 values_ptr;
>>>>>>>> +};
>>>>>>>> +
>>>>>>>> +/**
>>>>>>>> + * struct drm_i915_vm_bind_user_fence - An input or output user 
>>>>>>>> fence for the
>>>>>>>> + * vm_bind or the vm_unbind work.
>>>>>>>> + *
>>>>>>>> + * The vm_bind or vm_unbind async worker will wait for the input 
>>>>>>>> fence (value at
>>>>>>>> + * @addr to become equal to @val) before starting the binding 
>>>>>>>> or unbinding.
>>>>>>>> + *
>>>>>>>> + * The vm_bind or vm_unbind async worker will signal the output 
>>>>>>>> fence after
>>>>>>>> + * the completion of binding or unbinding by writing @val to 
>>>>>>>> the memory location at
>>>>>>>> + * @addr.
>>>>>>>> + */
>>>>>>>> +struct drm_i915_vm_bind_user_fence {
>>>>>>>> +    /** @addr: User/Memory fence qword aligned process virtual 
>>>>>>>> address */
>>>>>>>> +    __u64 addr;
>>>>>>>> +
>>>>>>>> +    /** @val: User/Memory fence value to be written after bind 
>>>>>>>> completion */
>>>>>>>> +    __u64 val;
>>>>>>>> +
>>>>>>>> +    /**
>>>>>>>> +     * @flags: Supported flags are,
>>>>>>>> +     *
>>>>>>>> +     * I915_VM_BIND_USER_FENCE_WAIT:
>>>>>>>> +     * Wait for the input fence before binding/unbinding
>>>>>>>> +     *
>>>>>>>> +     * I915_VM_BIND_USER_FENCE_SIGNAL:
>>>>>>>> +     * Return bind/unbind completion fence as output
>>>>>>>> +     */
>>>>>>>> +    __u32 flags;
>>>>>>>> +#define I915_VM_BIND_USER_FENCE_WAIT            (1<<0)
>>>>>>>> +#define I915_VM_BIND_USER_FENCE_SIGNAL          (1<<1)
>>>>>>>> +#define __I915_VM_BIND_USER_FENCE_UNKNOWN_FLAGS \
>>>>>>>> +    (-(I915_VM_BIND_USER_FENCE_SIGNAL << 1))
>>>>>>>> +};
>>>>>>>> +
>>>>>>>> +/**
>>>>>>>> + * struct drm_i915_vm_bind_ext_user_fence - User/memory fences 
>>>>>>>> for vm_bind
>>>>>>>> + * and vm_unbind.
>>>>>>>> + *
>>>>>>>> + * These user fences can be input or output fences
>>>>>>>> + * (See struct drm_i915_vm_bind_user_fence).
>>>>>>>> + */
>>>>>>>> +struct drm_i915_vm_bind_ext_user_fence {
>>>>>>>> +#define I915_VM_BIND_EXT_USER_FENCES    1
>>>>>>>> +    /** @base: Extension link. See struct i915_user_extension. */
>>>>>>>> +    struct i915_user_extension base;
>>>>>>>> +
>>>>>>>> +    /** @fence_count: Number of elements in the @user_fence_ptr 
>>>>>>>> array. */
>>>>>>>> +    __u64 fence_count;
>>>>>>>> +
>>>>>>>> +    /**
>>>>>>>> +     * @user_fence_ptr: Pointer to an array of
>>>>>>>> +     * struct drm_i915_vm_bind_user_fence of length @fence_count.
>>>>>>>> +     */
>>>>>>>> +    __u64 user_fence_ptr;
>>>>>>>> +};
>>>>>>>> +
>>>>>>>> +/**
>>>>>>>> + * struct drm_i915_gem_execbuffer_ext_batch_addresses - Array 
>>>>>>>> of batch buffer
>>>>>>>> + * gpu virtual addresses.
>>>>>>>> + *
>>>>>>>> + * In the execbuff ioctl (See struct drm_i915_gem_execbuffer2), 
>>>>>>>> this extension
>>>>>>>> + * must always be appended in VM_BIND mode and it will be 
>>>>>>>> an error to
>>>>>>>> + * append this extension in the older non-VM_BIND mode.
>>>>>>>> + */
>>>>>>>> +struct drm_i915_gem_execbuffer_ext_batch_addresses {
>>>>>>>> +#define DRM_I915_GEM_EXECBUFFER_EXT_BATCH_ADDRESSES    1
>>>>>>>> +    /** @base: Extension link. See struct i915_user_extension. */
>>>>>>>> +    struct i915_user_extension base;
>>>>>>>> +
>>>>>>>> +    /** @count: Number of addresses in the addr array. */
>>>>>>>> +    __u32 count;
>>>>>>>> +
>>>>>>>> +    /** @addr: An array of batch gpu virtual addresses. */
>>>>>>>> +    __u64 addr[0];
>>>>>>>> +};
>>>>>>>> +
>>>>>>>> +/**
>>>>>>>> + * struct drm_i915_gem_execbuffer_ext_user_fence - First level 
>>>>>>>> batch completion
>>>>>>>> + * signaling extension.
>>>>>>>> + *
>>>>>>>> + * This extension allows user to attach a user fence (@addr, 
>>>>>>>> @value pair) to an
>>>>>>>> + * execbuf to be signaled by the command streamer after the 
>>>>>>>> completion of first
>>>>>>>> + * level batch, by writing the @value at specified @addr and 
>>>>>>>> triggering an
>>>>>>>> + * interrupt.
>>>>>>>> + * User can either poll for this user fence to signal or can 
>>>>>>>> also wait on it
>>>>>>>> + * with i915_gem_wait_user_fence ioctl.
>>>>>>>> + * This is very useful for long running contexts where 
>>>>>>>> waiting on dma-fence
>>>>>>>> + * by user (like i915_gem_wait ioctl) is not supported.
>>>>>>>> + */
>>>>>>>> +struct drm_i915_gem_execbuffer_ext_user_fence {
>>>>>>>> +#define DRM_I915_GEM_EXECBUFFER_EXT_USER_FENCE        2
>>>>>>>> +    /** @base: Extension link. See struct i915_user_extension. */
>>>>>>>> +    struct i915_user_extension base;
>>>>>>>> +
>>>>>>>> +    /**
>>>>>>>> +     * @addr: User/Memory fence qword aligned GPU virtual address.
>>>>>>>> +     *
>>>>>>>> +     * Address has to be a valid GPU virtual address at the 
>>>>>>>> time of
>>>>>>>> +     * first level batch completion.
>>>>>>>> +     */
>>>>>>>> +    __u64 addr;
>>>>>>>> +
>>>>>>>> +    /**
>>>>>>>> +     * @value: User/Memory fence Value to be written to above 
>>>>>>>> address
>>>>>>>> +     * after first level batch completes.
>>>>>>>> +     */
>>>>>>>> +    __u64 value;
>>>>>>>> +
>>>>>>>> +    /** @rsvd: Reserved for future extensions, MBZ */
>>>>>>>> +    __u64 rsvd;
>>>>>>>> +};
>>>>>>>> +
>>>>>>>> +/**
>>>>>>>> + * struct drm_i915_gem_create_ext_vm_private - Extension to 
>>>>>>>> make the object
>>>>>>>> + * private to the specified VM.
>>>>>>>> + *
>>>>>>>> + * See struct drm_i915_gem_create_ext.
>>>>>>>> + */
>>>>>>>> +struct drm_i915_gem_create_ext_vm_private {
>>>>>>>> +#define I915_GEM_CREATE_EXT_VM_PRIVATE        2
>>>>>>>> +    /** @base: Extension link. See struct i915_user_extension. */
>>>>>>>> +    struct i915_user_extension base;
>>>>>>>> +
>>>>>>>> +    /** @vm_id: Id of the VM to which the object is private */
>>>>>>>> +    __u32 vm_id;
>>>>>>>> +};
>>>>>>>> +
>>>>>>>> +/**
>>>>>>>> + * struct drm_i915_gem_wait_user_fence - Wait on user/memory 
>>>>>>>> fence.
>>>>>>>> + *
>>>>>>>> + * User/Memory fence can be woken up either by:
>>>>>>>> + *
>>>>>>>> + * 1. GPU context indicated by @ctx_id, or,
>>>>>>>> + * 2. Kernel driver async worker upon I915_UFENCE_WAIT_SOFT.
>>>>>>>> + *    @ctx_id is ignored when this flag is set.
>>>>>>>> + *
>>>>>>>> + * Wakeup condition is,
>>>>>>>> + * ``((*addr & mask) op (value & mask))``
>>>>>>>> + *
>>>>>>>> + * See :ref:`Documentation/driver-api/dma-buf.rst 
>>>>>>>> <indefinite_dma_fences>`
>>>>>>>> + */
>>>>>>>> +struct drm_i915_gem_wait_user_fence {
>>>>>>>> +    /** @extensions: Zero-terminated chain of extensions. */
>>>>>>>> +    __u64 extensions;
>>>>>>>> +
>>>>>>>> +    /** @addr: User/Memory fence address */
>>>>>>>> +    __u64 addr;
>>>>>>>> +
>>>>>>>> +    /** @ctx_id: Id of the Context which will signal the fence. */
>>>>>>>> +    __u32 ctx_id;
>>>>>>>> +
>>>>>>>> +    /** @op: Wakeup condition operator */
>>>>>>>> +    __u16 op;
>>>>>>>> +#define I915_UFENCE_WAIT_EQ      0
>>>>>>>> +#define I915_UFENCE_WAIT_NEQ     1
>>>>>>>> +#define I915_UFENCE_WAIT_GT      2
>>>>>>>> +#define I915_UFENCE_WAIT_GTE     3
>>>>>>>> +#define I915_UFENCE_WAIT_LT      4
>>>>>>>> +#define I915_UFENCE_WAIT_LTE     5
>>>>>>>> +#define I915_UFENCE_WAIT_BEFORE  6
>>>>>>>> +#define I915_UFENCE_WAIT_AFTER   7
>>>>>>>> +
>>>>>>>> +    /**
>>>>>>>> +     * @flags: Supported flags are,
>>>>>>>> +     *
>>>>>>>> +     * I915_UFENCE_WAIT_SOFT:
>>>>>>>> +     *
>>>>>>>> +     * To be woken up by i915 driver async worker (not by GPU).
>>>>>>>> +     *
>>>>>>>> +     * I915_UFENCE_WAIT_ABSTIME:
>>>>>>>> +     *
>>>>>>>> +     * Wait timeout specified as absolute time.
>>>>>>>> +     */
>>>>>>>> +    __u16 flags;
>>>>>>>> +#define I915_UFENCE_WAIT_SOFT    0x1
>>>>>>>> +#define I915_UFENCE_WAIT_ABSTIME 0x2
>>>>>>>> +
>>>>>>>> +    /** @value: Wakeup value */
>>>>>>>> +    __u64 value;
>>>>>>>> +
>>>>>>>> +    /** @mask: Wakeup mask */
>>>>>>>> +    __u64 mask;
>>>>>>>> +#define I915_UFENCE_WAIT_U8     0xffu
>>>>>>>> +#define I915_UFENCE_WAIT_U16    0xffffu
>>>>>>>> +#define I915_UFENCE_WAIT_U32    0xfffffffful
>>>>>>>> +#define I915_UFENCE_WAIT_U64    0xffffffffffffffffull
>>>>>>>> +
>>>>>>>> +    /**
>>>>>>>> +     * @timeout: Wait timeout in nanoseconds.
>>>>>>>> +     *
>>>>>>>> +     * If I915_UFENCE_WAIT_ABSTIME flag is set, then the 
>>>>>>>> timeout is the
>>>>>>>> +     * absolute time in nsec.
>>>>>>>> +     */
>>>>>>>> +    __s64 timeout;
>>>>>>>> +};
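
To make the wait semantics above concrete, here is a minimal userspace
sketch of waiting on a user/memory fence via the proposed ioctl. The
header name and helper are invented for illustration; the struct, flags
and ioctl number come from the RFC above and are not merged uapi.

/* Minimal sketch: block until *fence_va >= expected, woken by the i915
 * async worker (I915_UFENCE_WAIT_SOFT, so ctx_id is ignored), with a
 * relative timeout in nanoseconds.
 */
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <drm/i915_drm.h>
#include "i915_vm_bind.h"	/* RFC-only definitions, not merged uapi */

static int wait_user_fence_soft(int fd, const uint64_t *fence_va,
				uint64_t expected, int64_t timeout_ns)
{
	struct drm_i915_gem_wait_user_fence wait;

	memset(&wait, 0, sizeof(wait));
	wait.addr = (uint64_t)(uintptr_t)fence_va;	/* qword aligned */
	wait.op = I915_UFENCE_WAIT_GTE;			/* *addr >= value */
	wait.flags = I915_UFENCE_WAIT_SOFT;
	wait.value = expected;
	wait.mask = I915_UFENCE_WAIT_U64;		/* compare full qword */
	wait.timeout = timeout_ns;

	return ioctl(fd, DRM_IOCTL_I915_GEM_WAIT_USER_FENCE, &wait);
}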

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [Intel-gfx] [RFC v3 3/3] drm/doc/rfc: VM_BIND uapi definition
  2022-06-10 10:16                 ` Tvrtko Ursulin
@ 2022-06-10 10:32                   ` Matthew Auld
  0 siblings, 0 replies; 121+ messages in thread
From: Matthew Auld @ 2022-06-10 10:32 UTC (permalink / raw)
  To: Tvrtko Ursulin, Niranjana Vishwanathapura
  Cc: intel-gfx, chris.p.wilson, thomas.hellstrom, dri-devel,
	daniel.vetter, christian.koenig

On 10/06/2022 11:16, Tvrtko Ursulin wrote:
> 
> On 09/06/2022 19:53, Niranjana Vishwanathapura wrote:
>> On Thu, Jun 09, 2022 at 09:36:48AM +0100, Matthew Auld wrote:
>>> On 08/06/2022 22:32, Niranjana Vishwanathapura wrote:
>>>> On Wed, Jun 08, 2022 at 10:12:05AM +0100, Matthew Auld wrote:
>>>>> On 08/06/2022 08:17, Tvrtko Ursulin wrote:
>>>>>>
>>>>>> On 07/06/2022 20:37, Niranjana Vishwanathapura wrote:
>>>>>>> On Tue, Jun 07, 2022 at 11:27:14AM +0100, Tvrtko Ursulin wrote:
>>>>>>>>
>>>>>>>> On 17/05/2022 19:32, Niranjana Vishwanathapura wrote:
>>>>>>>>> VM_BIND and related uapi definitions
>>>>>>>>>
>>>>>>>>> v2: Ensure proper kernel-doc formatting with cross references.
>>>>>>>>>     Also add new uapi and documentation as per review comments
>>>>>>>>>     from Daniel.
>>>>>>>>>
>>>>>>>>> Signed-off-by: Niranjana Vishwanathapura 
>>>>>>>>> <niranjana.vishwanathapura@intel.com>
>>>>>>>>> ---
>>>>>>>>>  Documentation/gpu/rfc/i915_vm_bind.h | 399 
>>>>>>>>> +++++++++++++++++++++++++++
>>>>>>>>>  1 file changed, 399 insertions(+)
>>>>>>>>>  create mode 100644 Documentation/gpu/rfc/i915_vm_bind.h
>>>>>>>>>
>>>>>>>>> diff --git a/Documentation/gpu/rfc/i915_vm_bind.h 
>>>>>>>>> b/Documentation/gpu/rfc/i915_vm_bind.h
>>>>>>>>> new file mode 100644
>>>>>>>>> index 000000000000..589c0a009107
>>>>>>>>> --- /dev/null
>>>>>>>>> +++ b/Documentation/gpu/rfc/i915_vm_bind.h
>>>>>>>>> @@ -0,0 +1,399 @@
>>>>>>>>> +/* SPDX-License-Identifier: MIT */
>>>>>>>>> +/*
>>>>>>>>> + * Copyright © 2022 Intel Corporation
>>>>>>>>> + */
>>>>>>>>> +
>>>>>>>>> +/**
>>>>>>>>> + * DOC: I915_PARAM_HAS_VM_BIND
>>>>>>>>> + *
>>>>>>>>> + * VM_BIND feature availability.
>>>>>>>>> + * See typedef drm_i915_getparam_t param.
>>>>>>>>> + */
>>>>>>>>> +#define I915_PARAM_HAS_VM_BIND        57
>>>>>>>>> +
>>>>>>>>> +/**
>>>>>>>>> + * DOC: I915_VM_CREATE_FLAGS_USE_VM_BIND
>>>>>>>>> + *
>>>>>>>>> + * Flag to opt-in for VM_BIND mode of binding during VM creation.
>>>>>>>>> + * See struct drm_i915_gem_vm_control flags.
>>>>>>>>> + *
>>>>>>>>> + * A VM in VM_BIND mode will not support the older execbuff 
>>>>>>>>> mode of binding.
>>>>>>>>> + * In VM_BIND mode, execbuff ioctl will not accept any 
>>>>>>>>> execlist (ie., the
>>>>>>>>> + * &drm_i915_gem_execbuffer2.buffer_count must be 0).
>>>>>>>>> + * Also, &drm_i915_gem_execbuffer2.batch_start_offset and
>>>>>>>>> + * &drm_i915_gem_execbuffer2.batch_len must be 0.
>>>>>>>>> + * DRM_I915_GEM_EXECBUFFER_EXT_BATCH_ADDRESSES extension must 
>>>>>>>>> be provided
>>>>>>>>> + * to pass in the batch buffer addresses.
>>>>>>>>> + *
>>>>>>>>> + * Additionally, I915_EXEC_NO_RELOC, I915_EXEC_HANDLE_LUT and
>>>>>>>>> + * I915_EXEC_BATCH_FIRST of &drm_i915_gem_execbuffer2.flags 
>>>>>>>>> must be 0
>>>>>>>>> + * (not used) in VM_BIND mode. I915_EXEC_USE_EXTENSIONS flag 
>>>>>>>>> must always be
>>>>>>>>> + * set (See struct drm_i915_gem_execbuffer_ext_batch_addresses).
>>>>>>>>> + * The buffers_ptr, buffer_count, batch_start_offset and 
>>>>>>>>> batch_len fields
>>>>>>>>> + * of struct drm_i915_gem_execbuffer2 are also not used and 
>>>>>>>>> must be 0.
>>>>>>>>> + */
>>>>>>>>> +#define I915_VM_CREATE_FLAGS_USE_VM_BIND    (1 << 0)
>>>>>>>>> +
>>>>>>>>> +/**
>>>>>>>>> + * DOC: I915_CONTEXT_CREATE_FLAGS_LONG_RUNNING
>>>>>>>>> + *
>>>>>>>>> + * Flag to declare context as long running.
>>>>>>>>> + * See struct drm_i915_gem_context_create_ext flags.
>>>>>>>>> + *
>>>>>>>>> + * Usage of dma-fence expects that they complete in reasonable 
>>>>>>>>> amount of time.
>>>>>>>>> + * Compute on the other hand can be long running. Hence it is 
>>>>>>>>> not appropriate
>>>>>>>>> + * for compute contexts to export request completion dma-fence 
>>>>>>>>> to user.
>>>>>>>>> + * The dma-fence usage will be limited to in-kernel 
>>>>>>>>> consumption only.
>>>>>>>>> + * Compute contexts need to use user/memory fence.
>>>>>>>>> + *
>>>>>>>>> + * So, long running contexts do not support output fences. Hence,
>>>>>>>>> + * I915_EXEC_FENCE_OUT (See &drm_i915_gem_execbuffer2.flags and
>>>>>>>>> + * I915_EXEC_FENCE_SIGNAL (See &drm_i915_gem_exec_fence.flags) 
>>>>>>>>> are expected
>>>>>>>>> + * not to be used.
>>>>>>>>> + *
>>>>>>>>> + * DRM_I915_GEM_WAIT ioctl call is also not supported for 
>>>>>>>>> objects mapped
>>>>>>>>> + * to long running contexts.
>>>>>>>>> + */
>>>>>>>>> +#define I915_CONTEXT_CREATE_FLAGS_LONG_RUNNING   (1u << 2)
>>>>>>>>> +
>>>>>>>>> +/* VM_BIND related ioctls */
>>>>>>>>> +#define DRM_I915_GEM_VM_BIND        0x3d
>>>>>>>>> +#define DRM_I915_GEM_VM_UNBIND        0x3e
>>>>>>>>> +#define DRM_I915_GEM_WAIT_USER_FENCE    0x3f
>>>>>>>>> +
>>>>>>>>> +#define DRM_IOCTL_I915_GEM_VM_BIND DRM_IOWR(DRM_COMMAND_BASE + 
>>>>>>>>> DRM_I915_GEM_VM_BIND, struct drm_i915_gem_vm_bind)
>>>>>>>>> +#define DRM_IOCTL_I915_GEM_VM_UNBIND DRM_IOWR(DRM_COMMAND_BASE 
>>>>>>>>> + DRM_I915_GEM_VM_UNBIND, struct drm_i915_gem_vm_bind)
>>>>>>>>> +#define DRM_IOCTL_I915_GEM_WAIT_USER_FENCE 
>>>>>>>>> DRM_IOWR(DRM_COMMAND_BASE + DRM_I915_GEM_WAIT_USER_FENCE, 
>>>>>>>>> struct drm_i915_gem_wait_user_fence)
>>>>>>>>> +
>>>>>>>>> +/**
>>>>>>>>> + * struct drm_i915_gem_vm_bind - VA to object mapping to bind.
>>>>>>>>> + *
>>>>>>>>> + * This structure is passed to VM_BIND ioctl and specifies the 
>>>>>>>>> mapping of GPU
>>>>>>>>> + * virtual address (VA) range to the section of an object that 
>>>>>>>>> should be bound
>>>>>>>>> + * in the device page table of the specified address space (VM).
>>>>>>>>> + * The VA range specified must be unique (ie., not currently 
>>>>>>>>> bound) and can
>>>>>>>>> + * be mapped to whole object or a section of the object 
>>>>>>>>> (partial binding).
>>>>>>>>> + * Multiple VA mappings can be created to the same section of 
>>>>>>>>> the object
>>>>>>>>> + * (aliasing).
>>>>>>>>> + */
>>>>>>>>> +struct drm_i915_gem_vm_bind {
>>>>>>>>> +    /** @vm_id: VM (address space) id to bind */
>>>>>>>>> +    __u32 vm_id;
>>>>>>>>> +
>>>>>>>>> +    /** @handle: Object handle */
>>>>>>>>> +    __u32 handle;
>>>>>>>>> +
>>>>>>>>> +    /** @start: Virtual Address start to bind */
>>>>>>>>> +    __u64 start;
>>>>>>>>> +
>>>>>>>>> +    /** @offset: Offset in object to bind */
>>>>>>>>> +    __u64 offset;
>>>>>>>>> +
>>>>>>>>> +    /** @length: Length of mapping to bind */
>>>>>>>>> +    __u64 length;
>>>>>>>>
>>>>>>>> Does it support, or should it, the equivalent of 
>>>>>>>> EXEC_OBJECT_PAD_TO_SIZE? Or if not, is userspace expected to map 
>>>>>>>> the remainder of the space to a dummy object? In which case 
>>>>>>>> would there be any alignment/padding issues preventing the two 
>>>>>>>> bind to be placed next to each other?
>>>>>>>>
>>>>>>>> I ask because someone from the compute side asked me about a 
>>>>>>>> problem with their strategy of dealing with overfetch and I 
>>>>>>>> suggested pad to size.
>>>>>>>>
>>>>>>>
>>>>>>> Thanks Tvrtko,
>>>>>>> I think we shouldn't be needing it. As with VM_BIND VA assignment
>>>>>>> is completely pushed to userspace, no padding should be necessary
>>>>>>> once the 'start' and 'size' alignment conditions are met.
>>>>>>>
>>>>>>> I will add some documentation on alignment requirement here.
>>>>>>> Generally, 'start' and 'size' should be 4K aligned. But, I think
>>>>>>> when we have 64K lmem page sizes (dg2 and xehpsdv), they need to
>>>>>>> be 64K aligned.
>>>>>>
>>>>>> + Matt
>>>>>>
>>>>>> Align to 64k is enough for all overfetch issues?
>>>>>>
>>>>>> Apparently compute has a situation where a buffer is received by 
>>>>>> one component and another has to apply more alignment to it, to 
>>>>>> deal with overfetch. Since they cannot grow the actual BO if they 
>>>>>> wanted to VM_BIND a scratch area on top? Or perhaps none of this 
>>>>>> is a problem on discrete and original BO should be correctly 
>>>>>> allocated to start with.
>>>>>>
>>>>>> Side question - what about the align to 2MiB mentioned in 
>>>>>> i915_vma_insert to avoid mixing 4k and 64k PTEs? That does not 
>>>>>> apply to discrete?
>>>>>
>>>>> Not sure about the overfetch thing, but yeah dg2 & xehpsdv both 
>>>>> require a minimum of 64K pages underneath for local memory, and the 
>>>>> BO size will also be rounded up accordingly. And yeah the 
>>>>> complication arises due to not being able to mix 4K + 64K GTT pages 
>>>>> within the same page-table (existed since even gen8). Note that 4K 
>>>>> here is what we typically get for system memory.
>>>>>
>>>>> Originally we had a memory coloring scheme to track the "color" of 
>>>>> each page-table, which basically ensures that userspace can't do 
>>>>> something nasty like mixing page sizes. The advantage of that 
>>>>> scheme is that we would only require 64K GTT alignment and no extra 
>>>>> padding, but is perhaps a little complex.
>>>>>
>>>>> The merged solution is just to align and pad out the vma (i.e 
>>>>> vma->node.size and not vma->size) to 2M, which is dead simple 
>>>>> implementation-wise, but does potentially waste some GTT space and 
>>>>> some of the local memory used for the actual page-table. For the 
>>>>> alignment the kernel just validates that the GTT address is aligned 
>>>>> to 2M in vma_insert(), and then for the padding it just inflates it 
>>>>> to 2M, if userspace hasn't already.
>>>>>
>>>>> See the kernel-doc for @size: 
>>>>> https://dri.freedesktop.org/docs/drm/gpu/driver-uapi.html?#c.drm_i915_gem_create_ext 
>>>>>
>>>>>
>>>>>
>>>>
>>>> Ok, those requirements (2M VA alignment) will apply to VM_BIND also.
>>>> This is unfortunate, but it is not something new enforced by VM_BIND.
>>>> The other option is to go with 64K alignment and in the VM_BIND case, user
>>>> must ensure there is no mixing of 64K (lmem) and 4k (smem)
>>>> mappings in the same 2M range. But this is not VM_BIND specific
>>>> (will apply to soft-pinning in execbuf2 also).
>>>>
>>>> I don't think we need any VA padding here as with VM_BIND VA is
>>>> managed fully by the user. If we enforce VA to be 2M aligned, it
>>>> will leave holes (if BOs are smaller than 2M), but nobody is going
>>>> to allocate anything from there.
>>>
>>> Note that we only apply the 2M alignment + padding for local memory 
>>> pages, for system memory we don't have/need such restrictions. The VA 
>>> padding then importantly prevents userspace from incorrectly (or 
>>> maliciously) inserting 4K system memory object in some page-table 
>>> operating in 64K GTT mode.
>>>
>>
>> Thanks Matt.
>>I also synced offline with Matt a bit on this.
>>We don't need an explicit 'pad_to_size'. The i915 driver is implicitly
>>padding the size to a 2M boundary for LMEM BOs, which will apply to
>>VM_BIND also.
>> The remaining question is whether we enforce 2M VA alignment for
>> lmem BOs (just like legacy execbuff path) on dg2 & xehpsdv, or go with
>> just 64K alignment but ensure there is no mixing of 4K and 64K
> 
> "Driver is implicitly padding the size to 2MB boundary" - this is the 
> backing store?

Just the GTT space, i.e vma->node.size. Backing store just needs to use 
64K pages.

> 
>> mappings in same 2M range. I think we can go with 2M alignment
>> requirement for VM_BIND also. So, no new requirements here for VM_BIND.
> 
> Are there any considerations here of letting the userspace know? 
> Presumably userspace allocator has to know or it would try to ask for 
> impossible addresses.

It's the existing behaviour with execbuf, so I assume userspace must 
already get this right, on platforms like dg2.
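
To illustrate the rule being discussed, a userspace VA allocator on
dg2/xehpsdv could simply bump its alignment (and node size) for
LMEM-backed ranges. This is only a sketch; the constants and helper
names are invented here.

/* Pad the GTT node, not the BO: mirrors the kernel inflating
 * vma->node.size to 2M for LMEM while the backing pages stay 64K,
 * so 4K (smem) and 64K (lmem) PTEs never share a page-table.
 */
#include <stdint.h>

#define SZ_4K	0x1000ull
#define SZ_2M	0x200000ull

static inline uint64_t align_up(uint64_t x, uint64_t a)
{
	return (x + a - 1) & ~(a - 1);
}

static uint64_t alloc_va(uint64_t *next_free_va, uint64_t size, int is_lmem)
{
	uint64_t align = is_lmem ? SZ_2M : SZ_4K;
	uint64_t start = align_up(*next_free_va, align);

	*next_free_va = start + align_up(size, align);
	return start;
}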

> 
> Regards,
> 
> Tvrtko
> 
>>
>> I will update the documentation.
>>
>> Niranjana
>>
>>>>
>>>> Niranjana
>>>>
>>>>>>
>>>>>> Regards,
>>>>>>
>>>>>> Tvrtko
>>>>>>
>>>>>>>
>>>>>>> Niranjana
>>>>>>>
>>>>>>>> Regards,
>>>>>>>>
>>>>>>>> Tvrtko
>>>>>>>>
>>>>>>>>> +
>>>>>>>>> +    /**
>>>>>>>>> +     * @flags: Supported flags are,
>>>>>>>>> +     *
>>>>>>>>> +     * I915_GEM_VM_BIND_READONLY:
>>>>>>>>> +     * Mapping is read-only.
>>>>>>>>> +     *
>>>>>>>>> +     * I915_GEM_VM_BIND_CAPTURE:
>>>>>>>>> +     * Capture this mapping in the dump upon GPU error.
>>>>>>>>> +     */
>>>>>>>>> +    __u64 flags;
>>>>>>>>> +#define I915_GEM_VM_BIND_READONLY    (1 << 0)
>>>>>>>>> +#define I915_GEM_VM_BIND_CAPTURE     (1 << 1)
>>>>>>>>> +
>>>>>>>>> +    /** @extensions: 0-terminated chain of extensions for this 
>>>>>>>>> mapping. */
>>>>>>>>> +    __u64 extensions;
>>>>>>>>> +};
>>>>>>>>> +
>>>>>>>>> +/**
>>>>>>>>> + * struct drm_i915_gem_vm_unbind - VA to object mapping to 
>>>>>>>>> unbind.
>>>>>>>>> + *
>>>>>>>>> + * This structure is passed to VM_UNBIND ioctl and specifies 
>>>>>>>>> the GPU virtual
>>>>>>>>> + * address (VA) range that should be unbound from the device 
>>>>>>>>> page table of the
>>>>>>>>> + * specified address space (VM). The specified VA range must 
>>>>>>>>> match one of the
>>>>>>>>> + * mappings created with the VM_BIND ioctl. TLB is flushed 
>>>>>>>>> upon unbind
>>>>>>>>> + * completion.
>>>>>>>>> + */
>>>>>>>>> +struct drm_i915_gem_vm_unbind {
>>>>>>>>> +    /** @vm_id: VM (address space) id to bind */
>>>>>>>>> +    __u32 vm_id;
>>>>>>>>> +
>>>>>>>>> +    /** @rsvd: Reserved for future use; must be zero. */
>>>>>>>>> +    __u32 rsvd;
>>>>>>>>> +
>>>>>>>>> +    /** @start: Virtual Address start to unbind */
>>>>>>>>> +    __u64 start;
>>>>>>>>> +
>>>>>>>>> +    /** @length: Length of mapping to unbind */
>>>>>>>>> +    __u64 length;
>>>>>>>>> +
>>>>>>>>> +    /** @flags: reserved for future usage, currently MBZ */
>>>>>>>>> +    __u64 flags;
>>>>>>>>> +
>>>>>>>>> +    /** @extensions: 0-terminated chain of extensions for this 
>>>>>>>>> mapping. */
>>>>>>>>> +    __u64 extensions;
>>>>>>>>> +};
>>>>>>>>> +
>>>>>>>>> +/**
>>>>>>>>> + * struct drm_i915_vm_bind_fence - An input or output fence 
>>>>>>>>> for the vm_bind
>>>>>>>>> + * or the vm_unbind work.
>>>>>>>>> + *
>>>>>>>>> + * The vm_bind or vm_unbind async worker will wait for input 
>>>>>>>>> fence to signal
>>>>>>>>> + * before starting the binding or unbinding.
>>>>>>>>> + *
>>>>>>>>> + * The vm_bind or vm_unbind async worker will signal the 
>>>>>>>>> returned output fence
>>>>>>>>> + * after the completion of binding or unbinding.
>>>>>>>>> + */
>>>>>>>>> +struct drm_i915_vm_bind_fence {
>>>>>>>>> +    /** @handle: User's handle for a drm_syncobj to wait on or 
>>>>>>>>> signal. */
>>>>>>>>> +    __u32 handle;
>>>>>>>>> +
>>>>>>>>> +    /**
>>>>>>>>> +     * @flags: Supported flags are,
>>>>>>>>> +     *
>>>>>>>>> +     * I915_VM_BIND_FENCE_WAIT:
>>>>>>>>> +     * Wait for the input fence before binding/unbinding
>>>>>>>>> +     *
>>>>>>>>> +     * I915_VM_BIND_FENCE_SIGNAL:
>>>>>>>>> +     * Return bind/unbind completion fence as output
>>>>>>>>> +     */
>>>>>>>>> +    __u32 flags;
>>>>>>>>> +#define I915_VM_BIND_FENCE_WAIT            (1<<0)
>>>>>>>>> +#define I915_VM_BIND_FENCE_SIGNAL          (1<<1)
>>>>>>>>> +#define __I915_VM_BIND_FENCE_UNKNOWN_FLAGS 
>>>>>>>>> (-(I915_VM_BIND_FENCE_SIGNAL << 1))
>>>>>>>>> +};
>>>>>>>>> +
>>>>>>>>> +/**
>>>>>>>>> + * struct drm_i915_vm_bind_ext_timeline_fences - Timeline 
>>>>>>>>> fences for vm_bind
>>>>>>>>> + * and vm_unbind.
>>>>>>>>> + *
>>>>>>>>> + * This structure describes an array of timeline drm_syncobj 
>>>>>>>>> and associated
>>>>>>>>> + * points for timeline variants of drm_syncobj. These timeline 
>>>>>>>>> 'drm_syncobj's
>>>>>>>>> + * can be input or output fences (See struct 
>>>>>>>>> drm_i915_vm_bind_fence).
>>>>>>>>> + */
>>>>>>>>> +struct drm_i915_vm_bind_ext_timeline_fences {
>>>>>>>>> +#define I915_VM_BIND_EXT_timeline_FENCES    0
>>>>>>>>> +    /** @base: Extension link. See struct i915_user_extension. */
>>>>>>>>> +    struct i915_user_extension base;
>>>>>>>>> +
>>>>>>>>> +    /**
>>>>>>>>> +     * @fence_count: Number of elements in the @handles_ptr & 
>>>>>>>>> @value_ptr
>>>>>>>>> +     * arrays.
>>>>>>>>> +     */
>>>>>>>>> +    __u64 fence_count;
>>>>>>>>> +
>>>>>>>>> +    /**
>>>>>>>>> +     * @handles_ptr: Pointer to an array of struct 
>>>>>>>>> drm_i915_vm_bind_fence
>>>>>>>>> +     * of length @fence_count.
>>>>>>>>> +     */
>>>>>>>>> +    __u64 handles_ptr;
>>>>>>>>> +
>>>>>>>>> +    /**
>>>>>>>>> +     * @values_ptr: Pointer to an array of u64 values of length
>>>>>>>>> +     * @fence_count.
>>>>>>>>> +     * Values must be 0 for a binary drm_syncobj. A value of 0 
>>>>>>>>> for a
>>>>>>>>> +     * timeline drm_syncobj is invalid as it turns a 
>>>>>>>>> drm_syncobj into a
>>>>>>>>> +     * binary one.
>>>>>>>>> +     */
>>>>>>>>> +    __u64 values_ptr;
>>>>>>>>> +};
>>>>>>>>> +
>>>>>>>>> +/**
>>>>>>>>> + * struct drm_i915_vm_bind_user_fence - An input or output 
>>>>>>>>> user fence for the
>>>>>>>>> + * vm_bind or the vm_unbind work.
>>>>>>>>> + *
>>>>>>>>> + * The vm_bind or vm_unbind async worker will wait for the 
>>>>>>>>> input fence (value at
>>>>>>>>> + * @addr to become equal to @val) before starting the binding 
>>>>>>>>> or unbinding.
>>>>>>>>> + *
>>>>>>>>> + * The vm_bind or vm_unbind async worker will signal the 
>>>>>>>>> output fence after
>>>>>>>>> + * the completion of binding or unbinding by writing @val to 
>>>>>>>>> memory location at
>>>>>>>>> + * @addr
>>>>>>>>> + */
>>>>>>>>> +struct drm_i915_vm_bind_user_fence {
>>>>>>>>> +    /** @addr: User/Memory fence qword aligned process virtual 
>>>>>>>>> address */
>>>>>>>>> +    __u64 addr;
>>>>>>>>> +
>>>>>>>>> +    /** @val: User/Memory fence value to be written after bind 
>>>>>>>>> completion */
>>>>>>>>> +    __u64 val;
>>>>>>>>> +
>>>>>>>>> +    /**
>>>>>>>>> +     * @flags: Supported flags are,
>>>>>>>>> +     *
>>>>>>>>> +     * I915_VM_BIND_USER_FENCE_WAIT:
>>>>>>>>> +     * Wait for the input fence before binding/unbinding
>>>>>>>>> +     *
>>>>>>>>> +     * I915_VM_BIND_USER_FENCE_SIGNAL:
>>>>>>>>> +     * Return bind/unbind completion fence as output
>>>>>>>>> +     */
>>>>>>>>> +    __u32 flags;
>>>>>>>>> +#define I915_VM_BIND_USER_FENCE_WAIT            (1<<0)
>>>>>>>>> +#define I915_VM_BIND_USER_FENCE_SIGNAL          (1<<1)
>>>>>>>>> +#define __I915_VM_BIND_USER_FENCE_UNKNOWN_FLAGS \
>>>>>>>>> +    (-(I915_VM_BIND_USER_FENCE_SIGNAL << 1))
>>>>>>>>> +};
>>>>>>>>> +
>>>>>>>>> +/**
>>>>>>>>> + * struct drm_i915_vm_bind_ext_user_fence - User/memory fences 
>>>>>>>>> for vm_bind
>>>>>>>>> + * and vm_unbind.
>>>>>>>>> + *
>>>>>>>>> + * These user fences can be input or output fences
>>>>>>>>> + * (See struct drm_i915_vm_bind_user_fence).
>>>>>>>>> + */
>>>>>>>>> +struct drm_i915_vm_bind_ext_user_fence {
>>>>>>>>> +#define I915_VM_BIND_EXT_USER_FENCES    1
>>>>>>>>> +    /** @base: Extension link. See struct i915_user_extension. */
>>>>>>>>> +    struct i915_user_extension base;
>>>>>>>>> +
>>>>>>>>> +    /** @fence_count: Number of elements in the 
>>>>>>>>> @user_fence_ptr array. */
>>>>>>>>> +    __u64 fence_count;
>>>>>>>>> +
>>>>>>>>> +    /**
>>>>>>>>> +     * @user_fence_ptr: Pointer to an array of
>>>>>>>>> +     * struct drm_i915_vm_bind_user_fence of length @fence_count.
>>>>>>>>> +     */
>>>>>>>>> +    __u64 user_fence_ptr;
>>>>>>>>> +};
>>>>>>>>> +
>>>>>>>>> +/**
>>>>>>>>> + * struct drm_i915_gem_execbuffer_ext_batch_addresses - Array 
>>>>>>>>> of batch buffer
>>>>>>>>> + * gpu virtual addresses.
>>>>>>>>> + *
>>>>>>>>> + * In the execbuff ioctl (See struct 
>>>>>>>>> drm_i915_gem_execbuffer2), this extension
>>>>>>>>> + * must always be appended in the VM_BIND mode and it will be 
>>>>>>>>> an error to
>>>>>>>>> + * append this extension in older non-VM_BIND mode.
>>>>>>>>> + */
>>>>>>>>> +struct drm_i915_gem_execbuffer_ext_batch_addresses {
>>>>>>>>> +#define DRM_I915_GEM_EXECBUFFER_EXT_BATCH_ADDRESSES    1
>>>>>>>>> +    /** @base: Extension link. See struct i915_user_extension. */
>>>>>>>>> +    struct i915_user_extension base;
>>>>>>>>> +
>>>>>>>>> +    /** @count: Number of addresses in the addr array. */
>>>>>>>>> +    __u32 count;
>>>>>>>>> +
>>>>>>>>> +    /** @addr: An array of batch gpu virtual addresses. */
>>>>>>>>> +    __u64 addr[0];
>>>>>>>>> +};
>>>>>>>>> +
>>>>>>>>> +/**
>>>>>>>>> + * struct drm_i915_gem_execbuffer_ext_user_fence - First level 
>>>>>>>>> batch completion
>>>>>>>>> + * signaling extension.
>>>>>>>>> + *
>>>>>>>>> + * This extension allows user to attach a user fence (@addr, 
>>>>>>>>> @value pair) to an
>>>>>>>>> + * execbuf to be signaled by the command streamer after the 
>>>>>>>>> completion of first
>>>>>>>>> + * level batch, by writing the @value at specified @addr and 
>>>>>>>>> triggering an
>>>>>>>>> + * interrupt.
>>>>>>>>> + * User can either poll for this user fence to signal or can 
>>>>>>>>> also wait on it
>>>>>>>>> + * with i915_gem_wait_user_fence ioctl.
>>>>>>>>> + * This is very useful for long running contexts where 
>>>>>>>>> waiting on dma-fence
>>>>>>>>> + * by user (like i915_gem_wait ioctl) is not supported.
>>>>>>>>> + */
>>>>>>>>> +struct drm_i915_gem_execbuffer_ext_user_fence {
>>>>>>>>> +#define DRM_I915_GEM_EXECBUFFER_EXT_USER_FENCE        2
>>>>>>>>> +    /** @base: Extension link. See struct i915_user_extension. */
>>>>>>>>> +    struct i915_user_extension base;
>>>>>>>>> +
>>>>>>>>> +    /**
>>>>>>>>> +     * @addr: User/Memory fence qword aligned GPU virtual 
>>>>>>>>> address.
>>>>>>>>> +     *
>>>>>>>>> +     * Address has to be a valid GPU virtual address at the 
>>>>>>>>> time of
>>>>>>>>> +     * first level batch completion.
>>>>>>>>> +     */
>>>>>>>>> +    __u64 addr;
>>>>>>>>> +
>>>>>>>>> +    /**
>>>>>>>>> +     * @value: User/Memory fence Value to be written to above 
>>>>>>>>> address
>>>>>>>>> +     * after first level batch completes.
>>>>>>>>> +     */
>>>>>>>>> +    __u64 value;
>>>>>>>>> +
>>>>>>>>> +    /** @rsvd: Reserved for future extensions, MBZ */
>>>>>>>>> +    __u64 rsvd;
>>>>>>>>> +};
>>>>>>>>> +
>>>>>>>>> +/**
>>>>>>>>> + * struct drm_i915_gem_create_ext_vm_private - Extension to 
>>>>>>>>> make the object
>>>>>>>>> + * private to the specified VM.
>>>>>>>>> + *
>>>>>>>>> + * See struct drm_i915_gem_create_ext.
>>>>>>>>> + */
>>>>>>>>> +struct drm_i915_gem_create_ext_vm_private {
>>>>>>>>> +#define I915_GEM_CREATE_EXT_VM_PRIVATE        2
>>>>>>>>> +    /** @base: Extension link. See struct i915_user_extension. */
>>>>>>>>> +    struct i915_user_extension base;
>>>>>>>>> +
>>>>>>>>> +    /** @vm_id: Id of the VM to which the object is private */
>>>>>>>>> +    __u32 vm_id;
>>>>>>>>> +};
>>>>>>>>> +
>>>>>>>>> +/**
>>>>>>>>> + * struct drm_i915_gem_wait_user_fence - Wait on user/memory 
>>>>>>>>> fence.
>>>>>>>>> + *
>>>>>>>>> + * User/Memory fence can be woken up either by:
>>>>>>>>> + *
>>>>>>>>> + * 1. GPU context indicated by @ctx_id, or,
>>>>>>>>> + * 2. Kernel driver async worker upon I915_UFENCE_WAIT_SOFT.
>>>>>>>>> + *    @ctx_id is ignored when this flag is set.
>>>>>>>>> + *
>>>>>>>>> + * Wakeup condition is,
>>>>>>>>> + * ``((*addr & mask) op (value & mask))``
>>>>>>>>> + *
>>>>>>>>> + * See :ref:`Documentation/driver-api/dma-buf.rst 
>>>>>>>>> <indefinite_dma_fences>`
>>>>>>>>> + */
>>>>>>>>> +struct drm_i915_gem_wait_user_fence {
>>>>>>>>> +    /** @extensions: Zero-terminated chain of extensions. */
>>>>>>>>> +    __u64 extensions;
>>>>>>>>> +
>>>>>>>>> +    /** @addr: User/Memory fence address */
>>>>>>>>> +    __u64 addr;
>>>>>>>>> +
>>>>>>>>> +    /** @ctx_id: Id of the Context which will signal the 
>>>>>>>>> fence. */
>>>>>>>>> +    __u32 ctx_id;
>>>>>>>>> +
>>>>>>>>> +    /** @op: Wakeup condition operator */
>>>>>>>>> +    __u16 op;
>>>>>>>>> +#define I915_UFENCE_WAIT_EQ      0
>>>>>>>>> +#define I915_UFENCE_WAIT_NEQ     1
>>>>>>>>> +#define I915_UFENCE_WAIT_GT      2
>>>>>>>>> +#define I915_UFENCE_WAIT_GTE     3
>>>>>>>>> +#define I915_UFENCE_WAIT_LT      4
>>>>>>>>> +#define I915_UFENCE_WAIT_LTE     5
>>>>>>>>> +#define I915_UFENCE_WAIT_BEFORE  6
>>>>>>>>> +#define I915_UFENCE_WAIT_AFTER   7
>>>>>>>>> +
>>>>>>>>> +    /**
>>>>>>>>> +     * @flags: Supported flags are,
>>>>>>>>> +     *
>>>>>>>>> +     * I915_UFENCE_WAIT_SOFT:
>>>>>>>>> +     *
>>>>>>>>> +     * To be woken up by i915 driver async worker (not by GPU).
>>>>>>>>> +     *
>>>>>>>>> +     * I915_UFENCE_WAIT_ABSTIME:
>>>>>>>>> +     *
>>>>>>>>> +     * Wait timeout specified as absolute time.
>>>>>>>>> +     */
>>>>>>>>> +    __u16 flags;
>>>>>>>>> +#define I915_UFENCE_WAIT_SOFT    0x1
>>>>>>>>> +#define I915_UFENCE_WAIT_ABSTIME 0x2
>>>>>>>>> +
>>>>>>>>> +    /** @value: Wakeup value */
>>>>>>>>> +    __u64 value;
>>>>>>>>> +
>>>>>>>>> +    /** @mask: Wakeup mask */
>>>>>>>>> +    __u64 mask;
>>>>>>>>> +#define I915_UFENCE_WAIT_U8     0xffu
>>>>>>>>> +#define I915_UFENCE_WAIT_U16    0xffffu
>>>>>>>>> +#define I915_UFENCE_WAIT_U32    0xfffffffful
>>>>>>>>> +#define I915_UFENCE_WAIT_U64    0xffffffffffffffffull
>>>>>>>>> +
>>>>>>>>> +    /**
>>>>>>>>> +     * @timeout: Wait timeout in nanoseconds.
>>>>>>>>> +     *
>>>>>>>>> +     * If I915_UFENCE_WAIT_ABSTIME flag is set, then the 
>>>>>>>>> timeout is the
>>>>>>>>> +     * absolute time in nsec.
>>>>>>>>> +     */
>>>>>>>>> +    __s64 timeout;
>>>>>>>>> +};
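
For illustration, chaining the timeline-fences extension quoted above
into a bind call might look like the sketch below. The helper is
invented; the structs and ioctl come from the RFC header and are not
merged uapi. 'syncobj' is assumed to be a timeline drm_syncobj handle.

/* Bind [va, va + size) of 'handle' in 'vm_id' and ask the async bind
 * worker to signal timeline point 'point' on 'syncobj' on completion.
 * The point must be non-zero: a value of 0 would turn the timeline
 * syncobj into a binary one.
 */
#include <stdint.h>
#include <sys/ioctl.h>
#include <drm/i915_drm.h>
#include "i915_vm_bind.h"	/* RFC-only definitions, not merged uapi */

static int vm_bind_signal_timeline(int fd, uint32_t vm_id, uint32_t handle,
				   uint64_t va, uint64_t size,
				   uint32_t syncobj, uint64_t point)
{
	struct drm_i915_vm_bind_fence fence = {
		.handle = syncobj,
		.flags = I915_VM_BIND_FENCE_SIGNAL,	/* out fence */
	};
	uint64_t value = point;
	struct drm_i915_vm_bind_ext_timeline_fences ext = {
		.base.name = I915_VM_BIND_EXT_timeline_FENCES,
		.fence_count = 1,
		.handles_ptr = (uint64_t)(uintptr_t)&fence,
		.values_ptr = (uint64_t)(uintptr_t)&value,
	};
	struct drm_i915_gem_vm_bind bind = {
		.vm_id = vm_id,
		.handle = handle,
		.start = va,
		.length = size,
		.extensions = (uint64_t)(uintptr_t)&ext,
	};

	return ioctl(fd, DRM_IOCTL_I915_GEM_VM_BIND, &bind);
}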

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [Intel-gfx] [RFC v3 1/3] drm/doc/rfc: VM_BIND feature design document
  2022-06-10  8:18                                     ` Lionel Landwerlin
@ 2022-06-10 17:42                                       ` Niranjana Vishwanathapura
  -1 siblings, 0 replies; 121+ messages in thread
From: Niranjana Vishwanathapura @ 2022-06-10 17:42 UTC (permalink / raw)
  To: Lionel Landwerlin
  Cc: Intel GFX, Maling list - DRI developers, Thomas Hellstrom,
	Chris Wilson, Jason Ekstrand, Daniel Vetter,
	Christian König

On Fri, Jun 10, 2022 at 11:18:14AM +0300, Lionel Landwerlin wrote:
>On 10/06/2022 10:54, Niranjana Vishwanathapura wrote:
>>On Fri, Jun 10, 2022 at 09:53:24AM +0300, Lionel Landwerlin wrote:
>>>On 09/06/2022 22:31, Niranjana Vishwanathapura wrote:
>>>>On Thu, Jun 09, 2022 at 05:49:09PM +0300, Lionel Landwerlin wrote:
>>>>>  On 09/06/2022 00:55, Jason Ekstrand wrote:
>>>>>
>>>>>    On Wed, Jun 8, 2022 at 4:44 PM Niranjana Vishwanathapura
>>>>>    <niranjana.vishwanathapura@intel.com> wrote:
>>>>>
>>>>>      On Wed, Jun 08, 2022 at 08:33:25AM +0100, Tvrtko Ursulin wrote:
>>>>>      >
>>>>>      >
>>>>>      >On 07/06/2022 22:32, Niranjana Vishwanathapura wrote:
>>>>>      >>On Tue, Jun 07, 2022 at 11:18:11AM -0700, Niranjana 
>>>>>Vishwanathapura
>>>>>      wrote:
>>>>>      >>>On Tue, Jun 07, 2022 at 12:12:03PM -0500, Jason 
>>>>>Ekstrand wrote:
>>>>>      >>>> On Fri, Jun 3, 2022 at 6:52 PM Niranjana Vishwanathapura
>>>>>      >>>> <niranjana.vishwanathapura@intel.com> wrote:
>>>>>      >>>>
>>>>>      >>>>   On Fri, Jun 03, 2022 at 10:20:25AM +0300, Lionel 
>>>>>Landwerlin
>>>>>      wrote:
>>>>>      >>>>   >   On 02/06/2022 23:35, Jason Ekstrand wrote:
>>>>>      >>>>   >
>>>>>      >>>>   >     On Thu, Jun 2, 2022 at 3:11 PM Niranjana 
>>>>>Vishwanathapura
>>>>>      >>>>   > <niranjana.vishwanathapura@intel.com> wrote:
>>>>>      >>>>   >
>>>>>      >>>>   >       On Wed, Jun 01, 2022 at 01:28:36PM -0700, Matthew
>>>>>      >>>>Brost wrote:
>>>>>      >>>>   >       >On Wed, Jun 01, 2022 at 05:25:49PM +0300, Lionel
>>>>>      Landwerlin
>>>>>      >>>>   wrote:
>>>>>      >>>>   >       >> On 17/05/2022 21:32, Niranjana Vishwanathapura
>>>>>      wrote:
>>>>>      >>>>   >       >> > +VM_BIND/UNBIND ioctl will immediately start
>>>>>      >>>>   binding/unbinding
>>>>>      >>>>   >       the mapping in an
>>>>>      >>>>   >       >> > +async worker. The binding and 
>>>>>unbinding will
>>>>>      >>>>work like a
>>>>>      >>>>   special
>>>>>      >>>>   >       GPU engine.
>>>>>      >>>>   >       >> > +The binding and unbinding operations are
>>>>>      serialized and
>>>>>      >>>>   will
>>>>>      >>>>   >       wait on specified
>>>>>      >>>>   >       >> > +input fences before the operation 
>>>>>and will signal
>>>>>      the
>>>>>      >>>>   output
>>>>>      >>>>   >       fences upon the
>>>>>      >>>>   >       >> > +completion of the operation. Due to
>>>>>      serialization,
>>>>>      >>>>   completion of
>>>>>      >>>>   >       an operation
>>>>>      >>>>   >       >> > +will also indicate that all 
>>>>>previous operations
>>>>>      >>>>are also
>>>>>      >>>>   >       complete.
>>>>>      >>>>   >       >>
>>>>>      >>>>   >       >> I guess we should avoid saying "will 
>>>>>immediately
>>>>>      start
>>>>>      >>>>   >       binding/unbinding" if
>>>>>      >>>>   >       >> there are fences involved.
>>>>>      >>>>   >       >>
>>>>>      >>>>   >       >> And the fact that it's happening in an async
>>>>>      >>>>worker seem to
>>>>>      >>>>   imply
>>>>>      >>>>   >       it's not
>>>>>      >>>>   >       >> immediate.
>>>>>      >>>>   >       >>
>>>>>      >>>>   >
>>>>>      >>>>   >       Ok, will fix.
>>>>>      >>>>   >       This was added because in earlier design 
>>>>>binding was
>>>>>      deferred
>>>>>      >>>>   until
>>>>>      >>>>   >       next execbuff.
>>>>>      >>>>   >       But now it is non-deferred (immediate in 
>>>>>that sense).
>>>>>      >>>>But yah,
>>>>>      >>>>   this is
>>>>>      >>>>   >       confusing
>>>>>      >>>>   >       and will fix it.
>>>>>      >>>>   >
>>>>>      >>>>   >       >>
>>>>>      >>>>   >       >> I have a question on the behavior of the bind
>>>>>      >>>>operation when
>>>>>      >>>>   no
>>>>>      >>>>   >       input fence
>>>>>      >>>>   >       >> is provided. Let say I do :
>>>>>      >>>>   >       >>
>>>>>      >>>>   >       >> VM_BIND (out_fence=fence1)
>>>>>      >>>>   >       >>
>>>>>      >>>>   >       >> VM_BIND (out_fence=fence2)
>>>>>      >>>>   >       >>
>>>>>      >>>>   >       >> VM_BIND (out_fence=fence3)
>>>>>      >>>>   >       >>
>>>>>      >>>>   >       >>
>>>>>      >>>>   >       >> In what order are the fences going to 
>>>>>be signaled?
>>>>>      >>>>   >       >>
>>>>>      >>>>   >       >> In the order of VM_BIND ioctls? Or out 
>>>>>of order?
>>>>>      >>>>   >       >>
>>>>>      >>>>   >       >> Because you wrote "serialized" I assume 
>>>>>it's: in
>>>>>      order
>>>>>      >>>>   >       >>
>>>>>      >>>>   >
>>>>>      >>>>   >       Yes, in the order of VM_BIND/UNBIND 
>>>>>ioctls. Note that
>>>>>      >>>>bind and
>>>>>      >>>>   unbind
>>>>>      >>>>   >       will use
>>>>>      >>>>   >       the same queue and hence are ordered.
>>>>>      >>>>   >
>>>>>      >>>>   >       >>
>>>>>      >>>>   >       >> One thing I didn't realize is that 
>>>>>because we only
>>>>>      get one
>>>>>      >>>>   >       "VM_BIND" engine,
>>>>>      >>>>   >       >> there is a disconnect from the Vulkan 
>>>>>specification.
>>>>>      >>>>   >       >>
>>>>>      >>>>   >       >> In Vulkan VM_BIND operations are 
>>>>>serialized but
>>>>>      >>>>per engine.
>>>>>      >>>>   >       >>
>>>>>      >>>>   >       >> So you could have something like this :
>>>>>      >>>>   >       >>
>>>>>      >>>>   >       >> VM_BIND (engine=rcs0, in_fence=fence1,
>>>>>      out_fence=fence2)
>>>>>      >>>>   >       >>
>>>>>      >>>>   >       >> VM_BIND (engine=ccs0, in_fence=fence3,
>>>>>      out_fence=fence4)
>>>>>      >>>>   >       >>
>>>>>      >>>>   >       >>
>>>>>      >>>>   >       >> fence1 is not signaled
>>>>>      >>>>   >       >>
>>>>>      >>>>   >       >> fence3 is signaled
>>>>>      >>>>   >       >>
>>>>>      >>>>   >       >> So the second VM_BIND will proceed before the
>>>>>      >>>>first VM_BIND.
>>>>>      >>>>   >       >>
>>>>>      >>>>   >       >>
>>>>>      >>>>   >       >> I guess we can deal with that scenario in
>>>>>      >>>>userspace by doing
>>>>>      >>>>   the
>>>>>      >>>>   >       wait
>>>>>      >>>>   >       >> ourselves in one thread per engines.
>>>>>      >>>>   >       >>
>>>>>      >>>>   >       >> But then it makes the VM_BIND input 
>>>>>fences useless.
>>>>>      >>>>   >       >>
>>>>>      >>>>   >       >>
>>>>>      >>>>   >       >> Daniel : what do you think? Should be 
>>>>>rework this or
>>>>>      just
>>>>>      >>>>   deal with
>>>>>      >>>>   >       wait
>>>>>      >>>>   >       >> fences in userspace?
>>>>>      >>>>   >       >>
>>>>>      >>>>   >       >
>>>>>      >>>>   >       >My opinion is rework this but make the 
>>>>>ordering via
>>>>>      >>>>an engine
>>>>>      >>>>   param
>>>>>      >>>>   >       optional.
>>>>>      >>>>   >       >
>>>>>      >>>>   >       >e.g. A VM can be configured so all binds 
>>>>>are ordered
>>>>>      >>>>within the
>>>>>      >>>>   VM
>>>>>      >>>>   >       >
>>>>>      >>>>   >       >e.g. A VM can be configured so all binds 
>>>>>accept an
>>>>>      engine
>>>>>      >>>>   argument
>>>>>      >>>>   >       (in
>>>>>      >>>>   >       >the case of the i915 likely this is a 
>>>>>gem context
>>>>>      >>>>handle) and
>>>>>      >>>>   binds
>>>>>      >>>>   >       >ordered with respect to that engine.
>>>>>      >>>>   >       >
>>>>>      >>>>   >       >This gives UMDs options as the later 
>>>>>likely consumes
>>>>>      >>>>more KMD
>>>>>      >>>>   >       resources
>>>>>      >>>>   >       >so if a different UMD can live with binds being
>>>>>      >>>>ordered within
>>>>>      >>>>   the VM
>>>>>      >>>>   >       >they can use a mode consuming less resources.
>>>>>      >>>>   >       >
>>>>>      >>>>   >
>>>>>      >>>>   >       I think we need to be careful here if we 
>>>>>are looking
>>>>>      for some
>>>>>      >>>>   out of
>>>>>      >>>>   >       (submission) order completion of vm_bind/unbind.
>>>>>      >>>>   >       In-order completion means, in a batch of 
>>>>>binds and
>>>>>      >>>>unbinds to be
>>>>>      >>>>   >       completed in-order, user only needs to specify
>>>>>      >>>>in-fence for the
>>>>>      >>>>   >       first bind/unbind call and the out-fence 
>>>>>for the last
>>>>>      >>>>   bind/unbind
>>>>>      >>>>   >       call. Also, the VA released by an unbind 
>>>>>call can be
>>>>>      >>>>re-used by
>>>>>      >>>>   >       any subsequent bind call in that in-order batch.
>>>>>      >>>>   >
>>>>>      >>>>   >       These things will break if 
>>>>>binding/unbinding were to
>>>>>      >>>>be allowed
>>>>>      >>>>   to
>>>>>      >>>>   >       go out of order (of submission) and user 
>>>>>need to be
>>>>>      extra
>>>>>      >>>>   careful
>>>>>      >>>>   >       not to run into premature triggering of 
>>>>>out-fence and
>>>>>      bind
>>>>>      >>>>   failing
>>>>>      >>>>   >       as VA is still in use etc.
>>>>>      >>>>   >
>>>>>      >>>>   >       Also, VM_BIND binds the provided mapping on the
>>>>>      specified
>>>>>      >>>>   address
>>>>>      >>>>   >       space
>>>>>      >>>>   >       (VM). So, the uapi is not engine/context 
>>>>>specific.
>>>>>      >>>>   >
>>>>>      >>>>   >       We can however add a 'queue' to the uapi 
>>>>>which can be
>>>>>      >>>>one from
>>>>>      >>>>   the
>>>>>      >>>>   >       pre-defined queues,
>>>>>      >>>>   >       I915_VM_BIND_QUEUE_0
>>>>>      >>>>   >       I915_VM_BIND_QUEUE_1
>>>>>      >>>>   >       ...
>>>>>      >>>>   >       I915_VM_BIND_QUEUE_(N-1)
>>>>>      >>>>   >
>>>>>      >>>>   >       KMD will spawn an async work queue for 
>>>>>each queue which
>>>>>      will
>>>>>      >>>>   only
>>>>>      >>>>   >       bind the mappings on that queue in the order of
>>>>>      submission.
>>>>>      >>>>   >       User can assign the queue to per engine 
>>>>>or anything
>>>>>      >>>>like that.
>>>>>      >>>>   >
>>>>>      >>>>   >       But again here, user need to be careful and not
>>>>>      >>>>deadlock these
>>>>>      >>>>   >       queues with circular dependency of fences.
>>>>>      >>>>   >
>>>>>      >>>>   >       I prefer adding this later an as 
>>>>>extension based on
>>>>>      >>>>whether it
>>>>>      >>>>   >       is really helping with the implementation.
>>>>>      >>>>   >
>>>>>      >>>>   >     I can tell you right now that having 
>>>>>everything on a
>>>>>      single
>>>>>      >>>>   in-order
>>>>>      >>>>   >     queue will not get us the perf we want.  
>>>>>What vulkan
>>>>>      >>>>really wants
>>>>>      >>>>   is one
>>>>>      >>>>   >     of two things:
>>>>>      >>>>   >      1. No implicit ordering of VM_BIND ops.  They just
>>>>>      happen in
>>>>>      >>>>   whatever
>>>>>      >>>>   >     their dependencies are resolved and we 
>>>>>ensure ordering
>>>>>      >>>>ourselves
>>>>>      >>>>   by
>>>>>      >>>>   >     having a syncobj in the VkQueue.
>>>>>      >>>>   >      2. The ability to create multiple VM_BIND 
>>>>>queues.  We
>>>>>      need at
>>>>>      >>>>   least 2
>>>>>      >>>>   >     but I don't see why there needs to be a 
>>>>>limit besides
>>>>>      >>>>the limits
>>>>>      >>>>   the
>>>>>      >>>>   >     i915 API already has on the number of 
>>>>>engines.  Vulkan
>>>>>      could
>>>>>      >>>>   expose
>>>>>      >>>>   >     multiple sparse binding queues to the 
>>>>>client if it's not
>>>>>      >>>>   arbitrarily
>>>>>      >>>>   >     limited.
>>>>>      >>>>
>>>>>      >>>>   Thanks Jason, Lionel.
>>>>>      >>>>
>>>>>      >>>>   Jason, what are you referring to when you say 
>>>>>"limits the i915
>>>>>      API
>>>>>      >>>>   already
>>>>>      >>>>   has on the number of engines"? I am not sure if 
>>>>>there is such
>>>>>      an uapi
>>>>>      >>>>   today.
>>>>>      >>>>
>>>>>      >>>> There's a limit of something like 64 total engines 
>>>>>today based on
>>>>>      the
>>>>>      >>>> number of bits we can cram into the exec flags in 
>>>>>execbuffer2.  I
>>>>>      think
>>>>>      >>>> someone had an extended version that allowed more 
>>>>>but I ripped it
>>>>>      out
>>>>>      >>>> because no one was using it.  Of course, 
>>>>>execbuffer3 might not
>>>>>      >>>>have that
>>>>>      >>>> problem at all.
>>>>>      >>>>
>>>>>      >>>
>>>>>      >>>Thanks Jason.
>>>>>      >>>Ok, I am not sure which exec flag is that, but yah, 
>>>>>execbuffer3
>>>>>      probably
>>>>>      >>>will not have this limitation. So, we need to define a
>>>>>      VM_BIND_MAX_QUEUE
>>>>>      >>>and somehow export it to user (I am thinking of 
>>>>>embedding it in
>>>>>      >>>I915_PARAM_HAS_VM_BIND. bits[0]->HAS_VM_BIND, bits[1-3]->'n'
>>>>>      meaning 2^n
>>>>>      >>>queues).
>>>>>      >>
>>>>>      >>Ah, I think you are talking about I915_EXEC_RING_MASK 
>>>>>(0x3f) which
>>>>>      execbuf3
>>>>>
>>>>>    Yup!  That's exactly the limit I was talking about.
>>>>>
>>>>>      >>will also have. So, we can simply define in vm_bind/unbind
>>>>>      structures,
>>>>>      >>
>>>>>      >>#define I915_VM_BIND_MAX_QUEUE   64
>>>>>      >>        __u32 queue;
>>>>>      >>
>>>>>      >>I think that will keep things simple.
>>>>>      >
>>>>>      >Hmmm? What does execbuf2 limit has to do with how many engines
>>>>>      >hardware can have? I suggest not to do that.
>>>>>      >
>>>>>      >The change which added this:
>>>>>      >
>>>>>      >       if (set.num_engines > I915_EXEC_RING_MASK + 1)
>>>>>      >               return -EINVAL;
>>>>>      >
>>>>>      >to context creation needs to be undone, so users can 
>>>>>create engine
>>>>>      >maps with all hardware engines, and execbuf3 can access 
>>>>>them all.
>>>>>      >
>>>>>
>>>>>      The earlier plan was to carry I915_EXEC_RING_MASK (0x3f) to 
>>>>>execbuff3 also.
>>>>>      Hence, I was using the same limit for VM_BIND queues 
>>>>>(64, or 65 if we
>>>>>      make it N+1).
>>>>>      But, as discussed in other thread of this RFC series, we 
>>>>>are planning
>>>>>      to drop this I915_EXEC_RING_MASK in execbuff3. So, there won't be
>>>>>      any uapi that limits the number of engines (and hence 
>>>>>the vm_bind
>>>>>      queues
>>>>>      need to be supported).
>>>>>
>>>>>      If we leave the number of vm_bind queues to be arbitrarily large
>>>>>      (__u32 queue_idx) then, we need to have a hashmap for 
>>>>>queue (a wq,
>>>>>      work_item and a linked list) lookup from the user 
>>>>>specified queue
>>>>>      index.
>>>>>      Other option is to just put some hard limit (say 64 or 
>>>>>65) and use
>>>>>      an array of queues in VM (each created upon first use). 
>>>>>I prefer this.
>>>>>
>>>>>    I don't get why a VM_BIND queue is any different from any 
>>>>>other queue or
>>>>>    userspace-visible kernel object.  But I'll leave those 
>>>>>details up to
>>>>>    danvet or whoever else might be reviewing the implementation.
>>>>>    --Jason
>>>>>
>>>>>  I kind of agree here. Wouldn't it be simpler to have the bind 
>>>>>queue created
>>>>>  like the others when we build the engine map?
>>>>>
>>>>>  For userspace it's then just a matter of selecting the right 
>>>>>queue ID when
>>>>>  submitting.
>>>>>
>>>>>  If there is ever a possibility to have this work on the GPU, 
>>>>>it would be
>>>>>  all ready.
>>>>>
>>>>
>>>>I did sync offline with Matt Brost on this.
>>>>We can add a VM_BIND engine class and let user create VM_BIND 
>>>>engines (queues).
>>>>The problem is, in i915 the engine creation interface is bound to 
>>>>gem_context.
>>>>So, in vm_bind ioctl, we would need both context_id and 
>>>>queue_idx for proper
>>>>lookup of the user created engine. This is a bit awkward as vm_bind is an
>>>>interface to VM (address space) and has nothing to do with gem_context.
>>>
>>>
>>>A gem_context has a single vm object right?
>>>
>>>Set through I915_CONTEXT_PARAM_VM at creation or given a default 
>>>one if not.
>>>
>>>So it's just like picking up the vm like it's done at execbuffer 
>>>time right now : eb->context->vm
>>>
>>
>>Are you suggesting replacing 'vm_id' with 'context_id' in the 
>>VM_BIND/UNBIND
>>ioctl and probably calling it CONTEXT_BIND/UNBIND, because VM can be 
>>obtained
>>from the context?
>
>
>Yes, because if we go for engines, they're associated with a context 
>and so also associated with the VM bound to the context.
>

Hmm...context doesn't sound like the right interface. It should be
VM and engine (independent of context). An engine can be a virtual or soft
engine (kernel thread), each with its own queue. We can add an interface
to create such engines (independent of context). But we are anyway
implicitly creating one when the user uses a new queue_idx. If in future
we have hardware engines for the VM_BIND operation, we can have that
explicit interface to create engine instances, and the queue_index
in vm_bind/unbind will point to those engines.
Anyone have any thoughts? Daniel?

Niranjana
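
To make the per-VM queue idea concrete, here is a hypothetical sketch of
where the index could sit; neither this struct variant nor the define
exists in the posted RFC.

/* Hypothetical only: a vm_bind payload carrying a per-VM queue index.
 * Binds on the same queue complete in submission order; queues are per
 * VM, so binds on one VM never block binds on another.
 */
#include <linux/types.h>

#define EXAMPLE_VM_BIND_MAX_QUEUE	8	/* e.g. n=3 => 2^3 queues */

struct example_gem_vm_bind_with_queue {
	__u32 vm_id;		/* VM (address space) id to bind */
	__u32 handle;		/* object handle */
	__u64 start;		/* VA start to bind */
	__u64 offset;		/* offset in object to bind */
	__u64 length;		/* length of mapping to bind */
	__u64 flags;
	__u32 queue_idx;	/* which per-VM async bind queue to use */
	__u32 pad;		/* MBZ */
	__u64 extensions;	/* 0-terminated extension chain */
};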

>
>>I think the interface is clean as an interface to VM. It is only that we
>>don't have a clean way to create a raw VM_BIND engine (not 
>>associated with
>>any context) with i915 uapi.
>>May be we can add such an interface, but I don't think that is worth it
>>(we might as well just use a queue_idx in VM_BIND/UNBIND ioctl as I 
>>mentioned
>>above).
>>Anyone has any thoughts?
>>
>>>
>>>>Another problem is, if two VMs are binding with the same defined 
>>>>engine,
>>>>binding on VM1 can get unnecessarily blocked by binding on VM2 
>>>>(which may be
>>>>waiting on its in_fence).
>>>
>>>
>>>Maybe I'm missing something, but how can you have 2 vm objects 
>>>with a single gem_context right now?
>>>
>>
>>No, we don't have 2 VMs for a gem_context.
>>Say if ctx1 with vm1 and ctx2 with vm2.
>>First vm_bind call was for vm1 with q_idx 1 in ctx1 engine map.
>>Second vm_bind call was for vm2 with q_idx 2 in ctx2 engine map. If 
>>those two queue indices point to the same underlying vm_bind engine,
>>then the second vm_bind call gets blocked until the first vm_bind call's
>>'in' fence is triggered and bind completes.
>>
>>With per-VM queues, this is not a problem as two VMs will not end up
>>sharing same queue.
>>
>>BTW, I just posted an updated PATCH series.
>>https://www.spinics.net/lists/dri-devel/msg350483.html
>>
>>Niranjana
>>
>>>
>>>>
>>>>So, my preference here is to just add a 'u32 queue' index in 
>>>>vm_bind/unbind
>>>>ioctl, and the queues are per VM.
>>>>
>>>>Niranjana
>>>>
>>>>>  Thanks,
>>>>>
>>>>>  -Lionel
>>>>>
>>>>>
>>>>>      Niranjana
>>>>>
>>>>>      >Regards,
>>>>>      >
>>>>>      >Tvrtko
>>>>>      >
>>>>>      >>
>>>>>      >>Niranjana
>>>>>      >>
>>>>>      >>>
>>>>>      >>>>   I am trying to see how many queues we need and 
>>>>>don't want it to
>>>>>      be
>>>>>      >>>>   arbitrarily
>>>>>      >>>>   large and unduly blow up memory usage and 
>>>>>complexity in i915
>>>>>      driver.
>>>>>      >>>>
>>>>>      >>>> I expect a Vulkan driver to use at most 2 in the 
>>>>>vast majority
>>>>>      >>>>of cases. I
>>>>>      >>>> could imagine a client wanting to create more than 1 sparse
>>>>>      >>>>queue in which
>>>>>      >>>> case, it'll be N+1 but that's unlikely. As far as 
>>>>>complexity
>>>>>      >>>>goes, once
>>>>>      >>>> you allow two, I don't think the complexity is going up by
>>>>>      >>>>allowing N.  As
>>>>>      >>>> for memory usage, creating more queues means more 
>>>>>memory.  That's
>>>>>      a
>>>>>      >>>> trade-off that userspace can make. Again, the 
>>>>>expected number
>>>>>      >>>>here is 1
>>>>>      >>>> or 2 in the vast majority of cases so I don't think 
>>>>>you need to
>>>>>      worry.
>>>>>      >>>
>>>>>      >>>Ok, will start with n=3 meaning 8 queues.
>>>>>      >>>That would require us to create 8 workqueues.
>>>>>      >>>We can change 'n' later if required.
>>>>>      >>>
>>>>>      >>>Niranjana
>>>>>      >>>
>>>>>      >>>>
>>>>>      >>>>   >     Why?  Because Vulkan has two basic kind of bind
>>>>>      >>>>operations and we
>>>>>      >>>>   don't
>>>>>      >>>>   >     want any dependencies between them:
>>>>>      >>>>   >      1. Immediate.  These happen right after BO 
>>>>>creation or
>>>>>      >>>>maybe as
>>>>>      >>>>   part of
>>>>>      >>>>   >     vkBindImageMemory() or VkBindBufferMemory().  These
>>>>>      >>>>don't happen
>>>>>      >>>>   on a
>>>>>      >>>>   >     queue and we don't want them serialized 
>>>>>with anything.       To
>>>>>      >>>>   synchronize
>>>>>      >>>>   >     with submit, we'll have a syncobj in the 
>>>>>VkDevice which
>>>>>      is
>>>>>      >>>>   signaled by
>>>>>      >>>>   >     all immediate bind operations and make 
>>>>>submits wait on
>>>>>      it.
>>>>>      >>>>   >      2. Queued (sparse): These happen on a 
>>>>>VkQueue which may
>>>>>      be the
>>>>>      >>>>   same as
>>>>>      >>>>   >     a render/compute queue or may be its own 
>>>>>queue.  It's up
>>>>>      to us
>>>>>      >>>>   what we
>>>>>      >>>>   >     want to advertise.  From the Vulkan API 
>>>>>PoV, this is like
>>>>>      any
>>>>>      >>>>   other
>>>>>      >>>>   >     queue.  Operations on it wait on and signal 
>>>>>semaphores.       If we
>>>>>      >>>>   have a
>>>>>      >>>>   >     VM_BIND engine, we'd provide syncobjs to wait and
>>>>>      >>>>signal just like
>>>>>      >>>>   we do
>>>>>      >>>>   >     in execbuf().
>>>>>      >>>>   >     The important thing is that we don't want 
>>>>>one type of
>>>>>      >>>>operation to
>>>>>      >>>>   block
>>>>>      >>>>   >     on the other.  If immediate binds are 
>>>>>blocking on sparse
>>>>>      binds,
>>>>>      >>>>   it's
>>>>>      >>>>   >     going to cause over-synchronization issues.
>>>>>      >>>>   >     In terms of the internal implementation, I 
>>>>>know that
>>>>>      >>>>there's going
>>>>>      >>>>   to be
>>>>>      >>>>   >     a lock on the VM and that we can't actually 
>>>>>do these
>>>>>      things in
>>>>>      >>>>   >     parallel.  That's fine. Once the dma_fences have
>>>>>      signaled and
>>>>>      >>>>   we're
>>>>>      >>>>
>>>>>      >>>>   That's correct. It is like a single VM_BIND engine with
>>>>>      >>>>multiple queues
>>>>>      >>>>   feeding to it.
>>>>>      >>>>
>>>>>      >>>> Right.  As long as the queues themselves are 
>>>>>independent and
>>>>>      >>>>can block on
>>>>>      >>>> dma_fences without holding up other queues, I think 
>>>>>we're fine.
>>>>>      >>>>
>>>>>      >>>>   >     unblocked to do the bind operation, I don't care if
>>>>>      >>>>there's a bit
>>>>>      >>>>   of
>>>>>      >>>>   >     synchronization due to locking.  That's 
>>>>>expected.  What
>>>>>      >>>>we can't
>>>>>      >>>>   afford
>>>>>      >>>>   >     to have is an immediate bind operation 
>>>>>suddenly blocking
>>>>>      on a
>>>>>      >>>>   sparse
>>>>>      >>>>   >     operation which is blocked on a compute job 
>>>>>that's going
>>>>>      to run
>>>>>      >>>>   for
>>>>>      >>>>   >     another 5ms.
>>>>>      >>>>
>>>>>      >>>>   As the VM_BIND queue is per VM, VM_BIND on one VM 
>>>>>doesn't block
>>>>>      the
>>>>>      >>>>   VM_BIND
>>>>>      >>>>   on other VMs. I am not sure about use cases here, but just
>>>>>      wanted to
>>>>>      >>>>   clarify.
>>>>>      >>>>
>>>>>      >>>> Yes, that's what I would expect.
>>>>>      >>>> --Jason
>>>>>      >>>>
>>>>>      >>>>   Niranjana
>>>>>      >>>>
>>>>>      >>>>   >     For reference, Windows solves this by allowing
>>>>>      arbitrarily many
>>>>>      >>>>   paging
>>>>>      >>>>   >     queues (what they call a VM_BIND 
>>>>>engine/queue).  That
>>>>>      >>>>design works
>>>>>      >>>>   >     pretty well and solves the problems in 
>>>>>question. Again, we could
>>>>>      >>>>   just
>>>>>      >>>>   >     make everything out-of-order and require 
>>>>>using syncobjs
>>>>>      >>>>to order
>>>>>      >>>>   things
>>>>>      >>>>   >     as userspace wants. That'd be fine too.
>>>>>      >>>>   >     One more note while I'm here: danvet said 
>>>>>something on
>>>>>      >>>>IRC about
>>>>>      >>>>   VM_BIND
>>>>>      >>>>   >     queues waiting for syncobjs to 
>>>>>materialize.  We don't
>>>>>      really
>>>>>      >>>>   want/need
>>>>>      >>>>   >     this.  We already have all the machinery in 
>>>>>userspace to
>>>>>      handle
>>>>>      >>>>   >     wait-before-signal and waiting for syncobj 
>>>>>fences to
>>>>>      >>>>materialize
>>>>>      >>>>   and
>>>>>      >>>>   >     that machinery is on by default.  It would actually
>>>>>      >>>>take MORE work
>>>>>      >>>>   in
>>>>>      >>>>   >     Mesa to turn it off and take advantage of 
>>>>>the kernel
>>>>>      >>>>being able to
>>>>>      >>>>   wait
>>>>>      >>>>   >     for syncobjs to materialize. Also, getting 
>>>>>that right is
>>>>>      >>>>   ridiculously
>>>>>      >>>>   >     hard and I really don't want to get it 
>>>>>wrong in kernel
>>>>>      >>>>space. When we
>>>>>      >>>>   >     do memory fences, wait-before-signal will 
>>>>>be a thing.  We
>>>>>      don't
>>>>>      >>>>   need to
>>>>>      >>>>   >     try and make it a thing for syncobj.
>>>>>      >>>>   >     --Jason
>>>>>      >>>>   >
>>>>>      >>>>   >   Thanks Jason,
>>>>>      >>>>   >
>>>>>      >>>>   >   I missed the bit in the Vulkan spec that 
>>>>>we're allowed to
>>>>>      have a
>>>>>      >>>>   sparse
>>>>>      >>>>   >   queue that does not implement either graphics 
>>>>>or compute
>>>>>      >>>>operations
>>>>>      >>>>   :
>>>>>      >>>>   >
>>>>>      >>>>   >     "While some implementations may include
>>>>>      >>>>   VK_QUEUE_SPARSE_BINDING_BIT
>>>>>      >>>>   >     support in queue families that also include
>>>>>      >>>>   >
>>>>>      >>>>   >      graphics and compute support, other 
>>>>>implementations may
>>>>>      only
>>>>>      >>>>   expose a
>>>>>      >>>>   > VK_QUEUE_SPARSE_BINDING_BIT-only queue
>>>>>      >>>>   >
>>>>>      >>>>   >      family."
>>>>>      >>>>   >
>>>>>      >>>>   >   So it can all be a vm_bind engine that just does
>>>>>      bind/unbind
>>>>>      >>>>   >   operations.
>>>>>      >>>>   >
>>>>>      >>>>   >   But yes we need another engine for the 
>>>>>immediate/non-sparse
>>>>>      >>>>   operations.
>>>>>      >>>>   >
>>>>>      >>>>   >   -Lionel
>>>>>      >>>>   >
>>>>>      >>>>   >         >
>>>>>      >>>>   >       Daniel, any thoughts?
>>>>>      >>>>   >
>>>>>      >>>>   >       Niranjana
>>>>>      >>>>   >
>>>>>      >>>>   >       >Matt
>>>>>      >>>>   >       >
>>>>>      >>>>   >       >>
>>>>>      >>>>   >       >> Sorry I noticed this late.
>>>>>      >>>>   >       >>
>>>>>      >>>>   >       >>
>>>>>      >>>>   >       >> -Lionel
>>>>>      >>>>   >       >>
>>>>>      >>>>   >       >>
>>>
>>>
>

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [Intel-gfx] [RFC v3 1/3] drm/doc/rfc: VM_BIND feature design document
@ 2022-06-10 17:42                                       ` Niranjana Vishwanathapura
  0 siblings, 0 replies; 121+ messages in thread
From: Niranjana Vishwanathapura @ 2022-06-10 17:42 UTC (permalink / raw)
  To: Lionel Landwerlin
  Cc: Intel GFX, Maling list - DRI developers, Thomas Hellstrom,
	Chris Wilson, Daniel Vetter, Christian König

On Fri, Jun 10, 2022 at 11:18:14AM +0300, Lionel Landwerlin wrote:
>On 10/06/2022 10:54, Niranjana Vishwanathapura wrote:
>>On Fri, Jun 10, 2022 at 09:53:24AM +0300, Lionel Landwerlin wrote:
>>>On 09/06/2022 22:31, Niranjana Vishwanathapura wrote:
>>>>On Thu, Jun 09, 2022 at 05:49:09PM +0300, Lionel Landwerlin wrote:
>>>>>  On 09/06/2022 00:55, Jason Ekstrand wrote:
>>>>>
>>>>>    On Wed, Jun 8, 2022 at 4:44 PM Niranjana Vishwanathapura
>>>>>    <niranjana.vishwanathapura@intel.com> wrote:
>>>>>
>>>>>      On Wed, Jun 08, 2022 at 08:33:25AM +0100, Tvrtko Ursulin wrote:
>>>>>      >
>>>>>      >
>>>>>      >On 07/06/2022 22:32, Niranjana Vishwanathapura wrote:
>>>>>      >>On Tue, Jun 07, 2022 at 11:18:11AM -0700, Niranjana 
>>>>>Vishwanathapura
>>>>>      wrote:
>>>>>      >>>On Tue, Jun 07, 2022 at 12:12:03PM -0500, Jason 
>>>>>Ekstrand wrote:
>>>>>      >>>> On Fri, Jun 3, 2022 at 6:52 PM Niranjana Vishwanathapura
>>>>>      >>>> <niranjana.vishwanathapura@intel.com> wrote:
>>>>>      >>>>
>>>>>      >>>>   On Fri, Jun 03, 2022 at 10:20:25AM +0300, Lionel 
>>>>>Landwerlin
>>>>>      wrote:
>>>>>      >>>>   >   On 02/06/2022 23:35, Jason Ekstrand wrote:
>>>>>      >>>>   >
>>>>>      >>>>   >     On Thu, Jun 2, 2022 at 3:11 PM Niranjana 
>>>>>Vishwanathapura
>>>>>      >>>>   > <niranjana.vishwanathapura@intel.com> wrote:
>>>>>      >>>>   >
>>>>>      >>>>   >       On Wed, Jun 01, 2022 at 01:28:36PM -0700, Matthew
>>>>>      >>>>Brost wrote:
>>>>>      >>>>   >       >On Wed, Jun 01, 2022 at 05:25:49PM +0300, Lionel
>>>>>      Landwerlin
>>>>>      >>>>   wrote:
>>>>>      >>>>   >       >> On 17/05/2022 21:32, Niranjana Vishwanathapura
>>>>>      wrote:
>>>>>      >>>>   >       >> > +VM_BIND/UNBIND ioctl will immediately start
>>>>>      >>>>   binding/unbinding
>>>>>      >>>>   >       the mapping in an
>>>>>      >>>>   >       >> > +async worker. The binding and 
>>>>>unbinding will
>>>>>      >>>>work like a
>>>>>      >>>>   special
>>>>>      >>>>   >       GPU engine.
>>>>>      >>>>   >       >> > +The binding and unbinding operations are
>>>>>      serialized and
>>>>>      >>>>   will
>>>>>      >>>>   >       wait on specified
>>>>>      >>>>   >       >> > +input fences before the operation 
>>>>>and will signal
>>>>>      the
>>>>>      >>>>   output
>>>>>      >>>>   >       fences upon the
>>>>>      >>>>   >       >> > +completion of the operation. Due to
>>>>>      serialization,
>>>>>      >>>>   completion of
>>>>>      >>>>   >       an operation
>>>>>      >>>>   >       >> > +will also indicate that all 
>>>>>previous operations
>>>>>      >>>>are also
>>>>>      >>>>   >       complete.
>>>>>      >>>>   >       >>
>>>>>      >>>>   >       >> I guess we should avoid saying "will 
>>>>>immediately
>>>>>      start
>>>>>      >>>>   >       binding/unbinding" if
>>>>>      >>>>   >       >> there are fences involved.
>>>>>      >>>>   >       >>
>>>>>      >>>>   >       >> And the fact that it's happening in an async
>>>>>      >>>>worker seem to
>>>>>      >>>>   imply
>>>>>      >>>>   >       it's not
>>>>>      >>>>   >       >> immediate.
>>>>>      >>>>   >       >>
>>>>>      >>>>   >
>>>>>      >>>>   >       Ok, will fix.
>>>>>      >>>>   >       This was added because in earlier design 
>>>>>binding was
>>>>>      deferred
>>>>>      >>>>   until
>>>>>      >>>>   >       next execbuff.
>>>>>      >>>>   >       But now it is non-deferred (immediate in 
>>>>>that sense).
>>>>>      >>>>But yah,
>>>>>      >>>>   this is
>>>>>      >>>>   >       confusing
>>>>>      >>>>   >       and will fix it.
>>>>>      >>>>   >
>>>>>      >>>>   >       >>
>>>>>      >>>>   >       >> I have a question on the behavior of the bind
>>>>>      >>>>operation when
>>>>>      >>>>   no
>>>>>      >>>>   >       input fence
>>>>>      >>>>   >       >> is provided. Let say I do :
>>>>>      >>>>   >       >>
>>>>>      >>>>   >       >> VM_BIND (out_fence=fence1)
>>>>>      >>>>   >       >>
>>>>>      >>>>   >       >> VM_BIND (out_fence=fence2)
>>>>>      >>>>   >       >>
>>>>>      >>>>   >       >> VM_BIND (out_fence=fence3)
>>>>>      >>>>   >       >>
>>>>>      >>>>   >       >>
>>>>>      >>>>   >       >> In what order are the fences going to 
>>>>>be signaled?
>>>>>      >>>>   >       >>
>>>>>      >>>>   >       >> In the order of VM_BIND ioctls? Or out 
>>>>>of order?
>>>>>      >>>>   >       >>
>>>>>      >>>>   >       >> Because you wrote "serialized" I assume
>>>>>it's: in
>>>>>      order
>>>>>      >>>>   >       >>
>>>>>      >>>>   >
>>>>>      >>>>   >       Yes, in the order of VM_BIND/UNBIND 
>>>>>ioctls. Note that
>>>>>      >>>>bind and
>>>>>      >>>>   unbind
>>>>>      >>>>   >       will use
>>>>>      >>>>   >       the same queue and hence are ordered.
>>>>>      >>>>   >
>>>>>      >>>>   >       >>
>>>>>      >>>>   >       >> One thing I didn't realize is that 
>>>>>because we only
>>>>>      get one
>>>>>      >>>>   >       "VM_BIND" engine,
>>>>>      >>>>   >       >> there is a disconnect from the Vulkan 
>>>>>specification.
>>>>>      >>>>   >       >>
>>>>>      >>>>   >       >> In Vulkan VM_BIND operations are 
>>>>>serialized but
>>>>>      >>>>per engine.
>>>>>      >>>>   >       >>
>>>>>      >>>>   >       >> So you could have something like this :
>>>>>      >>>>   >       >>
>>>>>      >>>>   >       >> VM_BIND (engine=rcs0, in_fence=fence1,
>>>>>      out_fence=fence2)
>>>>>      >>>>   >       >>
>>>>>      >>>>   >       >> VM_BIND (engine=ccs0, in_fence=fence3,
>>>>>      out_fence=fence4)
>>>>>      >>>>   >       >>
>>>>>      >>>>   >       >>
>>>>>      >>>>   >       >> fence1 is not signaled
>>>>>      >>>>   >       >>
>>>>>      >>>>   >       >> fence3 is signaled
>>>>>      >>>>   >       >>
>>>>>      >>>>   >       >> So the second VM_BIND will proceed before the
>>>>>      >>>>first VM_BIND.
>>>>>      >>>>   >       >>
>>>>>      >>>>   >       >>
>>>>>      >>>>   >       >> I guess we can deal with that scenario in
>>>>>      >>>>userspace by doing
>>>>>      >>>>   the
>>>>>      >>>>   >       wait
>>>>>      >>>>   >       >> ourselves in one thread per engines.
>>>>>      >>>>   >       >>
>>>>>      >>>>   >       >> But then it makes the VM_BIND input 
>>>>>fences useless.
>>>>>      >>>>   >       >>
>>>>>      >>>>   >       >>
>>>>>      >>>>   >       >> Daniel : what do you think? Should be 
>>>>>rework this or
>>>>>      just
>>>>>      >>>>   deal with
>>>>>      >>>>   >       wait
>>>>>      >>>>   >       >> fences in userspace?
>>>>>      >>>>   >       >>
>>>>>      >>>>   >       >
>>>>>      >>>>   >       >My opinion is rework this but make the 
>>>>>ordering via
>>>>>      >>>>an engine
>>>>>      >>>>   param
>>>>>      >>>>   >       optional.
>>>>>      >>>>   >       >
>>>>>      >>>>   >       >e.g. A VM can be configured so all binds 
>>>>>are ordered
>>>>>      >>>>within the
>>>>>      >>>>   VM
>>>>>      >>>>   >       >
>>>>>      >>>>   >       >e.g. A VM can be configured so all binds 
>>>>>accept an
>>>>>      engine
>>>>>      >>>>   argument
>>>>>      >>>>   >       (in
>>>>>      >>>>   >       >the case of the i915 likely this is a 
>>>>>gem context
>>>>>      >>>>handle) and
>>>>>      >>>>   binds
>>>>>      >>>>   >       >ordered with respect to that engine.
>>>>>      >>>>   >       >
>>>>>      >>>>   >       >This gives UMDs options as the later 
>>>>>likely consumes
>>>>>      >>>>more KMD
>>>>>      >>>>   >       resources
>>>>>      >>>>   >       >so if a different UMD can live with binds being
>>>>>      >>>>ordered within
>>>>>      >>>>   the VM
>>>>>      >>>>   >       >they can use a mode consuming less resources.
>>>>>      >>>>   >       >
>>>>>      >>>>   >
>>>>>      >>>>   >       I think we need to be careful here if we 
>>>>>are looking
>>>>>      for some
>>>>>      >>>>   out of
>>>>>      >>>>   >       (submission) order completion of vm_bind/unbind.
>>>>>      >>>>   >       In-order completion means, in a batch of 
>>>>>binds and
>>>>>      >>>>unbinds to be
>>>>>      >>>>   >       completed in-order, user only needs to specify
>>>>>      >>>>in-fence for the
>>>>>      >>>>   >       first bind/unbind call and the out-fence 
>>>>>for the last
>>>>>      >>>>   bind/unbind
>>>>>      >>>>   >       call. Also, the VA released by an unbind 
>>>>>call can be
>>>>>      >>>>re-used by
>>>>>      >>>>   >       any subsequent bind call in that in-order batch.
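(To make that in-order batching rule concrete, here is a minimal
userspace sketch. Everything below is illustrative only: the struct
layout, fence fields and ioctl request are assumptions loosely modeled
on this RFC, not a settled uapi.)

#include <stdint.h>
#include <sys/ioctl.h>

/* Hypothetical uapi shape, for illustration only. */
struct sketch_vm_bind {
	uint32_t vm_id;     /* VM to bind into */
	uint32_t handle;    /* BO handle */
	uint64_t start;     /* GPU VA */
	uint64_t offset;    /* offset into the BO */
	uint64_t length;
	int32_t  in_fence;  /* wait before binding; -1 for none */
	int32_t  out_fence; /* signaled on completion; -1 if unused */
};

/*
 * Three binds on one in-order queue: only the first op carries the
 * in-fence and only the last op requests an out-fence.  When that
 * fence signals, all three binds have completed, in submission order.
 */
static int bind_batch_in_order(int fd, unsigned long vm_bind_req,
			       struct sketch_vm_bind op[3],
			       int32_t in_fence, int32_t out_fence)
{
	op[0].in_fence = in_fence; op[0].out_fence = -1;
	op[1].in_fence = -1;       op[1].out_fence = -1;
	op[2].in_fence = -1;       op[2].out_fence = out_fence;

	for (int i = 0; i < 3; i++)
		if (ioctl(fd, vm_bind_req, &op[i]))
			return -1;
	return 0;
}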
>>>>>      >>>>   >
>>>>>      >>>>   >       These things will break if 
>>>>>binding/unbinding were to
>>>>>      >>>>be allowed
>>>>>      >>>>   to
>>>>>      >>>>   >       go out of order (of submission) and users 
>>>>>need to be
>>>>>      extra
>>>>>      >>>>   careful
>>>>>      >>>>   >       not to run into premature triggering of 
>>>>>out-fence and
>>>>>      bind
>>>>>      >>>>   failing
>>>>>      >>>>   >       as VA is still in use etc.
>>>>>      >>>>   >
>>>>>      >>>>   >       Also, VM_BIND binds the provided mapping on the
>>>>>      specified
>>>>>      >>>>   address
>>>>>      >>>>   >       space
>>>>>      >>>>   >       (VM). So, the uapi is not engine/context 
>>>>>specific.
>>>>>      >>>>   >
>>>>>      >>>>   >       We can however add a 'queue' to the uapi 
>>>>>which can be
>>>>>      >>>>one from
>>>>>      >>>>   the
>>>>>      >>>>   >       pre-defined queues,
>>>>>      >>>>   >       I915_VM_BIND_QUEUE_0
>>>>>      >>>>   >       I915_VM_BIND_QUEUE_1
>>>>>      >>>>   >       ...
>>>>>      >>>>   >       I915_VM_BIND_QUEUE_(N-1)
>>>>>      >>>>   >
>>>>>      >>>>   >       KMD will spawn an async work queue for 
>>>>>each queue which
>>>>>      will
>>>>>      >>>>   only
>>>>>      >>>>   >       bind the mappings on that queue in the order of
>>>>>      submission.
>>>>>      >>>>   >       User can assign the queue to per engine 
>>>>>or anything
>>>>>      >>>>like that.
>>>>>      >>>>   >
>>>>>      >>>>   >       But again here, users need to be careful not to
>>>>>      >>>>deadlock these
>>>>>      >>>>   >       queues with circular dependencies of fences.
>>>>>      >>>>   >
>>>>>      >>>>   >       I prefer adding this later as an 
>>>>>extension based on
>>>>>      >>>>whether it
>>>>>      >>>>   >       is really helping with the implementation.
>>>>>      >>>>   >
>>>>>      >>>>   >     I can tell you right now that having 
>>>>>everything on a
>>>>>      single
>>>>>      >>>>   in-order
>>>>>      >>>>   >     queue will not get us the perf we want.  
>>>>>What vulkan
>>>>>      >>>>really wants
>>>>>      >>>>   is one
>>>>>      >>>>   >     of two things:
>>>>>      >>>>   >      1. No implicit ordering of VM_BIND ops.  They just
>>>>>      happen in
>>>>>      >>>>   whatever
>>>>>      >>>>   >     their dependencies are resolved and we 
>>>>>ensure ordering
>>>>>      >>>>ourselves
>>>>>      >>>>   by
>>>>>      >>>>   >     having a syncobj in the VkQueue.
>>>>>      >>>>   >      2. The ability to create multiple VM_BIND 
>>>>>queues.  We
>>>>>      need at
>>>>>      >>>>   least 2
>>>>>      >>>>   >     but I don't see why there needs to be a 
>>>>>limit besides
>>>>>      >>>>the limits
>>>>>      >>>>   the
>>>>>      >>>>   >     i915 API already has on the number of 
>>>>>engines.  Vulkan
>>>>>      could
>>>>>      >>>>   expose
>>>>>      >>>>   >     multiple sparse binding queues to the 
>>>>>client if it's not
>>>>>      >>>>   arbitrarily
>>>>>      >>>>   >     limited.
>>>>>      >>>>
>>>>>      >>>>   Thanks Jason, Lionel.
>>>>>      >>>>
>>>>>      >>>>   Jason, what are you referring to when you say 
>>>>>"limits the i915
>>>>>      API
>>>>>      >>>>   already
>>>>>      >>>>   has on the number of engines"? I am not sure if 
>>>>>there is such
>>>>>      an uapi
>>>>>      >>>>   today.
>>>>>      >>>>
>>>>>      >>>> There's a limit of something like 64 total engines 
>>>>>today based on
>>>>>      the
>>>>>      >>>> number of bits we can cram into the exec flags in 
>>>>>execbuffer2.  I
>>>>>      think
>>>>>      >>>> someone had an extended version that allowed more 
>>>>>but I ripped it
>>>>>      out
>>>>>      >>>> because no one was using it.  Of course, 
>>>>>execbuffer3 might not
>>>>>      >>>>have that
>>>>>      >>>> problem at all.
>>>>>      >>>>
>>>>>      >>>
>>>>>      >>>Thanks Jason.
>>>>>      >>>Ok, I am not sure which exec flag is that, but yah, 
>>>>>execbuffer3
>>>>>      probably
>>>>>      >>>will not have this limitation. So, we need to define a
>>>>>      VM_BIND_MAX_QUEUE
>>>>>      >>>and somehow export it to user (I am thinking of 
>>>>>embedding it in
>>>>>      >>>I915_PARAM_HAS_VM_BIND. bits[0]->HAS_VM_BIND, bits[1-3]->'n'
>>>>>      meaning 2^n
>>>>>      >>>queues).
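(As a sketch, userspace could decode that packed getparam value as
below; the bit layout is only the suggestion made here, not a settled
ABI, and the helper name is invented.)

#include <stdbool.h>
#include <stdint.h>

/*
 * Suggested packing of the I915_PARAM_HAS_VM_BIND value:
 *   bit 0     -> VM_BIND supported
 *   bits 1..3 -> n, meaning 2^n VM_BIND queues
 */
static bool decode_vm_bind_param(uint64_t v, unsigned int *nr_queues)
{
	if (!(v & 0x1))
		return false;               /* no VM_BIND support */
	*nr_queues = 1u << ((v >> 1) & 0x7);
	return true;
}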
>>>>>      >>
>>>>>      >>Ah, I think you are talking about I915_EXEC_RING_MASK 
>>>>>(0x3f) which
>>>>>      execbuf3
>>>>>
>>>>>    Yup!  That's exactly the limit I was talking about.
>>>>>
>>>>>      >>will also have. So, we can simply define in vm_bind/unbind
>>>>>      structures,
>>>>>      >>
>>>>>      >>#define I915_VM_BIND_MAX_QUEUE   64
>>>>>      >>        __u32 queue;
>>>>>      >>
>>>>>      >>I think that will keep things simple.
>>>>>      >
>>>>>      >Hmmm? What does the execbuf2 limit have to do with how many engines
>>>>>      >hardware can have? I suggest not to do that.
>>>>>      >
>>>>>      >The change which added this:
>>>>>      >
>>>>>      >       if (set.num_engines > I915_EXEC_RING_MASK + 1)
>>>>>      >               return -EINVAL;
>>>>>      >
>>>>>      >to context creation needs to be undone, so let users 
>>>>>create engine
>>>>>      >maps with all hardware engines, and let execbuf3 access 
>>>>>them all.
>>>>>      >
>>>>>
>>>>>      The earlier plan was to carry I915_EXEC_RING_MASK (0x3f) to 
>>>>>execbuff3 also.
>>>>>      Hence, I was using the same limit for VM_BIND queues 
>>>>>(64, or 65 if we
>>>>>      make it N+1).
>>>>>      But, as discussed in another thread of this RFC series, we
>>>>>are planning
>>>>>      to drop this I915_EXEC_RING_MASK in execbuff3. So, there won't be
>>>>>      any uapi that limits the number of engines (and hence no limit on
>>>>>      the number of vm_bind queues that need to be supported).
>>>>>
>>>>>      If we leave the number of vm_bind queues arbitrarily large
>>>>>      (__u32 queue_idx), then we need a hashmap for queue lookup
>>>>>      (a wq, work_item and a linked list) from the user-specified
>>>>>      queue index.
>>>>>      The other option is to just put some hard limit (say 64 or 65)
>>>>>      and use an array of queues in the VM (each created upon first
>>>>>      use). I prefer this.
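(A rough kernel-side sketch of that preferred option, a hard limit
with queues created on first use; every structure and helper name
below is invented for illustration.)

#define SKETCH_VM_BIND_MAX_QUEUE 64     /* the hard limit discussed */

struct sketch_bind_queue {
	struct workqueue_struct *wq;    /* ordered: binds complete in order */
	struct list_head pending;       /* submitted but not completed ops */
};

struct sketch_vm_queues {
	struct sketch_bind_queue *q[SKETCH_VM_BIND_MAX_QUEUE];
};

/* Look up a per-VM bind queue, creating it on first use. */
static struct sketch_bind_queue *
sketch_get_queue(struct sketch_vm_queues *qs, u32 idx)
{
	if (idx >= SKETCH_VM_BIND_MAX_QUEUE)
		return ERR_PTR(-EINVAL);
	if (!qs->q[idx])
		qs->q[idx] = sketch_create_queue(idx);  /* invented helper */
	return qs->q[idx];
}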
>>>>>
>>>>>    I don't get why a VM_BIND queue is any different from any 
>>>>>other queue or
>>>>>    userspace-visible kernel object.  But I'll leave those 
>>>>>details up to
>>>>>    danvet or whoever else might be reviewing the implementation.
>>>>>    --Jason
>>>>>
>>>>>  I kind of agree here. Wouldn't it be simpler to have the bind 
>>>>>queue created
>>>>>  like the others when we build the engine map?
>>>>>
>>>>>  For userspace it's then just a matter of selecting the right 
>>>>>queue ID when
>>>>>  submitting.
>>>>>
>>>>>  If there is ever a possibility to have this work on the GPU, 
>>>>>it would be
>>>>>  all ready.
>>>>>
>>>>
>>>>I did sync offline with Matt Brost on this.
>>>>We can add a VM_BIND engine class and let the user create VM_BIND
>>>>engines (queues).
>>>>The problem is, in i915 the engine creation interface is bound to
>>>>gem_context.
>>>>So, in the vm_bind ioctl, we would need both a context_id and a
>>>>queue_idx for proper lookup of the user-created engine. This is a
>>>>bit awkward, as vm_bind is an interface to the VM (address space)
>>>>and has nothing to do with gem_context.
>>>
>>>
>>>A gem_context has a single vm object right?
>>>
>>>Set through I915_CONTEXT_PARAM_VM at creation or given a default 
>>>one if not.
>>>
>>>So it's just like picking up the vm like it's done at execbuffer 
>>>time right now: eb->context->vm
>>>
>>
>>Are you suggesting replacing 'vm_id' with 'context_id' in the 
>>VM_BIND/UNBIND
>>ioctl and probably calling it CONTEXT_BIND/UNBIND, because the VM can be 
>>obtained
>>from the context?
>
>
>Yes, because if we go for engines, they're associated with a context 
>and so also associated with the VM bound to the context.
>

Hmm...context doesn't sound like the right interface. It should be
VM and engine (independent of context). An engine can be a virtual or
soft engine (a kernel thread), each with its own queue. We can add an
interface to create such engines (independent of context). But we are
anyway implicitly creating one when the user uses a new queue_idx. If
in the future we have hardware engines for the VM_BIND operation, we
can have an explicit interface to create engine instances, and the
queue_index in vm_bind/unbind will point to those engines.
Anyone have any thoughts? Daniel?
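(To make the two interface shapes under discussion easier to compare,
here is a sketch of both; all field names are hypothetical.)

/* Option A, per-VM queues: a bind addresses the VM directly. */
struct sketch_bind_per_vm {
	uint32_t vm_id;     /* address space to bind into */
	uint32_t queue_idx; /* per-VM queue, created on first use */
	/* ... mapping fields (handle, start, offset, length) ... */
};

/*
 * Option B, context-based: reuse the gem_context engine map, so a
 * bind needs both a context and a queue index to look up the
 * user-created VM_BIND-class engine; the VM is then implied by the
 * context.
 */
struct sketch_bind_per_ctx {
	uint32_t ctx_id;    /* context owning the engine map */
	uint32_t queue_idx; /* VM_BIND engine within that map */
	/* ... mapping fields ... */
};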

Niranjana

>
>>I think the interface is clean as an interface to the VM. It is only
>>that we don't have a clean way to create a raw VM_BIND engine (not
>>associated with any context) with the i915 uapi.
>>Maybe we can add such an interface, but I don't think it is worth it
>>(we might as well just use a queue_idx in the VM_BIND/UNBIND ioctl as
>>I mentioned above).
>>Anyone has any thoughts?
>>
>>>
>>>>Another problem is, if two VMs are binding with the same defined 
>>>>engine,
>>binding on VM1 can get unnecessarily blocked by binding on VM2 
>>>>(which may be
>>>>waiting on its in_fence).
>>>
>>>
>>>Maybe I'm missing something, but how can you have 2 vm objects 
>>>with a single gem_context right now?
>>>
>>
>>No, we don't have 2 VMs for a gem_context.
>>Say ctx1 has vm1 and ctx2 has vm2.
>>The first vm_bind call is for vm1 with q_idx 1 in the ctx1 engine map.
>>The second vm_bind call is for vm2 with q_idx 2 in the ctx2 engine
>>map. If those two queue indices point to the same underlying vm_bind
>>engine, then the second vm_bind call gets blocked until the first
>>vm_bind call's 'in' fence is triggered and its bind completes.
>>
>>With per-VM queues, this is not a problem, as two VMs will never end
>>up sharing the same queue.
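(The scenario above as a compact sketch; vm_bind() is an invented
helper taking a VM id, a queue id and an in-fence, with -1 meaning no
fence.)

#include <stdint.h>

/* Invented helper: queue a bind on 'queue' of 'vm', waiting on
 * 'in_fence' first (-1 for none). */
int vm_bind(uint32_t vm, uint32_t queue, int32_t in_fence);

void example(int32_t fence1 /* not yet signaled */)
{
	/* One engine shared across VMs: the vm2 bind queues up
	 * behind the vm1 bind and stalls until fence1 signals. */
	vm_bind(1 /* vm1 */, 7 /* shared engine */, fence1);
	vm_bind(2 /* vm2 */, 7 /* shared engine */, -1);

	/* Per-VM queues: the same two binds are independent. */
	vm_bind(1 /* vm1 */, 0 /* vm1's queue 0 */, fence1);
	vm_bind(2 /* vm2 */, 0 /* vm2's queue 0 */, -1);
}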
>>
>>BTW, I just posted an updated PATCH series.
>>https://www.spinics.net/lists/dri-devel/msg350483.html
>>
>>Niranjana
>>
>>>
>>>>
>>>>So, my preference here is to just add a 'u32 queue' index in 
>>>>vm_bind/unbind
>>>>ioctl, and the queues are per VM.
>>>>
>>>>Niranjana
>>>>
>>>>>  Thanks,
>>>>>
>>>>>  -Lionel
>>>>>
>>>>>
>>>>>      Niranjana
>>>>>
>>>>>      >Regards,
>>>>>      >
>>>>>      >Tvrtko
>>>>>      >
>>>>>      >>
>>>>>      >>Niranjana
>>>>>      >>
>>>>>      >>>
>>>>>      >>>>   I am trying to see how many queues we need and 
>>>>>don't want it to
>>>>>      be
>>>>>      >>>>   arbitrarily
>>>>>      >>>>   large and unduly blow up memory usage and 
>>>>>complexity in i915
>>>>>      driver.
>>>>>      >>>>
>>>>>      >>>> I expect a Vulkan driver to use at most 2 in the 
>>>>>vast majority
>>>>>      >>>>of cases. I
>>>>>      >>>> could imagine a client wanting to create more than 1 sparse
>>>>>      >>>>queue in which
>>>>>      >>>> case, it'll be N+1 but that's unlikely. As far as 
>>>>>complexity
>>>>>      >>>>goes, once
>>>>>      >>>> you allow two, I don't think the complexity is going up by
>>>>>      >>>>allowing N.  As
>>>>>      >>>> for memory usage, creating more queues means more 
>>>>>memory.  That's
>>>>>      a
>>>>>      >>>> trade-off that userspace can make. Again, the 
>>>>>expected number
>>>>>      >>>>here is 1
>>>>>      >>>> or 2 in the vast majority of cases so I don't think 
>>>>>you need to
>>>>>      worry.
>>>>>      >>>
>>>>>      >>>Ok, will start with n=3 meaning 8 queues.
>>>>>      >>>That would require us to create 8 workqueues.
>>>>>      >>>We can change 'n' later if required.
>>>>>      >>>
>>>>>      >>>Niranjana
>>>>>      >>>
>>>>>      >>>>
>>>>>      >>>>   >     Why?  Because Vulkan has two basic kinds of bind
>>>>>      >>>>operations and we
>>>>>      >>>>   don't
>>>>>      >>>>   >     want any dependencies between them:
>>>>>      >>>>   >      1. Immediate.  These happen right after BO 
>>>>>creation or
>>>>>      >>>>maybe as
>>>>>      >>>>   part of
>>>>>      >>>>   >     vkBindImageMemory() or VkBindBufferMemory().  These
>>>>>      >>>>don't happen
>>>>>      >>>>   on a
>>>>>      >>>>   >     queue and we don't want them serialized 
>>>>>with anything.       To
>>>>>      >>>>   synchronize
>>>>>      >>>>   >     with submit, we'll have a syncobj in the 
>>>>>VkDevice which
>>>>>      is
>>>>>      >>>>   signaled by
>>>>>      >>>>   >     all immediate bind operations and make 
>>>>>submits wait on
>>>>>      it.
>>>>>      >>>>   >      2. Queued (sparse): These happen on a 
>>>>>VkQueue which may
>>>>>      be the
>>>>>      >>>>   same as
>>>>>      >>>>   >     a render/compute queue or may be its own 
>>>>>queue.  It's up
>>>>>      to us
>>>>>      >>>>   what we
>>>>>      >>>>   >     want to advertise.  From the Vulkan API 
>>>>>PoV, this is like
>>>>>      any
>>>>>      >>>>   other
>>>>>      >>>>   >     queue.  Operations on it wait on and signal 
>>>>>semaphores.       If we
>>>>>      >>>>   have a
>>>>>      >>>>   >     VM_BIND engine, we'd provide syncobjs to wait and
>>>>>      >>>>signal just like
>>>>>      >>>>   we do
>>>>>      >>>>   >     in execbuf().
>>>>>      >>>>   >     The important thing is that we don't want 
>>>>>one type of
>>>>>      >>>>operation to
>>>>>      >>>>   block
>>>>>      >>>>   >     on the other.  If immediate binds are 
>>>>>blocking on sparse
>>>>>      binds,
>>>>>      >>>>   it's
>>>>>      >>>>   >     going to cause over-synchronization issues.
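(A sketch of how those two kinds could map onto two independent bind
queues on one VM so that neither blocks the other; every name and
value below is hypothetical.)

#include <stdint.h>

enum { Q_IMMEDIATE = 0, Q_SPARSE = 1 }; /* two queues on one VM */

/* Invented helper: bind 'bo' at 'va' on the given queue, waiting on
 * 'wait' (-1 for none) and signaling 'sig' when done. */
int vm_bind(uint32_t vm, uint32_t queue, uint32_t bo, uint64_t va,
	    int32_t wait, int32_t sig);

void example(uint32_t vm, uint32_t bo1, uint32_t bo2,
	     int32_t device_syncobj, int32_t wait, int32_t sig)
{
	/* Immediate bind: no in-fence; signals the device-wide
	 * syncobj that all submits wait on. */
	vm_bind(vm, Q_IMMEDIATE, bo1, 0x100000, -1, device_syncobj);

	/* Queued (sparse) bind: waits on and signals the fences the
	 * application provided, like any other queue operation. */
	vm_bind(vm, Q_SPARSE, bo2, 0x200000, wait, sig);
}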
>>>>>      >>>>   >     In terms of the internal implementation, I 
>>>>>know that
>>>>>      >>>>there's going
>>>>>      >>>>   to be
>>>>>      >>>>   >     a lock on the VM and that we can't actually 
>>>>>do these
>>>>>      things in
>>>>>      >>>>   >     parallel.  That's fine. Once the dma_fences have
>>>>>      signaled and
>>>>>      >>>>   we're
>>>>>      >>>>
>>>>>      >>>>   That's correct. It is like a single VM_BIND engine with
>>>>>      >>>>multiple queues
>>>>>      >>>>   feeding to it.
>>>>>      >>>>
>>>>>      >>>> Right.  As long as the queues themselves are 
>>>>>independent and
>>>>>      >>>>can block on
>>>>>      >>>> dma_fences without holding up other queues, I think 
>>>>>we're fine.
>>>>>      >>>>
>>>>>      >>>>   >     unblocked to do the bind operation, I don't care if
>>>>>      >>>>there's a bit
>>>>>      >>>>   of
>>>>>      >>>>   >     synchronization due to locking.  That's 
>>>>>expected.  What
>>>>>      >>>>we can't
>>>>>      >>>>   afford
>>>>>      >>>>   >     to have is an immediate bind operation 
>>>>>suddenly blocking
>>>>>      on a
>>>>>      >>>>   sparse
>>>>>      >>>>   >     operation which is blocked on a compute job 
>>>>>that's going
>>>>>      to run
>>>>>      >>>>   for
>>>>>      >>>>   >     another 5ms.
>>>>>      >>>>
>>>>>      >>>>   As the VM_BIND queue is per VM, VM_BIND on one VM 
>>>>>doesn't block
>>>>>      the
>>>>>      >>>>   VM_BIND
>>>>>      >>>>   on other VMs. I am not sure about usecases here, but just
>>>>>      wanted to
>>>>>      >>>>   clarify.
>>>>>      >>>>
>>>>>      >>>> Yes, that's what I would expect.
>>>>>      >>>> --Jason
>>>>>      >>>>
>>>>>      >>>>   Niranjana
>>>>>      >>>>
>>>>>      >>>>   >     For reference, Windows solves this by allowing
>>>>>      arbitrarily many
>>>>>      >>>>   paging
>>>>>      >>>>   >     queues (what they call a VM_BIND 
>>>>>engine/queue).  That
>>>>>      >>>>design works
>>>>>      >>>>   >     pretty well and solves the problems in 
>>>>>question.  Again, we could
>>>>>      >>>>   just
>>>>>      >>>>   >     make everything out-of-order and require 
>>>>>using syncobjs
>>>>>      >>>>to order
>>>>>      >>>>   things
>>>>>      >>>>   >     as userspace wants. That'd be fine too.
>>>>>      >>>>   >     One more note while I'm here: danvet said 
>>>>>something on
>>>>>      >>>>IRC about
>>>>>      >>>>   VM_BIND
>>>>>      >>>>   >     queues waiting for syncobjs to 
>>>>>materialize.  We don't
>>>>>      really
>>>>>      >>>>   want/need
>>>>>      >>>>   >     this.  We already have all the machinery in 
>>>>>userspace to
>>>>>      handle
>>>>>      >>>>   >     wait-before-signal and waiting for syncobj 
>>>>>fences to
>>>>>      >>>>materialize
>>>>>      >>>>   and
>>>>>      >>>>   >     that machinery is on by default.  It would actually
>>>>>      >>>>take MORE work
>>>>>      >>>>   in
>>>>>      >>>>   >     Mesa to turn it off and take advantage of 
>>>>>the kernel
>>>>>      >>>>being able to
>>>>>      >>>>   wait
>>>>>      >>>>   >     for syncobjs to materialize. Also, getting 
>>>>>that right is
>>>>>      >>>>   ridiculously
>>>>>      >>>>   >     hard and I really don't want to get it 
>>>>>wrong in kernel
>>>>>      >>>>space.  When we
>>>>>      >>>>   >     do memory fences, wait-before-signal will 
>>>>>be a thing.  We
>>>>>      don't
>>>>>      >>>>   need to
>>>>>      >>>>   >     try and make it a thing for syncobj.
>>>>>      >>>>   >     --Jason
>>>>>      >>>>   >
>>>>>      >>>>   >   Thanks Jason,
>>>>>      >>>>   >
>>>>>      >>>>   >   I missed the bit in the Vulkan spec that 
>>>>>we're allowed to
>>>>>      have a
>>>>>      >>>>   sparse
>>>>>      >>>>   >   queue that does not implement either graphics 
>>>>>or compute
>>>>>      >>>>operations
>>>>>      >>>>   :
>>>>>      >>>>   >
>>>>>      >>>>   >     "While some implementations may include
>>>>>      >>>>   VK_QUEUE_SPARSE_BINDING_BIT
>>>>>      >>>>   >     support in queue families that also include
>>>>>      >>>>   >
>>>>>      >>>>   >      graphics and compute support, other 
>>>>>implementations may
>>>>>      only
>>>>>      >>>>   expose a
>>>>>      >>>>   > VK_QUEUE_SPARSE_BINDING_BIT-only queue
>>>>>      >>>>   >
>>>>>      >>>>   >      family."
>>>>>      >>>>   >
>>>>>      >>>>   >   So it can all be a vm_bind engine that just does
>>>>>      bind/unbind
>>>>>      >>>>   >   operations.
>>>>>      >>>>   >
>>>>>      >>>>   >   But yes we need another engine for the 
>>>>>immediate/non-sparse
>>>>>      >>>>   operations.
>>>>>      >>>>   >
>>>>>      >>>>   >   -Lionel
>>>>>      >>>>   >
>>>>>      >>>>   >         >
>>>>>      >>>>   >       Daniel, any thoughts?
>>>>>      >>>>   >
>>>>>      >>>>   >       Niranjana
>>>>>      >>>>   >
>>>>>      >>>>   >       >Matt
>>>>>      >>>>   >       >
>>>>>      >>>>   >       >>
>>>>>      >>>>   >       >> Sorry I noticed this late.
>>>>>      >>>>   >       >>
>>>>>      >>>>   >       >>
>>>>>      >>>>   >       >> -Lionel
>>>>>      >>>>   >       >>
>>>>>      >>>>   >       >>
>>>
>>>
>

^ permalink raw reply	[flat|nested] 121+ messages in thread

* RE: [Intel-gfx] [RFC v3 1/3] drm/doc/rfc: VM_BIND feature design document
  2022-06-10 17:42                                       ` Niranjana Vishwanathapura
@ 2022-06-13 13:33                                         ` Zeng, Oak
  -1 siblings, 0 replies; 121+ messages in thread
From: Zeng, Oak @ 2022-06-13 13:33 UTC (permalink / raw)
  To: Vishwanathapura, Niranjana, Landwerlin, Lionel G
  Cc: Intel GFX, Wilson, Chris P, Hellstrom, Thomas,
	Maling list - DRI developers, Vetter, Daniel,
	Christian König



Regards,
Oak

> -----Original Message-----
> From: Intel-gfx <intel-gfx-bounces@lists.freedesktop.org> On Behalf Of Niranjana
> Vishwanathapura
> Sent: June 10, 2022 1:43 PM
> To: Landwerlin, Lionel G <lionel.g.landwerlin@intel.com>
> Cc: Intel GFX <intel-gfx@lists.freedesktop.org>; Maling list - DRI developers <dri-
> devel@lists.freedesktop.org>; Hellstrom, Thomas <thomas.hellstrom@intel.com>;
> Wilson, Chris P <chris.p.wilson@intel.com>; Vetter, Daniel
> <daniel.vetter@intel.com>; Christian König <christian.koenig@amd.com>
> Subject: Re: [Intel-gfx] [RFC v3 1/3] drm/doc/rfc: VM_BIND feature design
> document
> 
> On Fri, Jun 10, 2022 at 11:18:14AM +0300, Lionel Landwerlin wrote:
> >On 10/06/2022 10:54, Niranjana Vishwanathapura wrote:
> >>On Fri, Jun 10, 2022 at 09:53:24AM +0300, Lionel Landwerlin wrote:
> >>>On 09/06/2022 22:31, Niranjana Vishwanathapura wrote:
> >>>>On Thu, Jun 09, 2022 at 05:49:09PM +0300, Lionel Landwerlin wrote:
> >>>>>  On 09/06/2022 00:55, Jason Ekstrand wrote:
> >>>>>
> >>>>>    On Wed, Jun 8, 2022 at 4:44 PM Niranjana Vishwanathapura
> >>>>>    <niranjana.vishwanathapura@intel.com> wrote:
> >>>>>
> >>>>>      On Wed, Jun 08, 2022 at 08:33:25AM +0100, Tvrtko Ursulin wrote:
> >>>>>      >
> >>>>>      >
> >>>>>      >On 07/06/2022 22:32, Niranjana Vishwanathapura wrote:
> >>>>>      >>On Tue, Jun 07, 2022 at 11:18:11AM -0700, Niranjana
> >>>>>Vishwanathapura
> >>>>>      wrote:
> >>>>>      >>>On Tue, Jun 07, 2022 at 12:12:03PM -0500, Jason
> >>>>>Ekstrand wrote:
> >>>>>      >>>> On Fri, Jun 3, 2022 at 6:52 PM Niranjana Vishwanathapura
> >>>>>      >>>> <niranjana.vishwanathapura@intel.com> wrote:
> >>>>>      >>>>
> >>>>>      >>>>   On Fri, Jun 03, 2022 at 10:20:25AM +0300, Lionel
> >>>>>Landwerlin
> >>>>>      wrote:
> >>>>>      >>>>   >   On 02/06/2022 23:35, Jason Ekstrand wrote:
> >>>>>      >>>>   >
> >>>>>      >>>>   >     On Thu, Jun 2, 2022 at 3:11 PM Niranjana
> >>>>>Vishwanathapura
> >>>>>      >>>>   > <niranjana.vishwanathapura@intel.com> wrote:
> >>>>>      >>>>   >
> >>>>>      >>>>   >       On Wed, Jun 01, 2022 at 01:28:36PM -0700, Matthew
> >>>>>      >>>>Brost wrote:
> >>>>>      >>>>   >       >On Wed, Jun 01, 2022 at 05:25:49PM +0300, Lionel
> >>>>>      Landwerlin
> >>>>>      >>>>   wrote:
> >>>>>      >>>>   >       >> On 17/05/2022 21:32, Niranjana Vishwanathapura
> >>>>>      wrote:
> >>>>>      >>>>   >       >> > +VM_BIND/UNBIND ioctl will immediately start
> >>>>>      >>>>   binding/unbinding
> >>>>>      >>>>   >       the mapping in an
> >>>>>      >>>>   >       >> > +async worker. The binding and
> >>>>>unbinding will
> >>>>>      >>>>work like a
> >>>>>      >>>>   special
> >>>>>      >>>>   >       GPU engine.
> >>>>>      >>>>   >       >> > +The binding and unbinding operations are
> >>>>>      serialized and
> >>>>>      >>>>   will
> >>>>>      >>>>   >       wait on specified
> >>>>>      >>>>   >       >> > +input fences before the operation
> >>>>>and will signal
> >>>>>      the
> >>>>>      >>>>   output
> >>>>>      >>>>   >       fences upon the
> >>>>>      >>>>   >       >> > +completion of the operation. Due to
> >>>>>      serialization,
> >>>>>      >>>>   completion of
> >>>>>      >>>>   >       an operation
> >>>>>      >>>>   >       >> > +will also indicate that all
> >>>>>previous operations
> >>>>>      >>>>are also
> >>>>>      >>>>   >       complete.
> >>>>>      >>>>   >       >>
> >>>>>      >>>>   >       >> I guess we should avoid saying "will
> >>>>>immediately
> >>>>>      start
> >>>>>      >>>>   >       binding/unbinding" if
> >>>>>      >>>>   >       >> there are fences involved.
> >>>>>      >>>>   >       >>
> >>>>>      >>>>   >       >> And the fact that it's happening in an async
> >>>>>      >>>>worker seem to
> >>>>>      >>>>   imply
> >>>>>      >>>>   >       it's not
> >>>>>      >>>>   >       >> immediate.
> >>>>>      >>>>   >       >>
> >>>>>      >>>>   >
> >>>>>      >>>>   >       Ok, will fix.
> >>>>>      >>>>   >       This was added because in earlier design
> >>>>>binding was
> >>>>>      deferred
> >>>>>      >>>>   until
> >>>>>      >>>>   >       next execbuff.
> >>>>>      >>>>   >       But now it is non-deferred (immediate in
> >>>>>that sense).
> >>>>>      >>>>But yah,
> >>>>>      >>>>   this is
> >>>>>      >>>>   >       confusing
> >>>>>      >>>>   >       and will fix it.
> >>>>>      >>>>   >
> >>>>>      >>>>   >       >>
> >>>>>      >>>>   >       >> I have a question on the behavior of the bind
> >>>>>      >>>>operation when
> >>>>>      >>>>   no
> >>>>>      >>>>   >       input fence
> >>>>>      >>>>   >       >> is provided. Let say I do :
> >>>>>      >>>>   >       >>
> >>>>>      >>>>   >       >> VM_BIND (out_fence=fence1)
> >>>>>      >>>>   >       >>
> >>>>>      >>>>   >       >> VM_BIND (out_fence=fence2)
> >>>>>      >>>>   >       >>
> >>>>>      >>>>   >       >> VM_BIND (out_fence=fence3)
> >>>>>      >>>>   >       >>
> >>>>>      >>>>   >       >>
> >>>>>      >>>>   >       >> In what order are the fences going to
> >>>>>be signaled?
> >>>>>      >>>>   >       >>
> >>>>>      >>>>   >       >> In the order of VM_BIND ioctls? Or out
> >>>>>of order?
> >>>>>      >>>>   >       >>
> >>>>>      >>>>   >       >> Because you wrote "serialized" I assume
> >>>>>it's: in
> >>>>>      order
> >>>>>      >>>>   >       >>
> >>>>>      >>>>   >
> >>>>>      >>>>   >       Yes, in the order of VM_BIND/UNBIND
> >>>>>ioctls. Note that
> >>>>>      >>>>bind and
> >>>>>      >>>>   unbind
> >>>>>      >>>>   >       will use
> >>>>>      >>>>   >       the same queue and hence are ordered.
> >>>>>      >>>>   >
> >>>>>      >>>>   >       >>
> >>>>>      >>>>   >       >> One thing I didn't realize is that
> >>>>>because we only
> >>>>>      get one
> >>>>>      >>>>   >       "VM_BIND" engine,
> >>>>>      >>>>   >       >> there is a disconnect from the Vulkan
> >>>>>specification.
> >>>>>      >>>>   >       >>
> >>>>>      >>>>   >       >> In Vulkan VM_BIND operations are
> >>>>>serialized but
> >>>>>      >>>>per engine.
> >>>>>      >>>>   >       >>
> >>>>>      >>>>   >       >> So you could have something like this :
> >>>>>      >>>>   >       >>
> >>>>>      >>>>   >       >> VM_BIND (engine=rcs0, in_fence=fence1,
> >>>>>      out_fence=fence2)
> >>>>>      >>>>   >       >>
> >>>>>      >>>>   >       >> VM_BIND (engine=ccs0, in_fence=fence3,
> >>>>>      out_fence=fence4)
> >>>>>      >>>>   >       >>
> >>>>>      >>>>   >       >>
> >>>>>      >>>>   >       >> fence1 is not signaled
> >>>>>      >>>>   >       >>
> >>>>>      >>>>   >       >> fence3 is signaled
> >>>>>      >>>>   >       >>
> >>>>>      >>>>   >       >> So the second VM_BIND will proceed before the
> >>>>>      >>>>first VM_BIND.
> >>>>>      >>>>   >       >>
> >>>>>      >>>>   >       >>
> >>>>>      >>>>   >       >> I guess we can deal with that scenario in
> >>>>>      >>>>userspace by doing
> >>>>>      >>>>   the
> >>>>>      >>>>   >       wait
> >>>>>      >>>>   >       >> ourselves in one thread per engines.
> >>>>>      >>>>   >       >>
> >>>>>      >>>>   >       >> But then it makes the VM_BIND input
> >>>>>fences useless.
> >>>>>      >>>>   >       >>
> >>>>>      >>>>   >       >>
> >>>>>      >>>>   >       >> Daniel : what do you think? Should be
> >>>>>rework this or
> >>>>>      just
> >>>>>      >>>>   deal with
> >>>>>      >>>>   >       wait
> >>>>>      >>>>   >       >> fences in userspace?
> >>>>>      >>>>   >       >>
> >>>>>      >>>>   >       >
> >>>>>      >>>>   >       >My opinion is rework this but make the
> >>>>>ordering via
> >>>>>      >>>>an engine
> >>>>>      >>>>   param
> >>>>>      >>>>   >       optional.
> >>>>>      >>>>   >       >
> >>>>>      >>>>   >       >e.g. A VM can be configured so all binds
> >>>>>are ordered
> >>>>>      >>>>within the
> >>>>>      >>>>   VM
> >>>>>      >>>>   >       >
> >>>>>      >>>>   >       >e.g. A VM can be configured so all binds
> >>>>>accept an
> >>>>>      engine
> >>>>>      >>>>   argument
> >>>>>      >>>>   >       (in
> >>>>>      >>>>   >       >the case of the i915 likely this is a
> >>>>>gem context
> >>>>>      >>>>handle) and
> >>>>>      >>>>   binds
> >>>>>      >>>>   >       >ordered with respect to that engine.
> >>>>>      >>>>   >       >
> >>>>>      >>>>   >       >This gives UMDs options as the later
> >>>>>likely consumes
> >>>>>      >>>>more KMD
> >>>>>      >>>>   >       resources
> >>>>>      >>>>   >       >so if a different UMD can live with binds being
> >>>>>      >>>>ordered within
> >>>>>      >>>>   the VM
> >>>>>      >>>>   >       >they can use a mode consuming less resources.
> >>>>>      >>>>   >       >
> >>>>>      >>>>   >
> >>>>>      >>>>   >       I think we need to be careful here if we
> >>>>>are looking
> >>>>>      for some
> >>>>>      >>>>   out of
> >>>>>      >>>>   >       (submission) order completion of vm_bind/unbind.
> >>>>>      >>>>   >       In-order completion means, in a batch of
> >>>>>binds and
> >>>>>      >>>>unbinds to be
> >>>>>      >>>>   >       completed in-order, user only needs to specify
> >>>>>      >>>>in-fence for the
> >>>>>      >>>>   >       first bind/unbind call and the out-fence
> >>>>>for the last
> >>>>>      >>>>   bind/unbind
> >>>>>      >>>>   >       call. Also, the VA released by an unbind
> >>>>>call can be
> >>>>>      >>>>re-used by
> >>>>>      >>>>   >       any subsequent bind call in that in-order batch.
> >>>>>      >>>>   >
> >>>>>      >>>>   >       These things will break if
> >>>>>binding/unbinding were to
> >>>>>      >>>>be allowed
> >>>>>      >>>>   to
> >>>>>      >>>>   >       go out of order (of submission) and users
> >>>>>need to be
> >>>>>      extra
> >>>>>      >>>>   careful
> >>>>>      >>>>   >       not to run into premature triggering of
> >>>>>out-fence and
> >>>>>      bind
> >>>>>      >>>>   failing
> >>>>>      >>>>   >       as VA is still in use etc.
> >>>>>      >>>>   >
> >>>>>      >>>>   >       Also, VM_BIND binds the provided mapping on the
> >>>>>      specified
> >>>>>      >>>>   address
> >>>>>      >>>>   >       space
> >>>>>      >>>>   >       (VM). So, the uapi is not engine/context
> >>>>>specific.
> >>>>>      >>>>   >
> >>>>>      >>>>   >       We can however add a 'queue' to the uapi
> >>>>>which can be
> >>>>>      >>>>one from
> >>>>>      >>>>   the
> >>>>>      >>>>   >       pre-defined queues,
> >>>>>      >>>>   >       I915_VM_BIND_QUEUE_0
> >>>>>      >>>>   >       I915_VM_BIND_QUEUE_1
> >>>>>      >>>>   >       ...
> >>>>>      >>>>   >       I915_VM_BIND_QUEUE_(N-1)
> >>>>>      >>>>   >
> >>>>>      >>>>   >       KMD will spawn an async work queue for
> >>>>>each queue which
> >>>>>      will
> >>>>>      >>>>   only
> >>>>>      >>>>   >       bind the mappings on that queue in the order of
> >>>>>      submission.
> >>>>>      >>>>   >       User can assign the queue to per engine
> >>>>>or anything
> >>>>>      >>>>like that.
> >>>>>      >>>>   >
> >>>>>      >>>>   >       But again here, users need to be careful not to
> >>>>>      >>>>deadlock these
> >>>>>      >>>>   >       queues with circular dependencies of fences.
> >>>>>      >>>>   >
> >>>>>      >>>>   >       I prefer adding this later as an
> >>>>>extension based on
> >>>>>      >>>>whether it
> >>>>>      >>>>   >       is really helping with the implementation.
> >>>>>      >>>>   >
> >>>>>      >>>>   >     I can tell you right now that having
> >>>>>everything on a
> >>>>>      single
> >>>>>      >>>>   in-order
> >>>>>      >>>>   >     queue will not get us the perf we want.
> >>>>>What vulkan
> >>>>>      >>>>really wants
> >>>>>      >>>>   is one
> >>>>>      >>>>   >     of two things:
> >>>>>      >>>>   >      1. No implicit ordering of VM_BIND ops.  They just
> >>>>>      happen in
> >>>>>      >>>>   whatever
> >>>>>      >>>>   >     their dependencies are resolved and we
> >>>>>ensure ordering
> >>>>>      >>>>ourselves
> >>>>>      >>>>   by
> >>>>>      >>>>   >     having a syncobj in the VkQueue.
> >>>>>      >>>>   >      2. The ability to create multiple VM_BIND
> >>>>>queues.  We
> >>>>>      need at
> >>>>>      >>>>   least 2
> >>>>>      >>>>   >     but I don't see why there needs to be a
> >>>>>limit besides
> >>>>>      >>>>the limits
> >>>>>      >>>>   the
> >>>>>      >>>>   >     i915 API already has on the number of
> >>>>>engines.  Vulkan
> >>>>>      could
> >>>>>      >>>>   expose
> >>>>>      >>>>   >     multiple sparse binding queues to the
> >>>>>client if it's not
> >>>>>      >>>>   arbitrarily
> >>>>>      >>>>   >     limited.
> >>>>>      >>>>
> >>>>>      >>>>   Thanks Jason, Lionel.
> >>>>>      >>>>
> >>>>>      >>>>   Jason, what are you referring to when you say
> >>>>>"limits the i915
> >>>>>      API
> >>>>>      >>>>   already
> >>>>>      >>>>   has on the number of engines"? I am not sure if
> >>>>>there is such
> >>>>>      an uapi
> >>>>>      >>>>   today.
> >>>>>      >>>>
> >>>>>      >>>> There's a limit of something like 64 total engines
> >>>>>today based on
> >>>>>      the
> >>>>>      >>>> number of bits we can cram into the exec flags in
> >>>>>execbuffer2.  I
> >>>>>      think
> >>>>>      >>>> someone had an extended version that allowed more
> >>>>>but I ripped it
> >>>>>      out
> >>>>>      >>>> because no one was using it.  Of course,
> >>>>>execbuffer3 might not
> >>>>>      >>>>have that
> >>>>>      >>>> problem at all.
> >>>>>      >>>>
> >>>>>      >>>
> >>>>>      >>>Thanks Jason.
> >>>>>      >>>Ok, I am not sure which exec flag is that, but yah,
> >>>>>execbuffer3
> >>>>>      probably
> >>>>>      >>>will not have this limitation. So, we need to define a
> >>>>>      VM_BIND_MAX_QUEUE
> >>>>>      >>>and somehow export it to user (I am thinking of
> >>>>>embedding it in
> >>>>>      >>>I915_PARAM_HAS_VM_BIND. bits[0]->HAS_VM_BIND, bits[1-3]->'n'
> >>>>>      meaning 2^n
> >>>>>      >>>queues).
> >>>>>      >>
> >>>>>      >>Ah, I think you are talking about I915_EXEC_RING_MASK
> >>>>>(0x3f) which
> >>>>>      execbuf3
> >>>>>
> >>>>>    Yup!  That's exactly the limit I was talking about.
> >>>>>
> >>>>>      >>will also have. So, we can simply define in vm_bind/unbind
> >>>>>      structures,
> >>>>>      >>
> >>>>>      >>#define I915_VM_BIND_MAX_QUEUE   64
> >>>>>      >>        __u32 queue;
> >>>>>      >>
> >>>>>      >>I think that will keep things simple.
> >>>>>      >
> >>>>>      >Hmmm? What does the execbuf2 limit have to do with how many engines
> >>>>>      >hardware can have? I suggest not to do that.
> >>>>>      >
> >>>>>      >The change which added this:
> >>>>>      >
> >>>>>      >       if (set.num_engines > I915_EXEC_RING_MASK + 1)
> >>>>>      >               return -EINVAL;
> >>>>>      >
> >>>>>      >to context creation needs to be undone, so let users
> >>>>>create engine
> >>>>>      >maps with all hardware engines, and let execbuf3 access
> >>>>>them all.
> >>>>>      >
> >>>>>
> >>>>>      The earlier plan was to carry I915_EXEC_RING_MASK (0x3f) to
> >>>>>execbuff3 also.
> >>>>>      Hence, I was using the same limit for VM_BIND queues
> >>>>>(64, or 65 if we
> >>>>>      make it N+1).
> >>>>>      But, as discussed in another thread of this RFC series, we
> >>>>>are planning
> >>>>>      to drop this I915_EXEC_RING_MASK in execbuff3. So, there won't be
> >>>>>      any uapi that limits the number of engines (and hence no limit on
> >>>>>      the number of vm_bind queues that need to be supported).
> >>>>>
> >>>>>      If we leave the number of vm_bind queues arbitrarily large
> >>>>>      (__u32 queue_idx), then we need a hashmap for queue lookup
> >>>>>      (a wq, work_item and a linked list) from the user-specified
> >>>>>      queue index.
> >>>>>      The other option is to just put some hard limit (say 64 or 65)
> >>>>>      and use an array of queues in the VM (each created upon first
> >>>>>      use). I prefer this.
> >>>>>
> >>>>>    I don't get why a VM_BIND queue is any different from any
> >>>>>other queue or
> >>>>>    userspace-visible kernel object.  But I'll leave those
> >>>>>details up to
> >>>>>    danvet or whoever else might be reviewing the implementation.
> >>>>>    --Jason
> >>>>>
> >>>>>  I kind of agree here. Wouldn't it be simpler to have the bind
> >>>>>queue created
> >>>>>  like the others when we build the engine map?
> >>>>>
> >>>>>  For userspace it's then just a matter of selecting the right
> >>>>>queue ID when
> >>>>>  submitting.
> >>>>>
> >>>>>  If there is ever a possibility to have this work on the GPU,
> >>>>>it would be
> >>>>>  all ready.
> >>>>>
> >>>>
> >>>>I did sync offline with Matt Brost on this.
> >>>>We can add a VM_BIND engine class and let the user create VM_BIND
> >>>>engines (queues).
> >>>>The problem is, in i915 the engine creation interface is bound to
> >>>>gem_context.
> >>>>So, in the vm_bind ioctl, we would need both a context_id and a
> >>>>queue_idx for proper lookup of the user-created engine. This is a
> >>>>bit awkward, as vm_bind is an interface to the VM (address space)
> >>>>and has nothing to do with gem_context.
> >>>
> >>>
> >>>A gem_context has a single vm object right?
> >>>
> >>>Set through I915_CONTEXT_PARAM_VM at creation or given a default
> >>>one if not.
> >>>
> >>>So it's just like picking up the vm like it's done at execbuffer
> >>>time right now: eb->context->vm
> >>>
> >>
> >>Are you suggesting replacing 'vm_id' with 'context_id' in the
> >>VM_BIND/UNBIND
> >>ioctl and probably calling it CONTEXT_BIND/UNBIND, because the VM can be
> >>obtained
> >>from the context?
> >
> >
> >Yes, because if we go for engines, they're associated with a context
> >and so also associated with the VM bound to the context.
> >
> 
> Hmm...context doesn't sound like the right interface. It should be
> VM and engine (independent of context). An engine can be a virtual or
> soft engine (a kernel thread), each with its own queue. We can add an
> interface to create such engines (independent of context). But we are
> anyway implicitly creating one when the user uses a new queue_idx. If
> in the future we have hardware engines for the VM_BIND operation, we
> can have an explicit interface to create engine instances, and the
> queue_index in vm_bind/unbind will point to those engines.
> Anyone have any thoughts? Daniel?

Exposing gem_context or intel_context to user space is a strange concept to me. A context represents the hw resources used to complete a certain task. User space should only care about allocating resources (memory, queues) and submitting tasks to queues; it doesn't care how a certain task is mapped to a HW context - driver/guc should take care of this.

So a cleaner interface to me is: user space creates a vm, creates a gem object and vm_binds it to the vm; allocates queues for this vm (internally these represent compute or blitter HW; a queue can be virtual to the user); and submits tasks to the queues. The user can create multiple queues under one vm, but each queue is for exactly one vm.
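(A sketch of that flow; every call below is an invented wrapper, named
only to show the object model: a vm, objects bound into it, queues
owned by it, and tasks submitted to those queues.)

#include <stdint.h>

/* Invented wrappers, not real uapi. */
uint32_t create_vm(int fd);
uint32_t gem_create(int fd, uint64_t size);
void vm_bind(int fd, uint32_t vm, uint32_t bo, uint64_t va, uint64_t size);
uint32_t create_queue(int fd, uint32_t vm);
void submit(int fd, uint32_t q, void *batch);

static void sketch_flow(int fd, uint64_t size, uint64_t va,
			void *batch_a, void *batch_b)
{
	uint32_t vm = create_vm(fd);
	uint32_t bo = gem_create(fd, size);

	vm_bind(fd, vm, bo, va, size);      /* no engine/queue parameter */

	uint32_t q0 = create_queue(fd, vm); /* a queue belongs to exactly */
	uint32_t q1 = create_queue(fd, vm); /* one vm; a vm may own many  */

	submit(fd, q0, batch_a); /* i915/guc picks a HW engine and switches
				  * to vm's page tables at schedule time */
	submit(fd, q1, batch_b);
}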

The i915 driver/guc manages the hw compute or blitter resources, which are transparent to user space. When i915 or guc decides to schedule a queue (run the tasks on that queue), a HW engine will be picked up and set up properly for the vm of that queue (i.e., switched to the page tables of that vm) - this is a context switch.

From the vm_bind perspective, it simply binds a gem_object to a vm. Engine/queue is not a parameter to vm_bind, as any engine can be picked up by i915/guc to execute a task using the vm-bound va.

I didn't completely follow the discussion here. Just sharing some thoughts.

Regards,
Oak

> 
> Niranjana
> 
> >
> >>I think the interface is clean as an interface to the VM. It is only
> >>that we don't have a clean way to create a raw VM_BIND engine (not
> >>associated with any context) with the i915 uapi.
> >>Maybe we can add such an interface, but I don't think it is worth it
> >>(we might as well just use a queue_idx in the VM_BIND/UNBIND ioctl as
> >>I mentioned above).
> >>Anyone has any thoughts?
> >>
> >>>
> >>>>Another problem is, if two VMs are binding with the same defined
> >>>>engine,
> >>>>binding on VM1 can get unnecessarily blocked by binding on VM2
> >>>>(which may be
> >>>>waiting on its in_fence).
> >>>
> >>>
> >>>Maybe I'm missing something, but how can you have 2 vm objects
> >>>with a single gem_context right now?
> >>>
> >>
> >>No, we don't have 2 VMs for a gem_context.
> >>Say ctx1 has vm1 and ctx2 has vm2.
> >>The first vm_bind call is for vm1 with q_idx 1 in the ctx1 engine map.
> >>The second vm_bind call is for vm2 with q_idx 2 in the ctx2 engine
> >>map. If those two queue indices point to the same underlying vm_bind
> >>engine, then the second vm_bind call gets blocked until the first
> >>vm_bind call's 'in' fence is triggered and its bind completes.
> >>
> >>With per-VM queues, this is not a problem, as two VMs will never end
> >>up sharing the same queue.
> >>
> >>BTW, I just posted an updated PATCH series.
> >>https://www.spinics.net/lists/dri-devel/msg350483.html
> >>
> >>Niranjana
> >>
> >>>
> >>>>
> >>>>So, my preference here is to just add a 'u32 queue' index in
> >>>>vm_bind/unbind
> >>>>ioctl, and the queues are per VM.
> >>>>
> >>>>Niranjana
> >>>>
> >>>>>  Thanks,
> >>>>>
> >>>>>  -Lionel
> >>>>>
> >>>>>
> >>>>>      Niranjana
> >>>>>
> >>>>>      >Regards,
> >>>>>      >
> >>>>>      >Tvrtko
> >>>>>      >
> >>>>>      >>
> >>>>>      >>Niranjana
> >>>>>      >>
> >>>>>      >>>
> >>>>>      >>>>   I am trying to see how many queues we need and
> >>>>>don't want it to
> >>>>>      be
> >>>>>      >>>>   arbitrarily
> >>>>>large and unduly blow up memory usage and
> >>>>>complexity in i915
> >>>>>      driver.
> >>>>>      >>>>
> >>>>>      >>>> I expect a Vulkan driver to use at most 2 in the
> >>>>>vast majority
> >>>>>      >>>>of cases. I
> >>>>>      >>>> could imagine a client wanting to create more than 1 sparse
> >>>>>      >>>>queue in which
> >>>>>      >>>> case, it'll be N+1 but that's unlikely. As far as
> >>>>>complexity
> >>>>>      >>>>goes, once
> >>>>>      >>>> you allow two, I don't think the complexity is going up by
> >>>>>      >>>>allowing N.  As
> >>>>>      >>>> for memory usage, creating more queues means more
> >>>>>memory.  That's
> >>>>>      a
> >>>>>      >>>> trade-off that userspace can make. Again, the
> >>>>>expected number
> >>>>>      >>>>here is 1
> >>>>>      >>>> or 2 in the vast majority of cases so I don't think
> >>>>>you need to
> >>>>>      worry.
> >>>>>      >>>
> >>>>>      >>>Ok, will start with n=3 meaning 8 queues.
> >>>>>      >>>That would require us to create 8 workqueues.
> >>>>>      >>>We can change 'n' later if required.
> >>>>>      >>>
> >>>>>      >>>Niranjana
> >>>>>      >>>
> >>>>>      >>>>
> >>>>>      >>>>   >     Why?  Because Vulkan has two basic kinds of bind
> >>>>>      >>>>operations and we
> >>>>>      >>>>   don't
> >>>>>      >>>>   >     want any dependencies between them:
> >>>>>      >>>>   >      1. Immediate.  These happen right after BO
> >>>>>creation or
> >>>>>      >>>>maybe as
> >>>>>      >>>>   part of
> >>>>>      >>>>   >     vkBindImageMemory() or VkBindBufferMemory().  These
> >>>>>      >>>>don't happen
> >>>>>      >>>>   on a
> >>>>>      >>>>   >     queue and we don't want them serialized
> >>>>>with anything.       To
> >>>>>      >>>>   synchronize
> >>>>>      >>>>   >     with submit, we'll have a syncobj in the
> >>>>>VkDevice which
> >>>>>      is
> >>>>>      >>>>   signaled by
> >>>>>      >>>>   >     all immediate bind operations and make
> >>>>>submits wait on
> >>>>>      it.
> >>>>>      >>>>   >      2. Queued (sparse): These happen on a
> >>>>>VkQueue which may
> >>>>>      be the
> >>>>>      >>>>   same as
> >>>>>      >>>>   >     a render/compute queue or may be its own
> >>>>>queue.  It's up
> >>>>>      to us
> >>>>>      >>>>   what we
> >>>>>      >>>>   >     want to advertise.  From the Vulkan API
> >>>>>PoV, this is like
> >>>>>      any
> >>>>>      >>>>   other
> >>>>>      >>>>   >     queue.  Operations on it wait on and signal
> >>>>>semaphores.       If we
> >>>>>      >>>>   have a
> >>>>>      >>>>   >     VM_BIND engine, we'd provide syncobjs to wait and
> >>>>>      >>>>signal just like
> >>>>>      >>>>   we do
> >>>>>      >>>>   >     in execbuf().
> >>>>>      >>>>   >     The important thing is that we don't want
> >>>>>one type of
> >>>>>      >>>>operation to
> >>>>>      >>>>   block
> >>>>>      >>>>   >     on the other.  If immediate binds are
> >>>>>blocking on sparse
> >>>>>      binds,
> >>>>>      >>>>   it's
> >>>>>      >>>>   >     going to cause over-synchronization issues.
> >>>>>      >>>>   >     In terms of the internal implementation, I
> >>>>>know that
> >>>>>      >>>>there's going
> >>>>>      >>>>   to be
> >>>>>      >>>>   >     a lock on the VM and that we can't actually
> >>>>>do these
> >>>>>      things in
> >>>>>      >>>>   >     parallel.  That's fine. Once the dma_fences have
> >>>>>      signaled and
> >>>>>      >>>>   we're
> >>>>>      >>>>
> >>>>>      >>>>   That's correct. It is like a single VM_BIND engine with
> >>>>>      >>>>multiple queues
> >>>>>      >>>>   feeding to it.
> >>>>>      >>>>
> >>>>>      >>>> Right.  As long as the queues themselves are
> >>>>>independent and
> >>>>>      >>>>can block on
> >>>>>      >>>> dma_fences without holding up other queues, I think
> >>>>>we're fine.
> >>>>>      >>>>
> >>>>>      >>>>   >     unblocked to do the bind operation, I don't care if
> >>>>>      >>>>there's a bit
> >>>>>      >>>>   of
> >>>>>      >>>>   >     synchronization due to locking.  That's
> >>>>>expected.  What
> >>>>>      >>>>we can't
> >>>>>      >>>>   afford
> >>>>>      >>>>   >     to have is an immediate bind operation
> >>>>>suddenly blocking
> >>>>>      on a
> >>>>>      >>>>   sparse
> >>>>>      >>>>   >     operation which is blocked on a compute job
> >>>>>that's going
> >>>>>      to run
> >>>>>      >>>>   for
> >>>>>      >>>>   >     another 5ms.
> >>>>>      >>>>
> >>>>>      >>>>   As the VM_BIND queue is per VM, VM_BIND on one VM
> >>>>>doesn't block
> >>>>>      the
> >>>>>      >>>>   VM_BIND
> >>>>>      >>>>   on other VMs. I am not sure about usecases here, but just
> >>>>>      wanted to
> >>>>>      >>>>   clarify.
> >>>>>      >>>>
> >>>>>      >>>> Yes, that's what I would expect.
> >>>>>      >>>> --Jason
> >>>>>      >>>>
> >>>>>      >>>>   Niranjana
> >>>>>      >>>>
> >>>>>      >>>>   >     For reference, Windows solves this by allowing
> >>>>>      arbitrarily many
> >>>>>      >>>>   paging
> >>>>>      >>>>   >     queues (what they call a VM_BIND
> >>>>>engine/queue).  That
> >>>>>      >>>>design works
> >>>>>      >>>>   >     pretty well and solves the problems in
> >>>>>question.  Again, we could
> >>>>>      >>>>   just
> >>>>>      >>>>   >     make everything out-of-order and require
> >>>>>using syncobjs
> >>>>>      >>>>to order
> >>>>>      >>>>   things
> >>>>>      >>>>   >     as userspace wants. That'd be fine too.
> >>>>>      >>>>   >     One more note while I'm here: danvet said
> >>>>>something on
> >>>>>      >>>>IRC about
> >>>>>      >>>>   VM_BIND
> >>>>>      >>>>   >     queues waiting for syncobjs to
> >>>>>materialize.  We don't
> >>>>>      really
> >>>>>      >>>>   want/need
> >>>>>      >>>>   >     this.  We already have all the machinery in
> >>>>>userspace to
> >>>>>      handle
> >>>>>      >>>>   >     wait-before-signal and waiting for syncobj
> >>>>>fences to
> >>>>>      >>>>materialize
> >>>>>      >>>>   and
> >>>>>      >>>>   >     that machinery is on by default.  It would actually
> >>>>>      >>>>take MORE work
> >>>>>      >>>>   in
> >>>>>      >>>>   >     Mesa to turn it off and take advantage of
> >>>>>the kernel
> >>>>>      >>>>being able to
> >>>>>      >>>>   wait
> >>>>>      >>>>   >     for syncobjs to materialize. Also, getting
> >>>>>that right is
> >>>>>      >>>>   ridiculously
> >>>>>      >>>>   >     hard and I really don't want to get it
> >>>>>wrong in kernel
> >>>>>      >>>>space.  When we
> >>>>>      >>>>   >     do memory fences, wait-before-signal will
> >>>>>be a thing.  We
> >>>>>      don't
> >>>>>      >>>>   need to
> >>>>>      >>>>   >     try and make it a thing for syncobj.
> >>>>>      >>>>   >     --Jason
> >>>>>      >>>>   >
> >>>>>      >>>>   >   Thanks Jason,
> >>>>>      >>>>   >
> >>>>>      >>>>   >   I missed the bit in the Vulkan spec that
> >>>>>we're allowed to
> >>>>>      have a
> >>>>>      >>>>   sparse
> >>>>>      >>>>   >   queue that does not implement either graphics
> >>>>>or compute
> >>>>>      >>>>operations
> >>>>>      >>>>   :
> >>>>>      >>>>   >
> >>>>>      >>>>   >     "While some implementations may include
> >>>>>      >>>>   VK_QUEUE_SPARSE_BINDING_BIT
> >>>>>      >>>>   >     support in queue families that also include
> >>>>>      >>>>   >
> >>>>>      >>>>   >      graphics and compute support, other
> >>>>>implementations may
> >>>>>      only
> >>>>>      >>>>   expose a
> >>>>>      >>>>   > VK_QUEUE_SPARSE_BINDING_BIT-only queue
> >>>>>      >>>>   >
> >>>>>      >>>>   >      family."
> >>>>>      >>>>   >
> >>>>>      >>>>   >   So it can all be a vm_bind engine that just does
> >>>>>      bind/unbind
> >>>>>      >>>>   >   operations.
> >>>>>      >>>>   >
> >>>>>      >>>>   >   But yes we need another engine for the
> >>>>>immediate/non-sparse
> >>>>>      >>>>   operations.
> >>>>>      >>>>   >
> >>>>>      >>>>   >   -Lionel
> >>>>>      >>>>   >
> >>>>>      >>>>   >         >
> >>>>>      >>>>   >       Daniel, any thoughts?
> >>>>>      >>>>   >
> >>>>>      >>>>   >       Niranjana
> >>>>>      >>>>   >
> >>>>>      >>>>   >       >Matt
> >>>>>      >>>>   >       >
> >>>>>      >>>>   >       >>
> >>>>>      >>>>   >       >> Sorry I noticed this late.
> >>>>>      >>>>   >       >>
> >>>>>      >>>>   >       >>
> >>>>>      >>>>   >       >> -Lionel
> >>>>>      >>>>   >       >>
> >>>>>      >>>>   >       >>
> >>>
> >>>
> >

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [Intel-gfx] [RFC v3 1/3] drm/doc/rfc: VM_BIND feature design document
@ 2022-06-13 13:33                                         ` Zeng, Oak
  0 siblings, 0 replies; 121+ messages in thread
From: Zeng, Oak @ 2022-06-13 13:33 UTC (permalink / raw)
  To: Vishwanathapura, Niranjana, Landwerlin, Lionel G
  Cc: Intel GFX, Wilson, Chris P, Hellstrom, Thomas,
	Maling list - DRI developers, Vetter, Daniel,
	Christian König



Regards,
Oak

> -----Original Message-----
> From: Intel-gfx <intel-gfx-bounces@lists.freedesktop.org> On Behalf Of Niranjana
> Vishwanathapura
> Sent: June 10, 2022 1:43 PM
> To: Landwerlin, Lionel G <lionel.g.landwerlin@intel.com>
> Cc: Intel GFX <intel-gfx@lists.freedesktop.org>; Maling list - DRI developers <dri-
> devel@lists.freedesktop.org>; Hellstrom, Thomas <thomas.hellstrom@intel.com>;
> Wilson, Chris P <chris.p.wilson@intel.com>; Vetter, Daniel
> <daniel.vetter@intel.com>; Christian König <christian.koenig@amd.com>
> Subject: Re: [Intel-gfx] [RFC v3 1/3] drm/doc/rfc: VM_BIND feature design
> document
> 
> On Fri, Jun 10, 2022 at 11:18:14AM +0300, Lionel Landwerlin wrote:
> >On 10/06/2022 10:54, Niranjana Vishwanathapura wrote:
> >>On Fri, Jun 10, 2022 at 09:53:24AM +0300, Lionel Landwerlin wrote:
> >>>On 09/06/2022 22:31, Niranjana Vishwanathapura wrote:
> >>>>On Thu, Jun 09, 2022 at 05:49:09PM +0300, Lionel Landwerlin wrote:
> >>>>>  On 09/06/2022 00:55, Jason Ekstrand wrote:
> >>>>>
> >>>>>    On Wed, Jun 8, 2022 at 4:44 PM Niranjana Vishwanathapura
> >>>>>    <niranjana.vishwanathapura@intel.com> wrote:
> >>>>>
> >>>>>      On Wed, Jun 08, 2022 at 08:33:25AM +0100, Tvrtko Ursulin wrote:
> >>>>>      >
> >>>>>      >
> >>>>>      >On 07/06/2022 22:32, Niranjana Vishwanathapura wrote:
> >>>>>      >>On Tue, Jun 07, 2022 at 11:18:11AM -0700, Niranjana
> >>>>>Vishwanathapura
> >>>>>      wrote:
> >>>>>      >>>On Tue, Jun 07, 2022 at 12:12:03PM -0500, Jason
> >>>>>Ekstrand wrote:
> >>>>>      >>>> On Fri, Jun 3, 2022 at 6:52 PM Niranjana Vishwanathapura
> >>>>>      >>>> <niranjana.vishwanathapura@intel.com> wrote:
> >>>>>      >>>>
> >>>>>      >>>>   On Fri, Jun 03, 2022 at 10:20:25AM +0300, Lionel
> >>>>>Landwerlin
> >>>>>      wrote:
> >>>>>      >>>>   >   On 02/06/2022 23:35, Jason Ekstrand wrote:
> >>>>>      >>>>   >
> >>>>>      >>>>   >     On Thu, Jun 2, 2022 at 3:11 PM Niranjana
> >>>>>Vishwanathapura
> >>>>>      >>>>   > <niranjana.vishwanathapura@intel.com> wrote:
> >>>>>      >>>>   >
> >>>>>      >>>>   >       On Wed, Jun 01, 2022 at 01:28:36PM -0700, Matthew
> >>>>>      >>>>Brost wrote:
> >>>>>      >>>>   >       >On Wed, Jun 01, 2022 at 05:25:49PM +0300, Lionel
> >>>>>      Landwerlin
> >>>>>      >>>>   wrote:
> >>>>>      >>>>   >       >> On 17/05/2022 21:32, Niranjana Vishwanathapura
> >>>>>      wrote:
> >>>>>      >>>>   >       >> > +VM_BIND/UNBIND ioctl will immediately start
> >>>>>      >>>>   binding/unbinding
> >>>>>      >>>>   >       the mapping in an
> >>>>>      >>>>   >       >> > +async worker. The binding and
> >>>>>unbinding will
> >>>>>      >>>>work like a
> >>>>>      >>>>   special
> >>>>>      >>>>   >       GPU engine.
> >>>>>      >>>>   >       >> > +The binding and unbinding operations are
> >>>>>      serialized and
> >>>>>      >>>>   will
> >>>>>      >>>>   >       wait on specified
> >>>>>      >>>>   >       >> > +input fences before the operation
> >>>>>and will signal
> >>>>>      the
> >>>>>      >>>>   output
> >>>>>      >>>>   >       fences upon the
> >>>>>      >>>>   >       >> > +completion of the operation. Due to
> >>>>>      serialization,
> >>>>>      >>>>   completion of
> >>>>>      >>>>   >       an operation
> >>>>>      >>>>   >       >> > +will also indicate that all
> >>>>>previous operations
> >>>>>      >>>>are also
> >>>>>      >>>>   >       complete.
> >>>>>      >>>>   >       >>
> >>>>>      >>>>   >       >> I guess we should avoid saying "will
> >>>>>immediately
> >>>>>      start
> >>>>>      >>>>   >       binding/unbinding" if
> >>>>>      >>>>   >       >> there are fences involved.
> >>>>>      >>>>   >       >>
> >>>>>      >>>>   >       >> And the fact that it's happening in an async
> >>>>>      >>>>worker seem to
> >>>>>      >>>>   imply
> >>>>>      >>>>   >       it's not
> >>>>>      >>>>   >       >> immediate.
> >>>>>      >>>>   >       >>
> >>>>>      >>>>   >
> >>>>>      >>>>   >       Ok, will fix.
> >>>>>      >>>>   >       This was added because in earlier design
> >>>>>binding was
> >>>>>      deferred
> >>>>>      >>>>   until
> >>>>>      >>>>   >       next execbuff.
> >>>>>      >>>>   >       But now it is non-deferred (immediate in
> >>>>>that sense).
> >>>>>      >>>>But yah,
> >>>>>      >>>>   this is
> >>>>>      >>>>   >       confusing
> >>>>>      >>>>   >       and will fix it.
> >>>>>      >>>>   >
> >>>>>      >>>>   >       >>
> >>>>>      >>>>   >       >> I have a question on the behavior of the bind
> >>>>>      >>>>operation when
> >>>>>      >>>>   no
> >>>>>      >>>>   >       input fence
> >>>>>      >>>>   >       >> is provided. Let say I do :
> >>>>>      >>>>   >       >>
> >>>>>      >>>>   >       >> VM_BIND (out_fence=fence1)
> >>>>>      >>>>   >       >>
> >>>>>      >>>>   >       >> VM_BIND (out_fence=fence2)
> >>>>>      >>>>   >       >>
> >>>>>      >>>>   >       >> VM_BIND (out_fence=fence3)
> >>>>>      >>>>   >       >>
> >>>>>      >>>>   >       >>
> >>>>>      >>>>   >       >> In what order are the fences going to
> >>>>>be signaled?
> >>>>>      >>>>   >       >>
> >>>>>      >>>>   >       >> In the order of VM_BIND ioctls? Or out
> >>>>>of order?
> >>>>>      >>>>   >       >>
> >>>>>      >>>>   >       >> Because you wrote "serialized" I assume
> >>>>>it's: in
> >>>>>      order
> >>>>>      >>>>   >       >>
> >>>>>      >>>>   >
> >>>>>      >>>>   >       Yes, in the order of VM_BIND/UNBIND
> >>>>>ioctls. Note that
> >>>>>      >>>>bind and
> >>>>>      >>>>   unbind
> >>>>>      >>>>   >       will use
> >>>>>      >>>>   >       the same queue and hence are ordered.
> >>>>>      >>>>   >
> >>>>>      >>>>   >       >>
> >>>>>      >>>>   >       >> One thing I didn't realize is that
> >>>>>because we only
> >>>>>      get one
> >>>>>      >>>>   >       "VM_BIND" engine,
> >>>>>      >>>>   >       >> there is a disconnect from the Vulkan
> >>>>>specification.
> >>>>>      >>>>   >       >>
> >>>>>      >>>>   >       >> In Vulkan VM_BIND operations are
> >>>>>serialized but
> >>>>>      >>>>per engine.
> >>>>>      >>>>   >       >>
> >>>>>      >>>>   >       >> So you could have something like this :
> >>>>>      >>>>   >       >>
> >>>>>      >>>>   >       >> VM_BIND (engine=rcs0, in_fence=fence1,
> >>>>>      out_fence=fence2)
> >>>>>      >>>>   >       >>
> >>>>>      >>>>   >       >> VM_BIND (engine=ccs0, in_fence=fence3,
> >>>>>      out_fence=fence4)
> >>>>>      >>>>   >       >>
> >>>>>      >>>>   >       >>
> >>>>>      >>>>   >       >> fence1 is not signaled
> >>>>>      >>>>   >       >>
> >>>>>      >>>>   >       >> fence3 is signaled
> >>>>>      >>>>   >       >>
> >>>>>      >>>>   >       >> So the second VM_BIND will proceed before the
> >>>>>      >>>>first VM_BIND.
> >>>>>      >>>>   >       >>
> >>>>>      >>>>   >       >>
> >>>>>      >>>>   >       >> I guess we can deal with that scenario in
> >>>>>      >>>>userspace by doing
> >>>>>      >>>>   the
> >>>>>      >>>>   >       wait
> >>>>>      >>>>   >       >> ourselves in one thread per engines.
> >>>>>      >>>>   >       >>
> >>>>>      >>>>   >       >> But then it makes the VM_BIND input
> >>>>>fences useless.
> >>>>>      >>>>   >       >>
> >>>>>      >>>>   >       >>
> >>>>>      >>>>   >       >> Daniel : what do you think? Should be
> >>>>>rework this or
> >>>>>      just
> >>>>>      >>>>   deal with
> >>>>>      >>>>   >       wait
> >>>>>      >>>>   >       >> fences in userspace?
> >>>>>      >>>>   >       >>
> >>>>>      >>>>   >       >
> >>>>>      >>>>   >       >My opinion is rework this but make the
> >>>>>ordering via
> >>>>>      >>>>an engine
> >>>>>      >>>>   param
> >>>>>      >>>>   >       optional.
> >>>>>      >>>>   >       >
> >>>>>      >>>>   >       >e.g. A VM can be configured so all binds
> >>>>>are ordered
> >>>>>      >>>>within the
> >>>>>      >>>>   VM
> >>>>>      >>>>   >       >
> >>>>>      >>>>   >       >e.g. A VM can be configured so all binds
> >>>>>accept an
> >>>>>      engine
> >>>>>      >>>>   argument
> >>>>>      >>>>   >       (in
> >>>>>      >>>>   >       >the case of the i915 likely this is a
> >>>>>gem context
> >>>>>      >>>>handle) and
> >>>>>      >>>>   binds
> >>>>>      >>>>   >       >ordered with respect to that engine.
> >>>>>      >>>>   >       >
> >>>>>      >>>>   >       >This gives UMDs options as the later
> >>>>>likely consumes
> >>>>>      >>>>more KMD
> >>>>>      >>>>   >       resources
> >>>>>      >>>>   >       >so if a different UMD can live with binds being
> >>>>>      >>>>ordered within
> >>>>>      >>>>   the VM
> >>>>>      >>>>   >       >they can use a mode consuming less resources.
> >>>>>      >>>>   >       >
> >>>>>      >>>>   >
> >>>>>      >>>>   >       I think we need to be careful here if we
> >>>>>are looking
> >>>>>      for some
> >>>>>      >>>>   out of
> >>>>>      >>>>   >       (submission) order completion of vm_bind/unbind.
> >>>>>      >>>>   >       In-order completion means, in a batch of
> >>>>>binds and
> >>>>>      >>>>unbinds to be
> >>>>>      >>>>   >       completed in-order, user only needs to specify
> >>>>>      >>>>in-fence for the
> >>>>>      >>>>   >       first bind/unbind call and the out-fence
> >>>>>for the last
> >>>>>      >>>>   bind/unbind
> >>>>>      >>>>   >       call. Also, the VA released by an unbind
> >>>>>call can be
> >>>>>      >>>>re-used by
> >>>>>      >>>>   >       any subsequent bind call in that in-order batch.
> >>>>>      >>>>   >
> >>>>>      >>>>   >       These things will break if
> >>>>>binding/unbinding were to
> >>>>>      >>>>be allowed
> >>>>>      >>>>   to
> >>>>>      >>>>   >       go out of order (of submission) and user
> >>>>>need to be
> >>>>>      extra
> >>>>>      >>>>   careful
> >>>>>      >>>>   >       not to run into premature triggering of
> >>>>>out-fence and
> >>>>>      bind
> >>>>>      >>>>   failing
> >>>>>      >>>>   >       as VA is still in use etc.
> >>>>>      >>>>   >
> >>>>>      >>>>   >       Also, VM_BIND binds the provided mapping on the
> >>>>>      specified
> >>>>>      >>>>   address
> >>>>>      >>>>   >       space
> >>>>>      >>>>   >       (VM). So, the uapi is not engine/context
> >>>>>specific.
> >>>>>      >>>>   >
> >>>>>      >>>>   >       We can however add a 'queue' to the uapi
> >>>>>which can be
> >>>>>      >>>>one from
> >>>>>      >>>>   the
> >>>>>      >>>>   >       pre-defined queues,
> >>>>>      >>>>   >       I915_VM_BIND_QUEUE_0
> >>>>>      >>>>   >       I915_VM_BIND_QUEUE_1
> >>>>>      >>>>   >       ...
> >>>>>      >>>>   >       I915_VM_BIND_QUEUE_(N-1)
> >>>>>      >>>>   >
> >>>>>      >>>>   >       KMD will spawn an async work queue for
> >>>>>each queue which
> >>>>>      will
> >>>>>      >>>>   only
> >>>>>      >>>>   >       bind the mappings on that queue in the order of
> >>>>>      submission.
> >>>>>      >>>>   >       User can assign the queue to per engine
> >>>>>or anything
> >>>>>      >>>>like that.
> >>>>>      >>>>   >
> >>>>>      >>>>   >       But again here, user need to be careful and not
> >>>>>      >>>>deadlock these
> >>>>>      >>>>   >       queues with circular dependency of fences.
> >>>>>      >>>>   >
> >>>>>      >>>>   >       I prefer adding this later as an
> >>>>>extension based on
> >>>>>      >>>>whether it
> >>>>>      >>>>   >       is really helping with the implementation.
> >>>>>      >>>>   >
> >>>>>      >>>>   >     I can tell you right now that having
> >>>>>everything on a
> >>>>>      single
> >>>>>      >>>>   in-order
> >>>>>      >>>>   >     queue will not get us the perf we want.
> >>>>>What vulkan
> >>>>>      >>>>really wants
> >>>>>      >>>>   is one
> >>>>>      >>>>   >     of two things:
> >>>>>      >>>>   >      1. No implicit ordering of VM_BIND ops.  They just
> >>>>>      happen in
> >>>>>      >>>>   whatever
> >>>>>      >>>>   >     their dependencies are resolved and we
> >>>>>ensure ordering
> >>>>>      >>>>ourselves
> >>>>>      >>>>   by
> >>>>>      >>>>   >     having a syncobj in the VkQueue.
> >>>>>      >>>>   >      2. The ability to create multiple VM_BIND
> >>>>>queues.  We
> >>>>>      need at
> >>>>>      >>>>   least 2
> >>>>>      >>>>   >     but I don't see why there needs to be a
> >>>>>limit besides
> >>>>>      >>>>the limits
> >>>>>      >>>>   the
> >>>>>      >>>>   >     i915 API already has on the number of
> >>>>>engines.  Vulkan
> >>>>>      could
> >>>>>      >>>>   expose
> >>>>>      >>>>   >     multiple sparse binding queues to the
> >>>>>client if it's not
> >>>>>      >>>>   arbitrarily
> >>>>>      >>>>   >     limited.
> >>>>>      >>>>
> >>>>>      >>>>   Thanks Jason, Lionel.
> >>>>>      >>>>
> >>>>>      >>>>   Jason, what are you referring to when you say
> >>>>>"limits the i915
> >>>>>      API
> >>>>>      >>>>   already
> >>>>>      >>>>   has on the number of engines"? I am not sure if
> >>>>>there is such
> >>>>>      an uapi
> >>>>>      >>>>   today.
> >>>>>      >>>>
> >>>>>      >>>> There's a limit of something like 64 total engines
> >>>>>today based on
> >>>>>      the
> >>>>>      >>>> number of bits we can cram into the exec flags in
> >>>>>execbuffer2.  I
> >>>>>      think
> >>>>>      >>>> someone had an extended version that allowed more
> >>>>>but I ripped it
> >>>>>      out
> >>>>>      >>>> because no one was using it.  Of course,
> >>>>>execbuffer3 might not
> >>>>>      >>>>have that
> >>>>>      >>>> problem at all.
> >>>>>      >>>>
> >>>>>      >>>
> >>>>>      >>>Thanks Jason.
> >>>>>      >>>Ok, I am not sure which exec flag is that, but yah,
> >>>>>execbuffer3
> >>>>>      probably
> >>>>>      >>>will not have this limitation. So, we need to define a
> >>>>>      VM_BIND_MAX_QUEUE
> >>>>>      >>>and somehow export it to user (I am thinking of
> >>>>>embedding it in
> >>>>>      >>>I915_PARAM_HAS_VM_BIND. bits[0]->HAS_VM_BIND, bits[1-3]->'n'
> >>>>>      meaning 2^n
> >>>>>      >>>queues.
> >>>>>      >>
> >>>>>      >>Ah, I think you are talking about I915_EXEC_RING_MASK
> >>>>>(0x3f) which
> >>>>>      execbuf3
> >>>>>
> >>>>>    Yup!  That's exactly the limit I was talking about.
> >>>>>
> >>>>>      >>will also have. So, we can simply define in vm_bind/unbind
> >>>>>      structures,
> >>>>>      >>
> >>>>>      >>#define I915_VM_BIND_MAX_QUEUE   64
> >>>>>      >>        __u32 queue;
> >>>>>      >>
> >>>>>      >>I think that will keep things simple.
> >>>>>      >
> >>>>>      >Hmmm? What does the execbuf2 limit have to do with how many engines
> >>>>>      >hardware can have? I suggest not to do that.
> >>>>>      >
> >>>>>      >The change which added this:
> >>>>>      >
> >>>>>      >       if (set.num_engines > I915_EXEC_RING_MASK + 1)
> >>>>>      >               return -EINVAL;
> >>>>>      >
> >>>>>      >to context creation needs to be undone, so as to let users
> >>>>>create engine
> >>>>>      >maps with all hardware engines, and let execbuf3 access
> >>>>>them all.
> >>>>>      >
> >>>>>
> >>>>>      Earlier plan was to carry I915_EXEC_RING_MASK (0x3f) to
> >>>>>execbuff3 also.
> >>>>>      Hence, I was using the same limit for VM_BIND queues
> >>>>>(64, or 65 if we
> >>>>>      make it N+1).
> >>>>>      But, as discussed in another thread of this RFC series, we
> >>>>>are planning
> >>>>>      to drop this I915_EXEC_RING_MASK in execbuff3. So, there won't be
> >>>>>      any uapi that limits the number of engines (and hence
> >>>>>the vm_bind
> >>>>>      queues
> >>>>>      need to be supported).
> >>>>>
> >>>>>      If we leave the number of vm_bind queues to be arbitrarily large
> >>>>>      (__u32 queue_idx) then, we need to have a hashmap for
> >>>>>queue (a wq,
> >>>>>      work_item and a linked list) lookup from the user
> >>>>>specified queue
> >>>>>      index.
> >>>>>      Other option is to just put some hard limit (say 64 or
> >>>>>65) and use
> >>>>>      an array of queues in VM (each created upon first use).
> >>>>>I prefer this.
> >>>>>
> >>>>>    I don't get why a VM_BIND queue is any different from any
> >>>>>other queue or
> >>>>>    userspace-visible kernel object.  But I'll leave those
> >>>>>details up to
> >>>>>    danvet or whoever else might be reviewing the implementation.
> >>>>>    --Jason
> >>>>>
> >>>>>  I kind of agree here. Wouldn't it be simpler to have the bind
> >>>>>queue created
> >>>>>  like the others when we build the engine map?
> >>>>>
> >>>>>  For userspace it's then just a matter of selecting the right
> >>>>>queue ID when
> >>>>>  submitting.
> >>>>>
> >>>>>  If there is ever a possibility to have this work on the GPU,
> >>>>>it would be
> >>>>>  all ready.
> >>>>>
> >>>>
> >>>>I did sync offline with Matt Brost on this.
> >>>>We can add a VM_BIND engine class and let user create VM_BIND
> >>>>engines (queues).
> >>>>The problem is, in i915 the engine creation interface is bound to
> >>>>gem_context.
> >>>>So, in vm_bind ioctl, we would need both context_id and
> >>>>queue_idx for proper
> >>>>lookup of the user created engine. This is a bit awkward as vm_bind is an
> >>>>interface to VM (address space) and has nothing to do with gem_context.
> >>>
> >>>
> >>>A gem_context has a single vm object right?
> >>>
> >>>Set through I915_CONTEXT_PARAM_VM at creation or given a default
> >>>one if not.
> >>>
> >>>So it's just like picking up the vm like it's done at execbuffer
> >>>time right now: eb->context->vm
> >>>
> >>
> >>Are you suggesting replacing 'vm_id' with 'context_id' in the
> >>VM_BIND/UNBIND
> >>ioctl and probably call it CONTEXT_BIND/UNBIND, because VM can be
> >>obtained
> >>from the context?
> >
> >
> >Yes, because if we go for engines, they're associated with a context
> >and so also associated with the VM bound to the context.
> >
> 
> Hmm...context doesn't sound like the right interface. It should be
> VM and engine (independent of context). Engine can be virtual or soft
> engine (kernel thread), each with its own queue. We can add an interface
> to create such engines (independent of context). But we are anyway
> implicitly creating it when user uses a new queue_idx. If in future
> we have hardware engines for VM_BIND operation, we can have that
> explicit interface to create engine instances and the queue_index
> in vm_bind/unbind will point to those engines.
> Anyone has any thoughts? Daniel?

Exposing gem_context or intel_context to user space is a strange concept to me. A context represents the HW resources used to complete a certain task. User space should only care about allocating resources (memory, queues) and submitting tasks to queues; it doesn't care how a certain task is mapped to a HW context - driver/GuC should take care of this.

So a cleaner interface to me is: user space creates a vm, creates a gem object and vm_binds it to the vm; allocates queues for this vm (queues internally represent compute or blitter HW and can be virtual to the user); and submits tasks to those queues. User can create multiple queues under one vm, but one queue belongs to only one vm.
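
To make that concrete, a rough sketch of the flow (the wrapper names below are hypothetical, just for illustration; they are not existing i915 uapi):

    /* create an address space and back part of it with a gem object */
    vm_id = gem_vm_create(fd);
    bo = gem_create(fd, bo_size);
    gem_vm_bind(fd, vm_id, bo, 0 /* obj offset */, va, bo_size);

    /* create queues on this vm; which HW runs them is up to i915/GuC */
    q0 = queue_create(fd, vm_id);
    q1 = queue_create(fd, vm_id);

    /* submit tasks to a queue; no HW context or engine is visible here */
    queue_submit(fd, q0, batch_va);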

The i915 driver/GuC manages the HW compute or blitter resources, which are transparent to user space. When i915 or GuC decides to schedule a queue (run tasks on that queue), a HW engine will be picked up and set up properly for the vm of that queue (i.e., switched to the page tables of that vm) - this is a context switch.

From the vm_bind perspective, it simply binds a gem object to a vm. Engine/queue is not a parameter to vm_bind, as any engine can be picked up by i915/GuC to execute a task using the vm's bound VAs.
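
For reference, the bind ioctl payload proposed in this RFC already looks roughly like the below (a simplified sketch of struct drm_i915_gem_vm_bind from Documentation/gpu/rfc/i915_vm_bind.h; the exact layout may differ, but note there is no engine/queue member):

    struct drm_i915_gem_vm_bind {
            __u32 vm_id;      /* vm (address space) to bind in */
            __u32 handle;     /* gem object to bind */
            __u64 start;      /* virtual address start of the mapping */
            __u64 offset;     /* offset into the object */
            __u64 length;     /* length of the mapping */
            __u64 flags;
            __u64 extensions;
    };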

I didn't completely follow the discussion here, just sharing some thoughts.

Regards,
Oak

> 
> Niranjana
> 
> >
> >>I think the interface is clean as an interface to VM. It is only that we
> >>don't have a clean way to create a raw VM_BIND engine (not
> >>associated with
> >>any context) with i915 uapi.
> >>May be we can add such an interface, but I don't think that is worth it
> >>(we might as well just use a queue_idx in VM_BIND/UNBIND ioctl as I
> >>mentioned
> >>above).
> >>Anyone has any thoughts?
> >>
> >>>
> >>>>Another problem is, if two VMs are binding with the same defined
> >>>>engine,
> >>>>binding on VM1 can get unnecessarily blocked by binding on VM2
> >>>>(which may be
> >>>>waiting on its in_fence).
> >>>
> >>>
> >>>Maybe I'm missing something, but how can you have 2 vm objects
> >>>with a single gem_context right now?
> >>>
> >>
> >>No, we don't have 2 VMs for a gem_context.
> >>Say if ctx1 with vm1 and ctx2 with vm2.
> >>First vm_bind call was for vm1 with q_idx 1 in ctx1 engine map.
> >>Second vm_bind call was for vm2 with q_idx 2 in ctx2 engine map. If
> >>those two queue indices point to the same underlying vm_bind engine,
> >>then the second vm_bind call gets blocked until the first vm_bind call's
> >>'in' fence is triggered and bind completes.
> >>
> >>With per VM queues, this is not a problem as two VMs will not end up
> >>sharing the same queue.
> >>
> >>BTW, I just posted an updated PATCH series.
> >>https://www.spinics.net/lists/dri-devel/msg350483.html
> >>
> >>Niranjana
> >>
> >>>
> >>>>
> >>>>So, my preference here is to just add a 'u32 queue' index in
> >>>>vm_bind/unbind
> >>>>ioctl, and the queues are per VM.
> >>>>
> >>>>Niranjana
> >>>>
> >>>>>  Thanks,
> >>>>>
> >>>>>  -Lionel
> >>>>>
> >>>>>
> >>>>>      Niranjana
> >>>>>
> >>>>>      >Regards,
> >>>>>      >
> >>>>>      >Tvrtko
> >>>>>      >
> >>>>>      >>
> >>>>>      >>Niranjana
> >>>>>      >>
> >>>>>      >>>
> >>>>>      >>>>   I am trying to see how many queues we need and
> >>>>>don't want it to
> >>>>>      be
> >>>>>      >>>>   arbitrarily
> >>>>>      >>>>   large and unduly blow up memory usage and
> >>>>>complexity in i915
> >>>>>      driver.
> >>>>>      >>>>
> >>>>>      >>>> I expect a Vulkan driver to use at most 2 in the
> >>>>>vast majority
> >>>>>      >>>>of cases. I
> >>>>>      >>>> could imagine a client wanting to create more than 1 sparse
> >>>>>      >>>>queue in which
> >>>>>      >>>> case, it'll be N+1 but that's unlikely. As far as
> >>>>>complexity
> >>>>>      >>>>goes, once
> >>>>>      >>>> you allow two, I don't think the complexity is going up by
> >>>>>      >>>>allowing N.  As
> >>>>>      >>>> for memory usage, creating more queues means more
> >>>>>memory.  That's
> >>>>>      a
> >>>>>      >>>> trade-off that userspace can make. Again, the
> >>>>>expected number
> >>>>>      >>>>here is 1
> >>>>>      >>>> or 2 in the vast majority of cases so I don't think
> >>>>>you need to
> >>>>>      worry.
> >>>>>      >>>
> >>>>>      >>>Ok, will start with n=3 meaning 8 queues.
> >>>>>      >>>That would require us create 8 workqueues.
> >>>>>      >>>We can change 'n' later if required.
> >>>>>      >>>
> >>>>>      >>>Niranjana
> >>>>>      >>>
> >>>>>      >>>>
> >>>>>      >>>>   >     Why?  Because Vulkan has two basic kind of bind
> >>>>>      >>>>operations and we
> >>>>>      >>>>   don't
> >>>>>      >>>>   >     want any dependencies between them:
> >>>>>      >>>>   >      1. Immediate.  These happen right after BO
> >>>>>creation or
> >>>>>      >>>>maybe as
> >>>>>      >>>>   part of
> >>>>>      >>>>   >     vkBindImageMemory() or VkBindBufferMemory().  These
> >>>>>      >>>>don't happen
> >>>>>      >>>>   on a
> >>>>>      >>>>   >     queue and we don't want them serialized
> >>>>>with anything.  To
> >>>>>      >>>>   synchronize
> >>>>>      >>>>   >     with submit, we'll have a syncobj in the
> >>>>>VkDevice which
> >>>>>      is
> >>>>>      >>>>   signaled by
> >>>>>      >>>>   >     all immediate bind operations and make
> >>>>>submits wait on
> >>>>>      it.
> >>>>>      >>>>   >      2. Queued (sparse): These happen on a
> >>>>>VkQueue which may
> >>>>>      be the
> >>>>>      >>>>   same as
> >>>>>      >>>>   >     a render/compute queue or may be its own
> >>>>>queue.  It's up
> >>>>>      to us
> >>>>>      >>>>   what we
> >>>>>      >>>>   >     want to advertise.  From the Vulkan API
> >>>>>PoV, this is like
> >>>>>      any
> >>>>>      >>>>   other
> >>>>>      >>>>   >     queue.  Operations on it wait on and signal
> >>>>>semaphores.  If we
> >>>>>      >>>>   have a
> >>>>>      >>>>   >     VM_BIND engine, we'd provide syncobjs to wait and
> >>>>>      >>>>signal just like
> >>>>>      >>>>   we do
> >>>>>      >>>>   >     in execbuf().
> >>>>>      >>>>   >     The important thing is that we don't want
> >>>>>one type of
> >>>>>      >>>>operation to
> >>>>>      >>>>   block
> >>>>>      >>>>   >     on the other.  If immediate binds are
> >>>>>blocking on sparse
> >>>>>      binds,
> >>>>>      >>>>   it's
> >>>>>      >>>>   >     going to cause over-synchronization issues.
> >>>>>      >>>>   >     In terms of the internal implementation, I
> >>>>>know that
> >>>>>      >>>>there's going
> >>>>>      >>>>   to be
> >>>>>      >>>>   >     a lock on the VM and that we can't actually
> >>>>>do these
> >>>>>      things in
> >>>>>      >>>>   >     parallel.  That's fine. Once the dma_fences have
> >>>>>      signaled and
> >>>>>      >>>>   we're
> >>>>>      >>>>
> >>>>>      >>>>   That's correct. It is like a single VM_BIND engine with
> >>>>>      >>>>multiple queues
> >>>>>      >>>>   feeding to it.
> >>>>>      >>>>
> >>>>>      >>>> Right.  As long as the queues themselves are
> >>>>>independent and
> >>>>>      >>>>can block on
> >>>>>      >>>> dma_fences without holding up other queues, I think
> >>>>>we're fine.
> >>>>>      >>>>
> >>>>>      >>>>   >     unblocked to do the bind operation, I don't care if
> >>>>>      >>>>there's a bit
> >>>>>      >>>>   of
> >>>>>      >>>>   >     synchronization due to locking.  That's
> >>>>>expected.  What
> >>>>>      >>>>we can't
> >>>>>      >>>>   afford
> >>>>>      >>>>   >     to have is an immediate bind operation
> >>>>>suddenly blocking
> >>>>>      on a
> >>>>>      >>>>   sparse
> >>>>>      >>>>   >     operation which is blocked on a compute job
> >>>>>that's going
> >>>>>      to run
> >>>>>      >>>>   for
> >>>>>      >>>>   >     another 5ms.
> >>>>>      >>>>
> >>>>>      >>>>   As the VM_BIND queue is per VM, VM_BIND on one VM
> >>>>>doesn't block
> >>>>>      the
> >>>>>      >>>>   VM_BIND
> >>>>>      >>>>   on other VMs. I am not sure about use cases here, but just
> >>>>>      wanted to
> >>>>>      >>>>   clarify.
> >>>>>      >>>>
> >>>>>      >>>> Yes, that's what I would expect.
> >>>>>      >>>> --Jason
> >>>>>      >>>>
> >>>>>      >>>>   Niranjana
> >>>>>      >>>>
> >>>>>      >>>>   >     For reference, Windows solves this by allowing
> >>>>>      arbitrarily many
> >>>>>      >>>>   paging
> >>>>>      >>>>   >     queues (what they call a VM_BIND
> >>>>>engine/queue).  That
> >>>>>      >>>>design works
> >>>>>      >>>>   >     pretty well and solves the problems in
> >>>>>question.  Again, we could
> >>>>>      >>>>   just
> >>>>>      >>>>   >     make everything out-of-order and require
> >>>>>using syncobjs
> >>>>>      >>>>to order
> >>>>>      >>>>   things
> >>>>>      >>>>   >     as userspace wants. That'd be fine too.
> >>>>>      >>>>   >     One more note while I'm here: danvet said
> >>>>>something on
> >>>>>      >>>>IRC about
> >>>>>      >>>>   VM_BIND
> >>>>>      >>>>   >     queues waiting for syncobjs to
> >>>>>materialize.  We don't
> >>>>>      really
> >>>>>      >>>>   want/need
> >>>>>      >>>>   >     this.  We already have all the machinery in
> >>>>>userspace to
> >>>>>      handle
> >>>>>      >>>>   >     wait-before-signal and waiting for syncobj
> >>>>>fences to
> >>>>>      >>>>materialize
> >>>>>      >>>>   and
> >>>>>      >>>>   >     that machinery is on by default.  It would actually
> >>>>>      >>>>take MORE work
> >>>>>      >>>>   in
> >>>>>      >>>>   >     Mesa to turn it off and take advantage of
> >>>>>the kernel
> >>>>>      >>>>being able to
> >>>>>      >>>>   wait
> >>>>>      >>>>   >     for syncobjs to materialize. Also, getting
> >>>>>that right is
> >>>>>      >>>>   ridiculously
> >>>>>      >>>>   >     hard and I really don't want to get it
> >>>>>wrong in kernel
> >>>>>      >>>>space.  When we
> >>>>>      >>>>   >     do memory fences, wait-before-signal will
> >>>>>be a thing.  We
> >>>>>      don't
> >>>>>      >>>>   need to
> >>>>>      >>>>   >     try and make it a thing for syncobj.
> >>>>>      >>>>   >     --Jason
> >>>>>      >>>>   >
> >>>>>      >>>>   >   Thanks Jason,
> >>>>>      >>>>   >
> >>>>>      >>>>   >   I missed the bit in the Vulkan spec that
> >>>>>we're allowed to
> >>>>>      have a
> >>>>>      >>>>   sparse
> >>>>>      >>>>   >   queue that does not implement either graphics
> >>>>>or compute
> >>>>>      >>>>operations
> >>>>>      >>>>   :
> >>>>>      >>>>   >
> >>>>>      >>>>   >     "While some implementations may include
> >>>>>      >>>>   VK_QUEUE_SPARSE_BINDING_BIT
> >>>>>      >>>>   >     support in queue families that also include
> >>>>>      >>>>   >
> >>>>>      >>>>   >      graphics and compute support, other
> >>>>>implementations may
> >>>>>      only
> >>>>>      >>>>   expose a
> >>>>>      >>>>   > VK_QUEUE_SPARSE_BINDING_BIT-only queue
> >>>>>      >>>>   >
> >>>>>      >>>>   >      family."
> >>>>>      >>>>   >
> >>>>>      >>>>   >   So it can all be a vm_bind engine that just does
> >>>>>      bind/unbind
> >>>>>      >>>>   >   operations.
> >>>>>      >>>>   >
> >>>>>      >>>>   >   But yes we need another engine for the
> >>>>>immediate/non-sparse
> >>>>>      >>>>   operations.
> >>>>>      >>>>   >
> >>>>>      >>>>   >   -Lionel
> >>>>>      >>>>   >
> >>>>>      >>>>   >         >
> >>>>>      >>>>   >       Daniel, any thoughts?
> >>>>>      >>>>   >
> >>>>>      >>>>   >       Niranjana
> >>>>>      >>>>   >
> >>>>>      >>>>   >       >Matt
> >>>>>      >>>>   >       >
> >>>>>      >>>>   >       >>
> >>>>>      >>>>   >       >> Sorry I noticed this late.
> >>>>>      >>>>   >       >>
> >>>>>      >>>>   >       >>
> >>>>>      >>>>   >       >> -Lionel
> >>>>>      >>>>   >       >>
> >>>>>      >>>>   >       >>
> >>>
> >>>
> >

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [Intel-gfx] [RFC v3 1/3] drm/doc/rfc: VM_BIND feature design document
  2022-06-13 13:33                                         ` Zeng, Oak
@ 2022-06-13 18:02                                           ` Niranjana Vishwanathapura
  -1 siblings, 0 replies; 121+ messages in thread
From: Niranjana Vishwanathapura @ 2022-06-13 18:02 UTC (permalink / raw)
  To: Zeng, Oak
  Cc: Wilson, Chris P, Intel GFX, Maling list - DRI developers,
	Hellstrom, Thomas, Landwerlin, Lionel G, Vetter, Daniel,
	Christian König

On Mon, Jun 13, 2022 at 06:33:07AM -0700, Zeng, Oak wrote:
>
>
>Regards,
>Oak
>
>> -----Original Message-----
>> From: Intel-gfx <intel-gfx-bounces@lists.freedesktop.org> On Behalf Of Niranjana
>> Vishwanathapura
>> Sent: June 10, 2022 1:43 PM
>> To: Landwerlin, Lionel G <lionel.g.landwerlin@intel.com>
>> Cc: Intel GFX <intel-gfx@lists.freedesktop.org>; Maling list - DRI developers <dri-
>> devel@lists.freedesktop.org>; Hellstrom, Thomas <thomas.hellstrom@intel.com>;
>> Wilson, Chris P <chris.p.wilson@intel.com>; Vetter, Daniel
>> <daniel.vetter@intel.com>; Christian König <christian.koenig@amd.com>
>> Subject: Re: [Intel-gfx] [RFC v3 1/3] drm/doc/rfc: VM_BIND feature design
>> document
>>
>> On Fri, Jun 10, 2022 at 11:18:14AM +0300, Lionel Landwerlin wrote:
>> >On 10/06/2022 10:54, Niranjana Vishwanathapura wrote:
>> >>On Fri, Jun 10, 2022 at 09:53:24AM +0300, Lionel Landwerlin wrote:
>> >>>On 09/06/2022 22:31, Niranjana Vishwanathapura wrote:
>> >>>>On Thu, Jun 09, 2022 at 05:49:09PM +0300, Lionel Landwerlin wrote:
>> >>>>>  On 09/06/2022 00:55, Jason Ekstrand wrote:
>> >>>>>
>> >>>>>    On Wed, Jun 8, 2022 at 4:44 PM Niranjana Vishwanathapura
>> >>>>>    <niranjana.vishwanathapura@intel.com> wrote:
>> >>>>>
>> >>>>>      On Wed, Jun 08, 2022 at 08:33:25AM +0100, Tvrtko Ursulin wrote:
>> >>>>>      >
>> >>>>>      >
>> >>>>>      >On 07/06/2022 22:32, Niranjana Vishwanathapura wrote:
>> >>>>>      >>On Tue, Jun 07, 2022 at 11:18:11AM -0700, Niranjana
>> >>>>>Vishwanathapura
>> >>>>>      wrote:
>> >>>>>      >>>On Tue, Jun 07, 2022 at 12:12:03PM -0500, Jason
>> >>>>>Ekstrand wrote:
>> >>>>>      >>>> On Fri, Jun 3, 2022 at 6:52 PM Niranjana Vishwanathapura
>> >>>>>      >>>> <niranjana.vishwanathapura@intel.com> wrote:
>> >>>>>      >>>>
>> >>>>>      >>>>   On Fri, Jun 03, 2022 at 10:20:25AM +0300, Lionel
>> >>>>>Landwerlin
>> >>>>>      wrote:
>> >>>>>      >>>>   >   On 02/06/2022 23:35, Jason Ekstrand wrote:
>> >>>>>      >>>>   >
>> >>>>>      >>>>   >     On Thu, Jun 2, 2022 at 3:11 PM Niranjana
>> >>>>>Vishwanathapura
>> >>>>>      >>>>   > <niranjana.vishwanathapura@intel.com> wrote:
>> >>>>>      >>>>   >
>> >>>>>      >>>>   >       On Wed, Jun 01, 2022 at 01:28:36PM -0700, Matthew
>> >>>>>      >>>>Brost wrote:
>> >>>>>      >>>>   >       >On Wed, Jun 01, 2022 at 05:25:49PM +0300, Lionel
>> >>>>>      Landwerlin
>> >>>>>      >>>>   wrote:
>> >>>>>      >>>>   >       >> On 17/05/2022 21:32, Niranjana Vishwanathapura
>> >>>>>      wrote:
>> >>>>>      >>>>   >       >> > +VM_BIND/UNBIND ioctl will immediately start
>> >>>>>      >>>>   binding/unbinding
>> >>>>>      >>>>   >       the mapping in an
>> >>>>>      >>>>   >       >> > +async worker. The binding and
>> >>>>>unbinding will
>> >>>>>      >>>>work like a
>> >>>>>      >>>>   special
>> >>>>>      >>>>   >       GPU engine.
>> >>>>>      >>>>   >       >> > +The binding and unbinding operations are
>> >>>>>      serialized and
>> >>>>>      >>>>   will
>> >>>>>      >>>>   >       wait on specified
>> >>>>>      >>>>   >       >> > +input fences before the operation
>> >>>>>and will signal
>> >>>>>      the
>> >>>>>      >>>>   output
>> >>>>>      >>>>   >       fences upon the
>> >>>>>      >>>>   >       >> > +completion of the operation. Due to
>> >>>>>      serialization,
>> >>>>>      >>>>   completion of
>> >>>>>      >>>>   >       an operation
>> >>>>>      >>>>   >       >> > +will also indicate that all
>> >>>>>previous operations
>> >>>>>      >>>>are also
>> >>>>>      >>>>   >       complete.
>> >>>>>      >>>>   >       >>
>> >>>>>      >>>>   >       >> I guess we should avoid saying "will
>> >>>>>immediately
>> >>>>>      start
>> >>>>>      >>>>   >       binding/unbinding" if
>> >>>>>      >>>>   >       >> there are fences involved.
>> >>>>>      >>>>   >       >>
>> >>>>>      >>>>   >       >> And the fact that it's happening in an async
>> >>>>>      >>>>worker seem to
>> >>>>>      >>>>   imply
>> >>>>>      >>>>   >       it's not
>> >>>>>      >>>>   >       >> immediate.
>> >>>>>      >>>>   >       >>
>> >>>>>      >>>>   >
>> >>>>>      >>>>   >       Ok, will fix.
>> >>>>>      >>>>   >       This was added because in earlier design
>> >>>>>binding was
>> >>>>>      deferred
>> >>>>>      >>>>   until
>> >>>>>      >>>>   >       next execbuff.
>> >>>>>      >>>>   >       But now it is non-deferred (immediate in
>> >>>>>that sense).
>> >>>>>      >>>>But yah,
>> >>>>>      >>>>   this is
>> >>>>>      >>>>   >       confusing
>> >>>>>      >>>>   >       and will fix it.
>> >>>>>      >>>>   >
>> >>>>>      >>>>   >       >>
>> >>>>>      >>>>   >       >> I have a question on the behavior of the bind
>> >>>>>      >>>>operation when
>> >>>>>      >>>>   no
>> >>>>>      >>>>   >       input fence
>> >>>>>      >>>>   >       >> is provided. Let say I do :
>> >>>>>      >>>>   >       >>
>> >>>>>      >>>>   >       >> VM_BIND (out_fence=fence1)
>> >>>>>      >>>>   >       >>
>> >>>>>      >>>>   >       >> VM_BIND (out_fence=fence2)
>> >>>>>      >>>>   >       >>
>> >>>>>      >>>>   >       >> VM_BIND (out_fence=fence3)
>> >>>>>      >>>>   >       >>
>> >>>>>      >>>>   >       >>
>> >>>>>      >>>>   >       >> In what order are the fences going to
>> >>>>>be signaled?
>> >>>>>      >>>>   >       >>
>> >>>>>      >>>>   >       >> In the order of VM_BIND ioctls? Or out
>> >>>>>of order?
>> >>>>>      >>>>   >       >>
>> >>>>>      >>>>   >       >> Because you wrote "serialized" I assume
>> >>>>>it's: in
>> >>>>>      order
>> >>>>>      >>>>   >       >>
>> >>>>>      >>>>   >
>> >>>>>      >>>>   >       Yes, in the order of VM_BIND/UNBIND
>> >>>>>ioctls. Note that
>> >>>>>      >>>>bind and
>> >>>>>      >>>>   unbind
>> >>>>>      >>>>   >       will use
>> >>>>>      >>>>   >       the same queue and hence are ordered.
>> >>>>>      >>>>   >
>> >>>>>      >>>>   >       >>
>> >>>>>      >>>>   >       >> One thing I didn't realize is that
>> >>>>>because we only
>> >>>>>      get one
>> >>>>>      >>>>   >       "VM_BIND" engine,
>> >>>>>      >>>>   >       >> there is a disconnect from the Vulkan
>> >>>>>specification.
>> >>>>>      >>>>   >       >>
>> >>>>>      >>>>   >       >> In Vulkan VM_BIND operations are
>> >>>>>serialized but
>> >>>>>      >>>>per engine.
>> >>>>>      >>>>   >       >>
>> >>>>>      >>>>   >       >> So you could have something like this :
>> >>>>>      >>>>   >       >>
>> >>>>>      >>>>   >       >> VM_BIND (engine=rcs0, in_fence=fence1,
>> >>>>>      out_fence=fence2)
>> >>>>>      >>>>   >       >>
>> >>>>>      >>>>   >       >> VM_BIND (engine=ccs0, in_fence=fence3,
>> >>>>>      out_fence=fence4)
>> >>>>>      >>>>   >       >>
>> >>>>>      >>>>   >       >>
>> >>>>>      >>>>   >       >> fence1 is not signaled
>> >>>>>      >>>>   >       >>
>> >>>>>      >>>>   >       >> fence3 is signaled
>> >>>>>      >>>>   >       >>
>> >>>>>      >>>>   >       >> So the second VM_BIND will proceed before the
>> >>>>>      >>>>first VM_BIND.
>> >>>>>      >>>>   >       >>
>> >>>>>      >>>>   >       >>
>> >>>>>      >>>>   >       >> I guess we can deal with that scenario in
>> >>>>>      >>>>userspace by doing
>> >>>>>      >>>>   the
>> >>>>>      >>>>   >       wait
>> >>>>>      >>>>   >       >> ourselves in one thread per engines.
>> >>>>>      >>>>   >       >>
>> >>>>>      >>>>   >       >> But then it makes the VM_BIND input
>> >>>>>fences useless.
>> >>>>>      >>>>   >       >>
>> >>>>>      >>>>   >       >>
>> >>>>>      >>>>   >       >> Daniel : what do you think? Should be
>> >>>>>rework this or
>> >>>>>      just
>> >>>>>      >>>>   deal with
>> >>>>>      >>>>   >       wait
>> >>>>>      >>>>   >       >> fences in userspace?
>> >>>>>      >>>>   >       >>
>> >>>>>      >>>>   >       >
>> >>>>>      >>>>   >       >My opinion is rework this but make the
>> >>>>>ordering via
>> >>>>>      >>>>an engine
>> >>>>>      >>>>   param
>> >>>>>      >>>>   >       optional.
>> >>>>>      >>>>   >       >
>> >>>>>      >>>>   >       >e.g. A VM can be configured so all binds
>> >>>>>are ordered
>> >>>>>      >>>>within the
>> >>>>>      >>>>   VM
>> >>>>>      >>>>   >       >
>> >>>>>      >>>>   >       >e.g. A VM can be configured so all binds
>> >>>>>accept an
>> >>>>>      engine
>> >>>>>      >>>>   argument
>> >>>>>      >>>>   >       (in
>> >>>>>      >>>>   >       >the case of the i915 likely this is a
>> >>>>>gem context
>> >>>>>      >>>>handle) and
>> >>>>>      >>>>   binds
>> >>>>>      >>>>   >       >ordered with respect to that engine.
>> >>>>>      >>>>   >       >
>> >>>>>      >>>>   >       >This gives UMDs options as the later
>> >>>>>likely consumes
>> >>>>>      >>>>more KMD
>> >>>>>      >>>>   >       resources
>> >>>>>      >>>>   >       >so if a different UMD can live with binds being
>> >>>>>      >>>>ordered within
>> >>>>>      >>>>   the VM
>> >>>>>      >>>>   >       >they can use a mode consuming less resources.
>> >>>>>      >>>>   >       >
>> >>>>>      >>>>   >
>> >>>>>      >>>>   >       I think we need to be careful here if we
>> >>>>>are looking
>> >>>>>      for some
>> >>>>>      >>>>   out of
>> >>>>>      >>>>   >       (submission) order completion of vm_bind/unbind.
>> >>>>>      >>>>   >       In-order completion means, in a batch of
>> >>>>>binds and
>> >>>>>      >>>>unbinds to be
>> >>>>>      >>>>   >       completed in-order, user only needs to specify
>> >>>>>      >>>>in-fence for the
>> >>>>>      >>>>   >       first bind/unbind call and the out-fence
>> >>>>>for the last
>> >>>>>      >>>>   bind/unbind
>> >>>>>      >>>>   >       call. Also, the VA released by an unbind
>> >>>>>call can be
>> >>>>>      >>>>re-used by
>> >>>>>      >>>>   >       any subsequent bind call in that in-order batch.
>> >>>>>      >>>>   >
>> >>>>>      >>>>   >       These things will break if
>> >>>>>binding/unbinding were to
>> >>>>>      >>>>be allowed
>> >>>>>      >>>>   to
>> >>>>>      >>>>   >       go out of order (of submission) and user
>> >>>>>need to be
>> >>>>>      extra
>> >>>>>      >>>>   careful
>> >>>>>      >>>>   >       not to run into premature triggering of
>> >>>>>out-fence and
>> >>>>>      bind
>> >>>>>      >>>>   failing
>> >>>>>      >>>>   >       as VA is still in use etc.
>> >>>>>      >>>>   >
>> >>>>>      >>>>   >       Also, VM_BIND binds the provided mapping on the
>> >>>>>      specified
>> >>>>>      >>>>   address
>> >>>>>      >>>>   >       space
>> >>>>>      >>>>   >       (VM). So, the uapi is not engine/context
>> >>>>>specific.
>> >>>>>      >>>>   >
>> >>>>>      >>>>   >       We can however add a 'queue' to the uapi
>> >>>>>which can be
>> >>>>>      >>>>one from
>> >>>>>      >>>>   the
>> >>>>>      >>>>   >       pre-defined queues,
>> >>>>>      >>>>   >       I915_VM_BIND_QUEUE_0
>> >>>>>      >>>>   >       I915_VM_BIND_QUEUE_1
>> >>>>>      >>>>   >       ...
>> >>>>>      >>>>   >       I915_VM_BIND_QUEUE_(N-1)
>> >>>>>      >>>>   >
>> >>>>>      >>>>   >       KMD will spawn an async work queue for
>> >>>>>each queue which
>> >>>>>      will
>> >>>>>      >>>>   only
>> >>>>>      >>>>   >       bind the mappings on that queue in the order of
>> >>>>>      submission.
>> >>>>>      >>>>   >       User can assign the queue to per engine
>> >>>>>or anything
>> >>>>>      >>>>like that.
>> >>>>>      >>>>   >
>> >>>>>      >>>>   >       But again here, user need to be careful and not
>> >>>>>      >>>>deadlock these
>> >>>>>      >>>>   >       queues with circular dependency of fences.
>> >>>>>      >>>>   >
>> >>>>>      >>>>   >       I prefer adding this later as an
>> >>>>>extension based on
>> >>>>>      >>>>whether it
>> >>>>>      >>>>   >       is really helping with the implementation.
>> >>>>>      >>>>   >
>> >>>>>      >>>>   >     I can tell you right now that having
>> >>>>>everything on a
>> >>>>>      single
>> >>>>>      >>>>   in-order
>> >>>>>      >>>>   >     queue will not get us the perf we want.
>> >>>>>What vulkan
>> >>>>>      >>>>really wants
>> >>>>>      >>>>   is one
>> >>>>>      >>>>   >     of two things:
>> >>>>>      >>>>   >      1. No implicit ordering of VM_BIND ops.  They just
>> >>>>>      happen in
>> >>>>>      >>>>   whatever
>> >>>>>      >>>>   >     their dependencies are resolved and we
>> >>>>>ensure ordering
>> >>>>>      >>>>ourselves
>> >>>>>      >>>>   by
>> >>>>>      >>>>   >     having a syncobj in the VkQueue.
>> >>>>>      >>>>   >      2. The ability to create multiple VM_BIND
>> >>>>>queues.  We
>> >>>>>      need at
>> >>>>>      >>>>   least 2
>> >>>>>      >>>>   >     but I don't see why there needs to be a
>> >>>>>limit besides
>> >>>>>      >>>>the limits
>> >>>>>      >>>>   the
>> >>>>>      >>>>   >     i915 API already has on the number of
>> >>>>>engines.  Vulkan
>> >>>>>      could
>> >>>>>      >>>>   expose
>> >>>>>      >>>>   >     multiple sparse binding queues to the
>> >>>>>client if it's not
>> >>>>>      >>>>   arbitrarily
>> >>>>>      >>>>   >     limited.
>> >>>>>      >>>>
>> >>>>>      >>>>   Thanks Jason, Lionel.
>> >>>>>      >>>>
>> >>>>>      >>>>   Jason, what are you referring to when you say
>> >>>>>"limits the i915
>> >>>>>      API
>> >>>>>      >>>>   already
>> >>>>>      >>>>   has on the number of engines"? I am not sure if
>> >>>>>there is such
>> >>>>>      an uapi
>> >>>>>      >>>>   today.
>> >>>>>      >>>>
>> >>>>>      >>>> There's a limit of something like 64 total engines
>> >>>>>today based on
>> >>>>>      the
>> >>>>>      >>>> number of bits we can cram into the exec flags in
>> >>>>>execbuffer2.  I
>> >>>>>      think
>> >>>>>      >>>> someone had an extended version that allowed more
>> >>>>>but I ripped it
>> >>>>>      out
>> >>>>>      >>>> because no one was using it.  Of course,
>> >>>>>execbuffer3 might not
>> >>>>>      >>>>have that
>> >>>>>      >>>> problem at all.
>> >>>>>      >>>>
>> >>>>>      >>>
>> >>>>>      >>>Thanks Jason.
>> >>>>>      >>>Ok, I am not sure which exec flag is that, but yah,
>> >>>>>execbuffer3
>> >>>>>      probably
>> >>>>>      >>>will not have this limitation. So, we need to define a
>> >>>>>      VM_BIND_MAX_QUEUE
>> >>>>>      >>>and somehow export it to user (I am thinking of
>> >>>>>embedding it in
>> >>>>>      >>>I915_PARAM_HAS_VM_BIND. bits[0]->HAS_VM_BIND, bits[1-3]->'n'
>> >>>>>      meaning 2^n
>> >>>>>      >>>queues.
>> >>>>>      >>
>> >>>>>      >>Ah, I think you are talking about I915_EXEC_RING_MASK
>> >>>>>(0x3f) which
>> >>>>>      execbuf3
>> >>>>>
>> >>>>>    Yup!  That's exactly the limit I was talking about.
>> >>>>>
>> >>>>>      >>will also have. So, we can simply define in vm_bind/unbind
>> >>>>>      structures,
>> >>>>>      >>
>> >>>>>      >>#define I915_VM_BIND_MAX_QUEUE   64
>> >>>>>      >>        __u32 queue;
>> >>>>>      >>
>> >>>>>      >>I think that will keep things simple.
>> >>>>>      >
>> >>>>>      >Hmmm? What does the execbuf2 limit have to do with how many engines
>> >>>>>      >hardware can have? I suggest not to do that.
>> >>>>>      >
>> >>>>>      >The change which added this:
>> >>>>>      >
>> >>>>>      >       if (set.num_engines > I915_EXEC_RING_MASK + 1)
>> >>>>>      >               return -EINVAL;
>> >>>>>      >
>> >>>>>      >to context creation needs to be undone, so as to let users
>> >>>>>create engine
>> >>>>>      >maps with all hardware engines, and let execbuf3 access
>> >>>>>them all.
>> >>>>>      >
>> >>>>>
>> >>>>>      Earlier plan was to carry I915_EXEC_RING_MASK (0x3f) to
>> >>>>>execbuff3 also.
>> >>>>>      Hence, I was using the same limit for VM_BIND queues
>> >>>>>(64, or 65 if we
>> >>>>>      make it N+1).
>> >>>>>      But, as discussed in another thread of this RFC series, we
>> >>>>>are planning
>> >>>>>      to drop this I915_EXEC_RING_MASK in execbuff3. So, there won't be
>> >>>>>      any uapi that limits the number of engines (and hence
>> >>>>>the vm_bind
>> >>>>>      queues
>> >>>>>      need to be supported).
>> >>>>>
>> >>>>>      If we leave the number of vm_bind queues to be arbitrarily large
>> >>>>>      (__u32 queue_idx) then, we need to have a hashmap for
>> >>>>>queue (a wq,
>> >>>>>      work_item and a linked list) lookup from the user
>> >>>>>specified queue
>> >>>>>      index.
>> >>>>>      Other option is to just put some hard limit (say 64 or
>> >>>>>65) and use
>> >>>>>      an array of queues in VM (each created upon first use).
>> >>>>>I prefer this.
>> >>>>>
>> >>>>>    I don't get why a VM_BIND queue is any different from any
>> >>>>>other queue or
>> >>>>>    userspace-visible kernel object.  But I'll leave those
>> >>>>>details up to
>> >>>>>    danvet or whoever else might be reviewing the implementation.
>> >>>>>    --Jason
>> >>>>>
>> >>>>>  I kind of agree here. Wouldn't it be simpler to have the bind
>> >>>>>queue created
>> >>>>>  like the others when we build the engine map?
>> >>>>>
>> >>>>>  For userspace it's then just a matter of selecting the right
>> >>>>>queue ID when
>> >>>>>  submitting.
>> >>>>>
>> >>>>>  If there is ever a possibility to have this work on the GPU,
>> >>>>>it would be
>> >>>>>  all ready.
>> >>>>>
>> >>>>
>> >>>>I did sync offline with Matt Brost on this.
>> >>>>We can add a VM_BIND engine class and let user create VM_BIND
>> >>>>engines (queues).
>> >>>>The problem is, in i915 the engine creation interface is bound to
>> >>>>gem_context.
>> >>>>So, in vm_bind ioctl, we would need both context_id and
>> >>>>queue_idx for proper
>> >>>>lookup of the user created engine. This is a bit awkward as vm_bind is an
>> >>>>interface to VM (address space) and has nothing to do with gem_context.
>> >>>
>> >>>
>> >>>A gem_context has a single vm object right?
>> >>>
>> >>>Set through I915_CONTEXT_PARAM_VM at creation or given a default
>> >>>one if not.
>> >>>
>> >>>So it's just like picking up the vm like it's done at execbuffer
>> >>>time right now: eb->context->vm
>> >>>
>> >>
>> >>Are you suggesting replacing 'vm_id' with 'context_id' in the
>> >>VM_BIND/UNBIND
>> >>ioctl and probably call it CONTEXT_BIND/UNBIND, because VM can be
>> >>obtained
>> >>from the context?
>> >
>> >
>> >Yes, because if we go for engines, they're associated with a context
>> >and so also associated with the VM bound to the context.
>> >
>>
>> Hmm...context doesn't sound like the right interface. It should be
>> VM and engine (independent of context). Engine can be virtual or soft
>> engine (kernel thread), each with its own queue. We can add an interface
>> to create such engines (independent of context). But we are anyway
>> implicitly creating it when user uses a new queue_idx. If in future
>> we have hardware engines for VM_BIND operation, we can have that
>> explicit interface to create engine instances and the queue_index
>> in vm_bind/unbind will point to those engines.
>> Anyone has any thoughts? Daniel?
>
>Exposing gem_context or intel_context to user space is a strange concept to me. A context represents the HW resources used to complete a certain task. User space should only care about allocating resources (memory, queues) and submitting tasks to queues; it doesn't care how a certain task is mapped to a HW context - driver/GuC should take care of this.
>
>So a cleaner interface to me is: user space creates a vm, creates a gem object and vm_binds it to the vm; allocates queues for this vm (queues internally represent compute or blitter HW and can be virtual to the user); and submits tasks to those queues. User can create multiple queues under one vm, but one queue belongs to only one vm.
>
>The i915 driver/GuC manages the HW compute or blitter resources, which are transparent to user space. When i915 or GuC decides to schedule a queue (run tasks on that queue), a HW engine will be picked up and set up properly for the vm of that queue (i.e., switched to the page tables of that vm) - this is a context switch.
>
>From the vm_bind perspective, it simply binds a gem object to a vm. Engine/queue is not a parameter to vm_bind, as any engine can be picked up by i915/GuC to execute a task using the vm's bound VAs.
>
>I didn't completely follow the discussion here, just sharing some thoughts.
>

Yah, I agree.

Lionel,
How about we define the queue as:
union {
        __u32 queue_idx;
        __u64 rsvd;
};

If required, we can extend by expanding the 'rsvd' field to <ctx_id, queue_idx> later
with a flag.
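
For illustration only (the flag and member names here are hypothetical, not
part of the proposal yet), such an extension could look like:

#define I915_VM_BIND_QUEUE_IN_CTX    (1 << 0)    /* hypothetical flag */

union {
        __u32 queue_idx;            /* default: per-VM queue index */
        struct {
                __u32 ctx_id;       /* gem context owning the engine */
                __u32 queue_idx;    /* engine/queue index within it */
        } ctx_queue;                /* if I915_VM_BIND_QUEUE_IN_CTX is set */
        __u64 rsvd;
};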

Niranjana

>Regards,
>Oak
>
>>
>> Niranjana
>>
>> >
>> >>I think the interface is clean as an interface to VM. It is only that we
>> >>don't have a clean way to create a raw VM_BIND engine (not
>> >>associated with
>> >>any context) with i915 uapi.
>> >>May be we can add such an interface, but I don't think that is worth it
>> >>(we might as well just use a queue_idx in VM_BIND/UNBIND ioctl as I
>> >>mentioned
>> >>above).
>> >>Anyone has any thoughts?
>> >>
>> >>>
>> >>>>Another problem is, if two VMs are binding with the same defined
>> >>>>engine,
>> >>>>binding on VM1 can get unnecessarily blocked by binding on VM2
>> >>>>(which may be
>> >>>>waiting on its in_fence).
>> >>>
>> >>>
>> >>>Maybe I'm missing something, but how can you have 2 vm objects
>> >>>with a single gem_context right now?
>> >>>
>> >>
>> >>No, we don't have 2 VMs for a gem_context.
>> >>Say if ctx1 with vm1 and ctx2 with vm2.
>> >>First vm_bind call was for vm1 with q_idx 1 in ctx1 engine map.
>> >>Second vm_bind call was for vm2 with q_idx 2 in ctx2 engine map. If
>> >>those two queue indices point to the same underlying vm_bind engine,
>> >>then the second vm_bind call gets blocked until the first vm_bind call's
>> >>'in' fence is triggered and bind completes.
>> >>
>> >>With per VM queues, this is not a problem as two VMs will not end up
>> >>sharing the same queue.
>> >>
>> >>BTW, I just posted an updated PATCH series.
>> >>https://www.spinics.net/lists/dri-devel/msg350483.html
>> >>
>> >>Niranjana
>> >>
>> >>>
>> >>>>
>> >>>>So, my preference here is to just add a 'u32 queue' index in
>> >>>>vm_bind/unbind
>> >>>>ioctl, and the queues are per VM.
>> >>>>
>> >>>>Niranjana
>> >>>>
>> >>>>>  Thanks,
>> >>>>>
>> >>>>>  -Lionel
>> >>>>>
>> >>>>>
>> >>>>>      Niranjana
>> >>>>>
>> >>>>>      >Regards,
>> >>>>>      >
>> >>>>>      >Tvrtko
>> >>>>>      >
>> >>>>>      >>
>> >>>>>      >>Niranjana
>> >>>>>      >>
>> >>>>>      >>>
>> >>>>>      >>>>   I am trying to see how many queues we need and
>> >>>>>don't want it to
>> >>>>>      be
>> >>>>>      >>>>   arbitrarily
>> >>>>>      >>>>   large and unduly blow up memory usage and
>> >>>>>complexity in i915
>> >>>>>      driver.
>> >>>>>      >>>>
>> >>>>>      >>>> I expect a Vulkan driver to use at most 2 in the
>> >>>>>vast majority
>> >>>>>      >>>>of cases. I
>> >>>>>      >>>> could imagine a client wanting to create more than 1 sparse
>> >>>>>      >>>>queue in which
>> >>>>>      >>>> case, it'll be N+1 but that's unlikely. As far as
>> >>>>>complexity
>> >>>>>      >>>>goes, once
>> >>>>>      >>>> you allow two, I don't think the complexity is going up by
>> >>>>>      >>>>allowing N.  As
>> >>>>>      >>>> for memory usage, creating more queues means more
>> >>>>>memory.  That's
>> >>>>>      a
>> >>>>>      >>>> trade-off that userspace can make. Again, the
>> >>>>>expected number
>> >>>>>      >>>>here is 1
>> >>>>>      >>>> or 2 in the vast majority of cases so I don't think
>> >>>>>you need to
>> >>>>>      worry.
>> >>>>>      >>>
>> >>>>>      >>>Ok, will start with n=3 meaning 8 queues.
>> >>>>>      >>>That would require us create 8 workqueues.
>> >>>>>      >>>We can change 'n' later if required.
>> >>>>>      >>>
>> >>>>>      >>>Niranjana
>> >>>>>      >>>
>> >>>>>      >>>>
>> >>>>>      >>>>   >     Why?  Because Vulkan has two basic kinds of bind
>> >>>>>      >>>>operations and we
>> >>>>>      >>>>   don't
>> >>>>>      >>>>   >     want any dependencies between them:
>> >>>>>      >>>>   >      1. Immediate.  These happen right after BO
>> >>>>>creation or
>> >>>>>      >>>>maybe as
>> >>>>>      >>>>   part of
>> >>>>>      >>>>   >     vkBindImageMemory() or vkBindBufferMemory().  These
>> >>>>>      >>>>don't happen
>> >>>>>      >>>>   on a
>> >>>>>      >>>>   >     queue and we don't want them serialized
>> >>>>>with anything. To
>> >>>>>      >>>>   synchronize
>> >>>>>      >>>>   >     with submit, we'll have a syncobj in the
>> >>>>>VkDevice which
>> >>>>>      is
>> >>>>>      >>>>   signaled by
>> >>>>>      >>>>   >     all immediate bind operations and make
>> >>>>>submits wait on
>> >>>>>      it.
>> >>>>>      >>>>   >      2. Queued (sparse): These happen on a
>> >>>>>VkQueue which may
>> >>>>>      be the
>> >>>>>      >>>>   same as
>> >>>>>      >>>>   >     a render/compute queue or may be its own
>> >>>>>queue.  It's up
>> >>>>>      to us
>> >>>>>      >>>>   what we
>> >>>>>      >>>>   >     want to advertise.  From the Vulkan API
>> >>>>>PoV, this is like
>> >>>>>      any
>> >>>>>      >>>>   other
>> >>>>>      >>>>   >     queue.  Operations on it wait on and signal
>> >>>>>semaphores. If we
>> >>>>>      >>>>   have a
>> >>>>>      >>>>   >     VM_BIND engine, we'd provide syncobjs to wait and
>> >>>>>      >>>>signal just like
>> >>>>>      >>>>   we do
>> >>>>>      >>>>   >     in execbuf().
>> >>>>>      >>>>   >     The important thing is that we don't want
>> >>>>>one type of
>> >>>>>      >>>>operation to
>> >>>>>      >>>>   block
>> >>>>>      >>>>   >     on the other.  If immediate binds are
>> >>>>>blocking on sparse
>> >>>>>      binds,
>> >>>>>      >>>>   it's
>> >>>>>      >>>>   >     going to cause over-synchronization issues.
>> >>>>>      >>>>   >     In terms of the internal implementation, I
>> >>>>>know that
>> >>>>>      >>>>there's going
>> >>>>>      >>>>   to be
>> >>>>>      >>>>   >     a lock on the VM and that we can't actually
>> >>>>>do these
>> >>>>>      things in
>> >>>>>      >>>>   >     parallel.  That's fine. Once the dma_fences have
>> >>>>>      signaled and
>> >>>>>      >>>>   we're
>> >>>>>      >>>>
>> >>>>>      >>>>   That's correct. It is like a single VM_BIND engine with
>> >>>>>      >>>>multiple queues
>> >>>>>      >>>>   feeding to it.
>> >>>>>      >>>>
>> >>>>>      >>>> Right.  As long as the queues themselves are
>> >>>>>independent and
>> >>>>>      >>>>can block on
>> >>>>>      >>>> dma_fences without holding up other queues, I think
>> >>>>>we're fine.
>> >>>>>      >>>>
>> >>>>>      >>>>   >     unblocked to do the bind operation, I don't care if
>> >>>>>      >>>>there's a bit
>> >>>>>      >>>>   of
>> >>>>>      >>>>   >     synchronization due to locking.  That's
>> >>>>>expected.  What
>> >>>>>      >>>>we can't
>> >>>>>      >>>>   afford
>> >>>>>      >>>>   >     to have is an immediate bind operation
>> >>>>>suddenly blocking
>> >>>>>      on a
>> >>>>>      >>>>   sparse
>> >>>>>      >>>>   >     operation which is blocked on a compute job
>> >>>>>that's going
>> >>>>>      to run
>> >>>>>      >>>>   for
>> >>>>>      >>>>   >     another 5ms.
>> >>>>>      >>>>
>> >>>>>      >>>>   As the VM_BIND queue is per VM, VM_BIND on one VM
>> >>>>>doesn't block
>> >>>>>      the
>> >>>>>      >>>>   VM_BIND
>> >>>>>      >>>>   on other VMs. I am not sure about use cases here, but just
>> >>>>>      wanted to
>> >>>>>      >>>>   clarify.
>> >>>>>      >>>>
>> >>>>>      >>>> Yes, that's what I would expect.
>> >>>>>      >>>> --Jason
>> >>>>>      >>>>
>> >>>>>      >>>>   Niranjana
>> >>>>>      >>>>
>> >>>>>      >>>>   >     For reference, Windows solves this by allowing
>> >>>>>      arbitrarily many
>> >>>>>      >>>>   paging
>> >>>>>      >>>>   >     queues (what they call a VM_BIND
>> >>>>>engine/queue).  That
>> >>>>>      >>>>design works
>> >>>>>      >>>>   >     pretty well and solves the problems in
>> >>>>>question. Again, we could
>> >>>>>      >>>>   just
>> >>>>>      >>>>   >     make everything out-of-order and require
>> >>>>>using syncobjs
>> >>>>>      >>>>to order
>> >>>>>      >>>>   things
>> >>>>>      >>>>   >     as userspace wants. That'd be fine too.
>> >>>>>      >>>>   >     One more note while I'm here: danvet said
>> >>>>>something on
>> >>>>>      >>>>IRC about
>> >>>>>      >>>>   VM_BIND
>> >>>>>      >>>>   >     queues waiting for syncobjs to
>> >>>>>materialize.  We don't
>> >>>>>      really
>> >>>>>      >>>>   want/need
>> >>>>>      >>>>   >     this.  We already have all the machinery in
>> >>>>>userspace to
>> >>>>>      handle
>> >>>>>      >>>>   >     wait-before-signal and waiting for syncobj
>> >>>>>fences to
>> >>>>>      >>>>materialize
>> >>>>>      >>>>   and
>> >>>>>      >>>>   >     that machinery is on by default.  It would actually
>> >>>>>      >>>>take MORE work
>> >>>>>      >>>>   in
>> >>>>>      >>>>   >     Mesa to turn it off and take advantage of
>> >>>>>the kernel
>> >>>>>      >>>>being able to
>> >>>>>      >>>>   wait
>> >>>>>      >>>>   >     for syncobjs to materialize. Also, getting
>> >>>>>that right is
>> >>>>>      >>>>   ridiculously
>> >>>>>      >>>>   >     hard and I really don't want to get it
>> >>>>>wrong in kernel
>> >>>>>      >>>>space. When we
>> >>>>>      >>>>   >     do memory fences, wait-before-signal will
>> >>>>>be a thing.  We
>> >>>>>      don't
>> >>>>>      >>>>   need to
>> >>>>>      >>>>   >     try and make it a thing for syncobj.
>> >>>>>      >>>>   >     --Jason
>> >>>>>      >>>>   >
>> >>>>>      >>>>   >   Thanks Jason,
>> >>>>>      >>>>   >
>> >>>>>      >>>>   >   I missed the bit in the Vulkan spec that
>> >>>>>we're allowed to
>> >>>>>      have a
>> >>>>>      >>>>   sparse
>> >>>>>      >>>>   >   queue that does not implement either graphics
>> >>>>>or compute
>> >>>>>      >>>>operations
>> >>>>>      >>>>   :
>> >>>>>      >>>>   >
>> >>>>>      >>>>   >     "While some implementations may include
>> >>>>>      >>>>   VK_QUEUE_SPARSE_BINDING_BIT
>> >>>>>      >>>>   >     support in queue families that also include
>> >>>>>      >>>>   >
>> >>>>>      >>>>   >      graphics and compute support, other
>> >>>>>implementations may
>> >>>>>      only
>> >>>>>      >>>>   expose a
>> >>>>>      >>>>   > VK_QUEUE_SPARSE_BINDING_BIT-only queue
>> >>>>>      >>>>   >
>> >>>>>      >>>>   >      family."
>> >>>>>      >>>>   >
>> >>>>>      >>>>   >   So it can all be a vm_bind engine that just does
>> >>>>>      bind/unbind
>> >>>>>      >>>>   >   operations.
>> >>>>>      >>>>   >
>> >>>>>      >>>>   >   But yes we need another engine for the
>> >>>>>immediate/non-sparse
>> >>>>>      >>>>   operations.
>> >>>>>      >>>>   >
>> >>>>>      >>>>   >   -Lionel
>> >>>>>      >>>>   >
>> >>>>>      >>>>   >         >
>> >>>>>      >>>>   >       Daniel, any thoughts?
>> >>>>>      >>>>   >
>> >>>>>      >>>>   >       Niranjana
>> >>>>>      >>>>   >
>> >>>>>      >>>>   >       >Matt
>> >>>>>      >>>>   >       >
>> >>>>>      >>>>   >       >>
>> >>>>>      >>>>   >       >> Sorry I noticed this late.
>> >>>>>      >>>>   >       >>
>> >>>>>      >>>>   >       >>
>> >>>>>      >>>>   >       >> -Lionel
>> >>>>>      >>>>   >       >>
>> >>>>>      >>>>   >       >>
>> >>>
>> >>>
>> >
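
For illustration, a hedged kernel-side sketch of the per-VM bind queues
discussed above (every structure, field and function name here is
hypothetical, not the actual i915 implementation):

#include <linux/err.h>
#include <linux/kernel.h>
#include <linux/types.h>
#include <linux/workqueue.h>

/* Hypothetical per-VM container; the real i915_address_space differs. */
struct vm_bind_queues {
        struct workqueue_struct *wq[8];   /* "n=3 meaning 8 queues" */
};

static struct workqueue_struct *
vm_bind_get_queue(struct vm_bind_queues *q, u32 queue_idx)
{
        if (queue_idx >= ARRAY_SIZE(q->wq))
                return ERR_PTR(-EINVAL);

        /*
         * Each entry is an ordered workqueue, created upon first use:
         * binds submitted to the same queue complete in submission
         * order, while a bind waiting on its in-fence only stalls
         * later binds on this queue of this VM.
         */
        if (!q->wq[queue_idx])
                q->wq[queue_idx] = alloc_ordered_workqueue("vm_bind_q%u",
                                                           0, queue_idx);

        return q->wq[queue_idx];
}

Because each VM owns its own queue array, two VMs can never end up
sharing a queue, which avoids the cross-VM blocking scenario described
in the quoted exchange.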

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [Intel-gfx] [RFC v3 1/3] drm/doc/rfc: VM_BIND feature design document
  2022-06-13 18:02                                           ` Niranjana Vishwanathapura
  (?)
@ 2022-06-14  7:04                                           ` Lionel Landwerlin
  2022-06-14 17:01                                               ` Niranjana Vishwanathapura
  -1 siblings, 1 reply; 121+ messages in thread
From: Lionel Landwerlin @ 2022-06-14  7:04 UTC (permalink / raw)
  To: Niranjana Vishwanathapura, Zeng, Oak
  Cc: Intel GFX, Wilson, Chris P, Hellstrom, Thomas,
	Maling list - DRI developers, Vetter, Daniel,
	Christian König

On 13/06/2022 21:02, Niranjana Vishwanathapura wrote:
> On Mon, Jun 13, 2022 at 06:33:07AM -0700, Zeng, Oak wrote:
>>
>>
>> Regards,
>> Oak
>>
>>> -----Original Message-----
>>> From: Intel-gfx <intel-gfx-bounces@lists.freedesktop.org> On Behalf 
>>> Of Niranjana
>>> Vishwanathapura
>>> Sent: June 10, 2022 1:43 PM
>>> To: Landwerlin, Lionel G <lionel.g.landwerlin@intel.com>
>>> Cc: Intel GFX <intel-gfx@lists.freedesktop.org>; Maling list - DRI 
>>> developers <dri-
>>> devel@lists.freedesktop.org>; Hellstrom, Thomas 
>>> <thomas.hellstrom@intel.com>;
>>> Wilson, Chris P <chris.p.wilson@intel.com>; Vetter, Daniel
>>> <daniel.vetter@intel.com>; Christian König <christian.koenig@amd.com>
>>> Subject: Re: [Intel-gfx] [RFC v3 1/3] drm/doc/rfc: VM_BIND feature 
>>> design
>>> document
>>>
>>> On Fri, Jun 10, 2022 at 11:18:14AM +0300, Lionel Landwerlin wrote:
>>> >On 10/06/2022 10:54, Niranjana Vishwanathapura wrote:
>>> >>On Fri, Jun 10, 2022 at 09:53:24AM +0300, Lionel Landwerlin wrote:
>>> >>>On 09/06/2022 22:31, Niranjana Vishwanathapura wrote:
>>> >>>>On Thu, Jun 09, 2022 at 05:49:09PM +0300, Lionel Landwerlin wrote:
>>> >>>>>  On 09/06/2022 00:55, Jason Ekstrand wrote:
>>> >>>>>
>>> >>>>>    On Wed, Jun 8, 2022 at 4:44 PM Niranjana Vishwanathapura
>>> >>>>> <niranjana.vishwanathapura@intel.com> wrote:
>>> >>>>>
>>> >>>>>      On Wed, Jun 08, 2022 at 08:33:25AM +0100, Tvrtko Ursulin 
>>> wrote:
>>> >>>>>      >
>>> >>>>>      >
>>> >>>>>      >On 07/06/2022 22:32, Niranjana Vishwanathapura wrote:
>>> >>>>>      >>On Tue, Jun 07, 2022 at 11:18:11AM -0700, Niranjana
>>> >>>>>Vishwanathapura
>>> >>>>>      wrote:
>>> >>>>>      >>>On Tue, Jun 07, 2022 at 12:12:03PM -0500, Jason
>>> >>>>>Ekstrand wrote:
>>> >>>>>      >>>> On Fri, Jun 3, 2022 at 6:52 PM Niranjana 
>>> Vishwanathapura
>>> >>>>>      >>>> <niranjana.vishwanathapura@intel.com> wrote:
>>> >>>>>      >>>>
>>> >>>>>      >>>>   On Fri, Jun 03, 2022 at 10:20:25AM +0300, Lionel
>>> >>>>>Landwerlin
>>> >>>>>      wrote:
>>> >>>>>      >>>>   >   On 02/06/2022 23:35, Jason Ekstrand wrote:
>>> >>>>>      >>>>   >
>>> >>>>>      >>>>   >     On Thu, Jun 2, 2022 at 3:11 PM Niranjana
>>> >>>>>Vishwanathapura
>>> >>>>>      >>>>   > <niranjana.vishwanathapura@intel.com> wrote:
>>> >>>>>      >>>>   >
>>> >>>>>      >>>>   >       On Wed, Jun 01, 2022 at 01:28:36PM -0700, 
>>> Matthew
>>> >>>>>      >>>>Brost wrote:
>>> >>>>>      >>>>   >       >On Wed, Jun 01, 2022 at 05:25:49PM +0300, 
>>> Lionel
>>> >>>>>      Landwerlin
>>> >>>>>      >>>>   wrote:
>>> >>>>>      >>>>   > >> On 17/05/2022 21:32, Niranjana Vishwanathapura
>>> >>>>>      wrote:
>>> >>>>>      >>>>   > >> > +VM_BIND/UNBIND ioctl will immediately start
>>> >>>>>      >>>>   binding/unbinding
>>> >>>>>      >>>>   >       the mapping in an
>>> >>>>>      >>>>   > >> > +async worker. The binding and
>>> >>>>>unbinding will
>>> >>>>>      >>>>work like a
>>> >>>>>      >>>>   special
>>> >>>>>      >>>>   >       GPU engine.
>>> >>>>>      >>>>   > >> > +The binding and unbinding operations are
>>> >>>>>      serialized and
>>> >>>>>      >>>>   will
>>> >>>>>      >>>>   >       wait on specified
>>> >>>>>      >>>>   > >> > +input fences before the operation
>>> >>>>>and will signal
>>> >>>>>      the
>>> >>>>>      >>>>   output
>>> >>>>>      >>>>   >       fences upon the
>>> >>>>>      >>>>   > >> > +completion of the operation. Due to
>>> >>>>>      serialization,
>>> >>>>>      >>>>   completion of
>>> >>>>>      >>>>   >       an operation
>>> >>>>>      >>>>   > >> > +will also indicate that all
>>> >>>>>previous operations
>>> >>>>>      >>>>are also
>>> >>>>>      >>>>   > complete.
>>> >>>>>      >>>>   > >>
>>> >>>>>      >>>>   > >> I guess we should avoid saying "will
>>> >>>>>immediately
>>> >>>>>      start
>>> >>>>>      >>>>   > binding/unbinding" if
>>> >>>>>      >>>>   > >> there are fences involved.
>>> >>>>>      >>>>   > >>
>>> >>>>>      >>>>   > >> And the fact that it's happening in an async
>>> >>>>>      >>>>worker seem to
>>> >>>>>      >>>>   imply
>>> >>>>>      >>>>   >       it's not
>>> >>>>>      >>>>   > >> immediate.
>>> >>>>>      >>>>   > >>
>>> >>>>>      >>>>   >
>>> >>>>>      >>>>   >       Ok, will fix.
>>> >>>>>      >>>>   >       This was added because in earlier design
>>> >>>>>binding was
>>> >>>>>      deferred
>>> >>>>>      >>>>   until
>>> >>>>>      >>>>   >       next execbuff.
>>> >>>>>      >>>>   >       But now it is non-deferred (immediate in
>>> >>>>>that sense).
>>> >>>>>      >>>>But yah,
>>> >>>>>      >>>>   this is
>>> >>>>>      >>>>   > confusing
>>> >>>>>      >>>>   >       and will fix it.
>>> >>>>>      >>>>   >
>>> >>>>>      >>>>   > >>
>>> >>>>>      >>>>   > >> I have a question on the behavior of the bind
>>> >>>>>      >>>>operation when
>>> >>>>>      >>>>   no
>>> >>>>>      >>>>   >       input fence
>>> >>>>>      >>>>   > >> is provided. Let say I do :
>>> >>>>>      >>>>   > >>
>>> >>>>>      >>>>   > >> VM_BIND (out_fence=fence1)
>>> >>>>>      >>>>   > >>
>>> >>>>>      >>>>   > >> VM_BIND (out_fence=fence2)
>>> >>>>>      >>>>   > >>
>>> >>>>>      >>>>   > >> VM_BIND (out_fence=fence3)
>>> >>>>>      >>>>   > >>
>>> >>>>>      >>>>   > >>
>>> >>>>>      >>>>   > >> In what order are the fences going to
>>> >>>>>be signaled?
>>> >>>>>      >>>>   > >>
>>> >>>>>      >>>>   > >> In the order of VM_BIND ioctls? Or out
>>> >>>>>of order?
>>> >>>>>      >>>>   > >>
>>> >>>>   > >> Because you wrote "serialized" I assume
>>> >>>>>it's : in
>>> >>>>>      order
>>> >>>>>      >>>>   > >>
>>> >>>>>      >>>>   >
>>> >>>>>      >>>>   >       Yes, in the order of VM_BIND/UNBIND
>>> >>>>>ioctls. Note that
>>> >>>>>      >>>>bind and
>>> >>>>>      >>>>   unbind
>>> >>>>>      >>>>   >       will use
>>> >>>>>      >>>>   >       the same queue and hence are ordered.
>>> >>>>>      >>>>   >
>>> >>>>>      >>>>   > >>
>>> >>>>>      >>>>   > >> One thing I didn't realize is that
>>> >>>>>because we only
>>> >>>>>      get one
>>> >>>>>      >>>>   > "VM_BIND" engine,
>>> >>>>>      >>>>   > >> there is a disconnect from the Vulkan
>>> >>>>>specification.
>>> >>>>>      >>>>   > >>
>>> >>>>>      >>>>   > >> In Vulkan VM_BIND operations are
>>> >>>>>serialized but
>>> >>>>>      >>>>per engine.
>>> >>>>>      >>>>   > >>
>>> >>>>>      >>>>   > >> So you could have something like this :
>>> >>>>>      >>>>   > >>
>>> >>>>>      >>>>   > >> VM_BIND (engine=rcs0, in_fence=fence1,
>>> >>>>>      out_fence=fence2)
>>> >>>>>      >>>>   > >>
>>> >>>>>      >>>>   > >> VM_BIND (engine=ccs0, in_fence=fence3,
>>> >>>>>      out_fence=fence4)
>>> >>>>>      >>>>   > >>
>>> >>>>>      >>>>   > >>
>>> >>>>>      >>>>   > >> fence1 is not signaled
>>> >>>>>      >>>>   > >>
>>> >>>>>      >>>>   > >> fence3 is signaled
>>> >>>>>      >>>>   > >>
>>> >>>>>      >>>>   > >> So the second VM_BIND will proceed before the
>>> >>>>>      >>>>first VM_BIND.
>>> >>>>>      >>>>   > >>
>>> >>>>>      >>>>   > >>
>>> >>>>>      >>>>   > >> I guess we can deal with that scenario in
>>> >>>>>      >>>>userspace by doing
>>> >>>>>      >>>>   the
>>> >>>>>      >>>>   >       wait
>>> >>>>>      >>>>   > >> ourselves in one thread per engines.
>>> >>>>>      >>>>   > >>
>>> >>>>>      >>>>   > >> But then it makes the VM_BIND input
>>> >>>>>fences useless.
>>> >>>>>      >>>>   > >>
>>> >>>>>      >>>>   > >>
>>> >>>>>      >>>>   > >> Daniel : what do you think? Should be
>>> >>>>>rework this or
>>> >>>>>      just
>>> >>>>>      >>>>   deal with
>>> >>>>>      >>>>   >       wait
>>> >>>>>      >>>>   > >> fences in userspace?
>>> >>>>>      >>>>   > >>
>>> >>>>>      >>>>   >       >
>>> >>>>>      >>>>   >       >My opinion is rework this but make the
>>> >>>>>ordering via
>>> >>>>>      >>>>an engine
>>> >>>>>      >>>>   param
>>> >>>>>      >>>>   > optional.
>>> >>>>>      >>>>   >       >
>>> >>>>>      >>>>   > >e.g. A VM can be configured so all binds
>>> >>>>>are ordered
>>> >>>>>      >>>>within the
>>> >>>>>      >>>>   VM
>>> >>>>>      >>>>   >       >
>>> >>>>>      >>>>   > >e.g. A VM can be configured so all binds
>>> >>>>>accept an
>>> >>>>>      engine
>>> >>>>>      >>>>   argument
>>> >>>>>      >>>>   >       (in
>>> >>>>>      >>>>   > >the case of the i915 likely this is a
>>> >>>>>gem context
>>> >>>>>      >>>>handle) and
>>> >>>>>      >>>>   binds
>>> >>>>>      >>>>   > >ordered with respect to that engine.
>>> >>>>>      >>>>   >       >
>>> >>>>>      >>>>   > >This gives UMDs options as the later
>>> >>>>>likely consumes
>>> >>>>>      >>>>more KMD
>>> >>>>>      >>>>   > resources
>>> >>>>>      >>>>   >       >so if a different UMD can live with binds 
>>> being
>>> >>>>>      >>>>ordered within
>>> >>>>>      >>>>   the VM
>>> >>>>>      >>>>   > >they can use a mode consuming less resources.
>>> >>>>>      >>>>   >       >
>>> >>>>>      >>>>   >
>>> >>>>>      >>>>   >       I think we need to be careful here if we
>>> >>>>>are looking
>>> >>>>>      for some
>>> >>>>>      >>>>   out of
>>> >>>>>      >>>>   > (submission) order completion of vm_bind/unbind.
>>> >>>>>      >>>>   > In-order completion means, in a batch of
>>> >>>>>binds and
>>> >>>>>      >>>>unbinds to be
>>> >>>>>      >>>>   > completed in-order, user only needs to specify
>>> >>>>>      >>>>in-fence for the
>>> >>>>   >       first bind/unbind call and the out-fence
>>> >>>>>for the last
>>> >>>>>      >>>>   bind/unbind
>>> >>>>>      >>>>   >       call. Also, the VA released by an unbind
>>> >>>>>call can be
>>> >>>>>      >>>>re-used by
>>> >>>>>      >>>>   >       any subsequent bind call in that in-order 
>>> batch.
>>> >>>>>      >>>>   >
>>> >>>>>      >>>>   >       These things will break if
>>> >>>>>binding/unbinding were to
>>> >>>>>      >>>>be allowed
>>> >>>>>      >>>>   to
>>> >>>>>      >>>>   >       go out of order (of submission) and user
>>> >>>>>need to be
>>> >>>>>      extra
>>> >>>>>      >>>>   careful
>>> >>>>   >       not to run into premature triggering of
>>> >>>>>out-fence and
>>> >>>>>      bind
>>> >>>>>      >>>>   failing
>>> >>>>>      >>>>   >       as VA is still in use etc.
>>> >>>>>      >>>>   >
>>> >>>>>      >>>>   >       Also, VM_BIND binds the provided mapping 
>>> on the
>>> >>>>>      specified
>>> >>>>>      >>>>   address
>>> >>>>>      >>>>   >       space
>>> >>>>>      >>>>   >       (VM). So, the uapi is not engine/context
>>> >>>>>specific.
>>> >>>>>      >>>>   >
>>> >>>>>      >>>>   >       We can however add a 'queue' to the uapi
>>> >>>>>which can be
>>> >>>>>      >>>>one from
>>> >>>>>      >>>>   the
>>> >>>>>      >>>>   > pre-defined queues,
>>> >>>>>      >>>>   > I915_VM_BIND_QUEUE_0
>>> >>>>>      >>>>   > I915_VM_BIND_QUEUE_1
>>> >>>>>      >>>>   >       ...
>>> >>>>>      >>>>   > I915_VM_BIND_QUEUE_(N-1)
>>> >>>>>      >>>>   >
>>> >>>>>      >>>>   >       KMD will spawn an async work queue for
>>> >>>>>each queue which
>>> >>>>>      will
>>> >>>>>      >>>>   only
>>> >>>>>      >>>>   >       bind the mappings on that queue in the 
>>> order of
>>> >>>>>      submission.
>>> >>>>>      >>>>   >       User can assign the queue to per engine
>>> >>>>>or anything
>>> >>>>>      >>>>like that.
>>> >>>>>      >>>>   >
>>> >>>>>      >>>>   >       But again here, user need to be careful 
>>> and not
>>> >>>>>      >>>>deadlock these
>>> >>>>>      >>>>   >       queues with circular dependency of fences.
>>> >>>>>      >>>>   >
>>> >>>>>      >>>>   >       I prefer adding this later an as
>>> >>>>>extension based on
>>> >>>>>      >>>>whether it
>>> >>>>>      >>>>   >       is really helping with the implementation.
>>> >>>>>      >>>>   >
>>> >>>>>      >>>>   >     I can tell you right now that having
>>> >>>>>everything on a
>>> >>>>>      single
>>> >>>>>      >>>>   in-order
>>> >>>>>      >>>>   >     queue will not get us the perf we want.
>>> >>>>>What vulkan
>>> >>>>>      >>>>really wants
>>> >>>>>      >>>>   is one
>>> >>>>>      >>>>   >     of two things:
>>> >>>>>      >>>>   >      1. No implicit ordering of VM_BIND ops.  
>>> They just
>>> >>>>>      happen in
>>> >>>>>      >>>>   whatever
>>> >>>>>      >>>>   >     their dependencies are resolved and we
>>> >>>>>ensure ordering
>>> >>>>>      >>>>ourselves
>>> >>>>>      >>>>   by
>>> >>>>>      >>>>   >     having a syncobj in the VkQueue.
>>> >>>>>      >>>>   >      2. The ability to create multiple VM_BIND
>>> >>>>>queues.  We
>>> >>>>>      need at
>>> >>>>>      >>>>   least 2
>>> >>>>>      >>>>   >     but I don't see why there needs to be a
>>> >>>>>limit besides
>>> >>>>>      >>>>the limits
>>> >>>>>      >>>>   the
>>> >>>>>      >>>>   >     i915 API already has on the number of
>>> >>>>>engines.  Vulkan
>>> >>>>>      could
>>> >>>>>      >>>>   expose
>>> >>>>>      >>>>   >     multiple sparse binding queues to the
>>> >>>>>client if it's not
>>> >>>>>      >>>>   arbitrarily
>>> >>>>>      >>>>   >     limited.
>>> >>>>>      >>>>
>>> >>>>>      >>>>   Thanks Jason, Lionel.
>>> >>>>>      >>>>
>>> >>>>>      >>>>   Jason, what are you referring to when you say
>>> >>>>>"limits the i915
>>> >>>>>      API
>>> >>>>>      >>>>   already
>>> >>>>>      >>>>   has on the number of engines"? I am not sure if
>>> >>>>>there is such
>>> >>>>>      an uapi
>>> >>>>>      >>>>   today.
>>> >>>>>      >>>>
>>> >>>>>      >>>> There's a limit of something like 64 total engines
>>> >>>>>today based on
>>> >>>>>      the
>>> >>>>>      >>>> number of bits we can cram into the exec flags in
>>> >>>>>execbuffer2.  I
>>> >>>>>      think
>>> >>>>>      >>>> someone had an extended version that allowed more
>>> >>>>>but I ripped it
>>> >>>>>      out
>>> >>>>>      >>>> because no one was using it.  Of course,
>>> >>>>>execbuffer3 might not
>>> >>>>>      >>>>have that
>>> >>>>>      >>>> problem at all.
>>> >>>>>      >>>>
>>> >>>>>      >>>
>>> >>>>>      >>>Thanks Jason.
>>> >>>>>      >>>Ok, I am not sure which exec flag is that, but yah,
>>> >>>>>execbuffer3
>>> >>>>>      probably
>>> >>>>>      >>>will not have this limitation. So, we need to define a
>>> >>>>>      VM_BIND_MAX_QUEUE
>>> >>>>>      >>>and somehow export it to user (I am thinking of
>>> >>>>>embedding it in
>>> >>>>>      >>>I915_PARAM_HAS_VM_BIND. bits[0]->HAS_VM_BIND, 
>>> bits[1-3]->'n'
>>> >>>>>      meaning 2^n
>>> >>>>>      >>>queues.
>>> >>>>>      >>
>>> >>>>>      >>Ah, I think you are waking about I915_EXEC_RING_MASK
>>> >>>>>(0x3f) which
>>> >>>>>      execbuf3
>>> >>>>>
>>> >>>>>    Yup!  That's exactly the limit I was talking about.
>>> >>>>>
>>> >>>>>      >>will also have. So, we can simply define in vm_bind/unbind
>>> >>>>>      structures,
>>> >>>>>      >>
>>> >>>>>      >>#define I915_VM_BIND_MAX_QUEUE   64
>>> >>>>>      >>        __u32 queue;
>>> >>>>>      >>
>>> >>>>>      >>I think that will keep things simple.
>>> >>>>>      >
>>> >>>>>      >Hmmm? What does the execbuf2 limit have to do with how many 
>>> engines
>>> >>>>>      >hardware can have? I suggest not to do that.
>>> >>>>>      >
>>> >>>>>      >Change which added this:
>>> >>>>>      >
>>> >>>>>      >       if (set.num_engines > I915_EXEC_RING_MASK + 1)
>>> >>>>>      >               return -EINVAL;
>>> >>>>>      >
>>> >>>>>      >To context creation needs to be undone and so let users
>>> >>>>>create engine
>>> >>>>>      >maps with all hardware engines, and let execbuf3 access
>>> >>>>>them all.
>>> >>>>>      >
>>> >>>>>
>>> >>>>>      The earlier plan was to carry I915_EXEC_RING_MASK (0x3f) to
>>> >>>>>execbuf3 also.
>>> >>>>>      Hence, I was using the same limit for VM_BIND queues
>>> >>>>>(64, or 65 if we
>>> >>>>>      make it N+1).
>>> >>>>>      But, as discussed in other thread of this RFC series, we
>>> >>>>>are planning
>>> >>>>>      to drop this I915_EXEC_RING_MASK in execbuf3. So, there 
>>> won't be
>>> >>>>>      any uapi that limits the number of engines (and hence
>>> >>>>>the vm_bind
>>> >>>>>      queues
>>> >>>>>      need to be supported).
>>> >>>>>
>>> >>>>>      If we leave the number of vm_bind queues to be 
>>> arbitrarily large
>>> >>>>>      (__u32 queue_idx), then we need to have a hashmap for
>>> >>>>>queue (a wq,
>>> >>>>>      work_item and a linked list) lookup from the user
>>> >>>>>specified queue
>>> >>>>>      index.
>>> >>>>>      The other option is to just put some hard limit (say 64 or
>>> >>>>>65) and use
>>> >>>>>      an array of queues in VM (each created upon first use).
>>> >>>>>I prefer this.
>>> >>>>>
>>> >>>>>    I don't get why a VM_BIND queue is any different from any
>>> >>>>>other queue or
>>> >>>>>    userspace-visible kernel object.  But I'll leave those
>>> >>>>>details up to
>>> >>>>>    danvet or whoever else might be reviewing the implementation.
>>> >>>>>    --Jason
>>> >>>>>
>>> >>>>>  I kind of agree here. Wouldn't be simpler to have the bind
>>> >>>>>queue created
>>> >>>>>  like the others when we build the engine map?
>>> >>>>>
>>> >>>>>  For userspace it's then just matter of selecting the right
>>> >>>>>queue ID when
>>> >>>>>  submitting.
>>> >>>>>
>>> >>>>>  If there is ever a possibility to have this work on the GPU,
>>> >>>>>it would be
>>> >>>>>  all ready.
>>> >>>>>
>>> >>>>
>>> >>>>I did sync offline with Matt Brost on this.
>>> >>>>We can add a VM_BIND engine class and let user create VM_BIND
>>> >>>>engines (queues).
>>> >>>>>The problem is, in i915 the engine creation interface is bound to
>>> >>>>gem_context.
>>> >>>>So, in vm_bind ioctl, we would need both context_id and
>>> >>>>queue_idx for proper
>>> >>>>>lookup of the user created engine. This is a bit awkward as 
>>> vm_bind is an
>>> >>>>interface to VM (address space) and has nothing to do with 
>>> gem_context.
>>> >>>
>>> >>>
>>> >>>A gem_context has a single vm object right?
>>> >>>
>>> >>>Set through I915_CONTEXT_PARAM_VM at creation or given a default
>>> >>>one if not.
>>> >>>
>>> >>>So it's just like picking up the vm like it's done at execbuffer
>>> >>>time right now : eb->context->vm
>>> >>>
>>> >>
>>> >>Are you suggesting replacing 'vm_id' with 'context_id' in the
>>> >>VM_BIND/UNBIND
>>> >>ioctl and probably call it CONTEXT_BIND/UNBIND, because VM can be
>>> >>obtained
>>> >>from the context?
>>> >
>>> >
>>> >Yes, because if we go for engines, they're associated with a context
>>> >and so also associated with the VM bound to the context.
>>> >
>>>
>>> Hmm...context doesn't sound like the right interface. It should be
>>> VM and engine (independent of context). Engine can be virtual or soft
>>> engine (kernel thread), each with its own queue. We can add an 
>>> interface
>>> to create such engines (independent of context). But we are anyway
>>> implicitly creating it when user uses a new queue_idx. If in future
>>> we have hardware engines for VM_BIND operation, we can have that
>>> explicit interface to create engine instances and the queue_index
>>> in vm_bind/unbind will point to those engines.
>>> Anyone has any thoughts? Daniel?
>>
>> Exposing gem_context or intel_context to user space is a strange 
>>> >> concept to me. A context represents some hw resources that are used
>>> >> to complete a certain task. User space should only care about
>>> >> allocating some resources (memory, queues) and submitting tasks to
>>> >> queues. But user space doesn't care how a certain task is mapped to
>>> >> a HW context - the driver/guc should take care of this.
>>
>>> >> So a cleaner interface to me is: user space creates a vm, creates a
>>> >> gem object and vm_binds it to the vm; allocates queues (internally
>>> >> representing compute or blitter HW; a queue can be virtual to the
>>> >> user) for this vm; submits tasks to queues. User can create multiple
>>> >> queues under one vm. One queue belongs to only one vm.
>>
>>> >> The i915 driver/guc manages the hw compute or blitter resources,
>>> >> which are transparent to user space. When i915 or guc decides to
>>> >> schedule a queue (run tasks on that queue), a HW engine will be
>>> >> picked up and set up properly for the vm of that queue (i.e.,
>>> >> switched to the page tables of that vm) - this is a context switch.
>>
>>> >> From the vm_bind perspective, it simply binds a gem_object to a vm.
>>> >> Engine/queue is not a parameter to vm_bind, as any engine can be
>>> >> picked up by i915/guc to execute a task using the vm-bound va.
>>
>>> >> I didn't completely follow the discussion here. Just sharing some 
>> thoughts.
>>
>
> Yah, I agree.
>
> Lionel,
> How about we define the queue as
> union {
>        __u32 queue_idx;
>        __u64 rsvd;
> }
>
> If required, we can extend by expanding the 'rsvd' field to <ctx_id, 
> queue_idx> later
> with a flag.
>
> Niranjana


I did not really understand Oak's comment nor what you're suggesting 
here to be honest.


First the GEM context is already exposed to userspace. It's explicitly 
created by userspace with DRM_IOCTL_I915_GEM_CONTEXT_CREATE.

We give the GEM context id in every execbuffer we do with 
drm_i915_gem_execbuffer2::rsvd1.
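
For instance (a minimal sketch ; drmIoctl() is just the usual libdrm
wrapper) :

    struct drm_i915_gem_execbuffer2 execbuf = {};
    /* buffers_ptr, buffer_count, batch_len, etc. filled in as usual */

    i915_execbuffer2_set_context_id(execbuf, ctx_id); /* stores it in rsvd1 */
    drmIoctl(fd, DRM_IOCTL_I915_GEM_EXECBUFFER2, &execbuf);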

It's still in the new execbuffer3 proposal being discussed.


Second, the GEM context is also where we set the VM with 
I915_CONTEXT_PARAM_VM.
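
For illustration, that flow today looks roughly like this (an untested
sketch against the current uapi ; error handling and the
<sys/ioctl.h>/<drm/i915_drm.h> includes omitted) :

    struct drm_i915_gem_vm_control vm = {};

    ioctl(fd, DRM_IOCTL_I915_GEM_VM_CREATE, &vm); /* vm.vm_id is the new VM */

    struct drm_i915_gem_context_create_ext_setparam setparam = {
        .base  = { .name = I915_CONTEXT_CREATE_EXT_SETPARAM },
        .param = { .param = I915_CONTEXT_PARAM_VM, .value = vm.vm_id },
    };
    struct drm_i915_gem_context_create_ext create = {
        .flags = I915_CONTEXT_CREATE_FLAGS_USE_EXTENSIONS,
        .extensions = (uintptr_t)&setparam,
    };

    ioctl(fd, DRM_IOCTL_I915_GEM_CONTEXT_CREATE_EXT, &create);
    /* create.ctx_id is the GEM context now tied to vm.vm_id */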


Third, the GEM context also has the list of engines with 
I915_CONTEXT_PARAM_ENGINES.
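
And the engine list is set the same way (continuing the sketch above ;
a new VM_BIND engine class, if we added one, would presumably slot into
this same array) :

    I915_DEFINE_CONTEXT_PARAM_ENGINES(engines, 2) = {
        .engines = {
            { .engine_class = I915_ENGINE_CLASS_RENDER, .engine_instance = 0 },
            { .engine_class = I915_ENGINE_CLASS_COPY,   .engine_instance = 0 },
        },
    };
    struct drm_i915_gem_context_param param = {
        .ctx_id = create.ctx_id,
        .param  = I915_CONTEXT_PARAM_ENGINES,
        .size   = sizeof(engines),
        .value  = (uintptr_t)&engines,
    };

    ioctl(fd, DRM_IOCTL_I915_GEM_CONTEXT_SETPARAM, &param);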


So it makes sense to me to dispatch the vm_bind operation to a GEM
context, to a given vm_bind queue (as sketched below), because it's got
all the information required :

     - the list of new vm_bind queues

     - the vm that is going to be modified
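
Concretely the ioctl could then identify its target like this
(hypothetical, just to make the dispatch explicit ; the remaining
fields don't matter here) :

    struct drm_i915_gem_vm_bind {
        __u32 ctx_id;    /* GEM context holding the vm_bind queues and the VM */
        __u32 queue_idx; /* which vm_bind queue of that context */
        /* ... object handle, start, offset, length, flags, fences ... */
    };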


Otherwise where do the vm_bind queues live?

In the i915/drm fd object?

That would mean that all the GEM contexts are sharing the same vm_bind 
queues.


intel_context or GuC are internal details we're not concerned about.

I don't really see the connection with the GEM context.


Maybe Oak has a different use case than Vulkan.


-Lionel


>
>> Regards,
>> Oak
>>
>>>
>>> Niranjana
>>>
>>> >
>>> >>I think the interface is clean as an interface to the VM. It is only 
>>> that we
>>> >>don't have a clean way to create a raw VM_BIND engine (not
>>> >>associated with
>>> >>any context) with i915 uapi.
>>> >>May be we can add such an interface, but I don't think that is 
>>> worth it
>>> >>(we might as well just use a queue_idx in VM_BIND/UNBIND ioctl as I
>>> >>mentioned
>>> >>above).
>>> >>Anyone has any thoughts?
>>> >>
>>> >>>
>>> >>>>Another problem is, if two VMs are binding with the same defined
>>> >>>>engine,
>>> >>>>binding on VM1 can get unnecessarily blocked by binding on VM2
>>> >>>>(which may be
>>> >>>>waiting on its in_fence).
>>> >>>
>>> >>>
>>> >>>Maybe I'm missing something, but how can you have 2 vm objects
>>> >>>with a single gem_context right now?
>>> >>>
>>> >>
>>> >>No, we don't have 2 VMs for a gem_context.
>>> >>Say if ctx1 with vm1 and ctx2 with vm2.
>>> >>First vm_bind call was for vm1 with q_idx 1 in ctx1 engine map.
>>> >>Second vm_bind call was for vm2 with q_idx 2 in ctx2 engine map. If
>>> >>those two queue indices point to the same underlying vm_bind engine,
>>> >>then the second vm_bind call gets blocked until the first vm_bind 
>>> call's
>>> >>'in' fence is triggered and bind completes.
>>> >>
>>> >>With per VM queues, this is not a problem as two VMs will not end up
>>> >>sharing the same queue.
>>> >>
>>> >>BTW, I just posted an updated PATCH series.
>>> >>https://www.spinics.net/lists/dri-devel/msg350483.html
>>> >>
>>> >>Niranjana
>>> >>
>>> >>>
>>> >>>>
>>> >>>>So, my preference here is to just add a 'u32 queue' index in
>>> >>>>vm_bind/unbind
>>> >>>>ioctl, and the queues are per VM.
>>> >>>>
>>> >>>>Niranjana
>>> >>>>
>>> >>>>>  Thanks,
>>> >>>>>
>>> >>>>>  -Lionel
>>> >>>>>
>>> >>>>>
>>> >>>>>      Niranjana
>>> >>>>>
>>> >>>>>      >Regards,
>>> >>>>>      >
>>> >>>>>      >Tvrtko
>>> >>>>>      >
>>> >>>>>      >>
>>> >>>>>      >>Niranjana
>>> >>>>>      >>
>>> >>>>>      >>>
>>> >>>>>      >>>>   I am trying to see how many queues we need and
>>> >>>>>don't want it to
>>> >>>>>      be
>>> >>>>>      >>>>   arbitrarily
>>> >>>>>      >>>>   large and unduly blow up memory usage and
>>> >>>>>complexity in i915
>>> >>>>>      driver.
>>> >>>>>      >>>>
>>> >>>>>      >>>> I expect a Vulkan driver to use at most 2 in the
>>> >>>>>vast majority
>>> >>>>>      >>>>of cases. I
>>> >>>>>      >>>> could imagine a client wanting to create more than 1 
>>> sparse
>>> >>>>>      >>>>queue in which
>>> >>>>>      >>>> case, it'll be N+1 but that's unlikely. As far as
>>> >>>>>complexity
>>> >>>>>      >>>>goes, once
>>> >>>>>      >>>> you allow two, I don't think the complexity is going 
>>> up by
>>> >>>>>      >>>>allowing N.  As
>>> >>>>>      >>>> for memory usage, creating more queues means more
>>> >>>>>memory.  That's
>>> >>>>>      a
>>> >>>>>      >>>> trade-off that userspace can make. Again, the
>>> >>>>>expected number
>>> >>>>>      >>>>here is 1
>>> >>>>>      >>>> or 2 in the vast majority of cases so I don't think
>>> >>>>>you need to
>>> >>>>>      worry.
>>> >>>>>      >>>
>>> >>>>>      >>>Ok, will start with n=3 meaning 8 queues.
>>> >>>>>      >>>That would require us to create 8 workqueues.
>>> >>>>>      >>>We can change 'n' later if required.
>>> >>>>>      >>>
>>> >>>>>      >>>Niranjana
>>> >>>>>      >>>
>>> >>>>>      >>>>
>>> >>>>>      >>>>   >     Why? Because Vulkan has two basic kind of bind
>>> >>>>>      >>>>operations and we
>>> >>>>>      >>>>   don't
>>> >>>>>      >>>>   >     want any dependencies between them:
>>> >>>>>      >>>>   >      1. Immediate.  These happen right after BO
>>> >>>>>creation or
>>> >>>>>      >>>>maybe as
>>> >>>>>      >>>>   part of
>>> >>>>>      >>>>   > vkBindImageMemory() or VkBindBufferMemory().  These
>>> >>>>>      >>>>don't happen
>>> >>>>>      >>>>   on a
>>> >>>>>      >>>>   >     queue and we don't want them serialized
>>> >>>>>with anything.       To
>>> >>>>>      >>>>   synchronize
>>> >>>>>      >>>>   >     with submit, we'll have a syncobj in the
>>> >>>>>VkDevice which
>>> >>>>>      is
>>> >>>>>      >>>>   signaled by
>>> >>>>>      >>>>   >     all immediate bind operations and make
>>> >>>>>submits wait on
>>> >>>>>      it.
>>> >>>>>      >>>>   >      2. Queued (sparse): These happen on a
>>> >>>>>VkQueue which may
>>> >>>>>      be the
>>> >>>>>      >>>>   same as
>>> >>>>>      >>>>   >     a render/compute queue or may be its own
>>> >>>>>queue.  It's up
>>> >>>>>      to us
>>> >>>>>      >>>>   what we
>>> >>>>>      >>>>   >     want to advertise.  From the Vulkan API
>>> >>>>>PoV, this is like
>>> >>>>>      any
>>> >>>>>      >>>>   other
>>> >>>>>      >>>>   >     queue. Operations on it wait on and signal
>>> >>>>>semaphores.       If we
>>> >>>>>      >>>>   have a
>>> >>>>>      >>>>   >     VM_BIND engine, we'd provide syncobjs to 
>>> wait and
>>> >>>>>      >>>>signal just like
>>> >>>>>      >>>>   we do
>>> >>>>>      >>>>   >     in execbuf().
>>> >>>>>      >>>>   >     The important thing is that we don't want
>>> >>>>>one type of
>>> >>>>>      >>>>operation to
>>> >>>>>      >>>>   block
>>> >>>>>      >>>>   >     on the other.  If immediate binds are
>>> >>>>>blocking on sparse
>>> >>>>>      binds,
>>> >>>>>      >>>>   it's
>>> >>>>>      >>>>   >     going to cause over-synchronization issues.
>>> >>>>>      >>>>   >     In terms of the internal implementation, I
>>> >>>>>know that
>>> >>>>>      >>>>there's going
>>> >>>>>      >>>>   to be
>>> >>>>>      >>>>   >     a lock on the VM and that we can't actually
>>> >>>>>do these
>>> >>>>>      things in
>>> >>>>>      >>>>   > parallel.  That's fine. Once the dma_fences have
>>> >>>>>      signaled and
>>> >>>>>      >>>>   we're
>>> >>>>>      >>>>
>>> >>>>>      >>>>   Thats correct. It is like a single VM_BIND engine 
>>> with
>>> >>>>>      >>>>multiple queues
>>> >>>>>      >>>>   feeding to it.
>>> >>>>>      >>>>
>>> >>>>>      >>>> Right.  As long as the queues themselves are
>>> >>>>>independent and
>>> >>>>>      >>>>can block on
>>> >>>>>      >>>> dma_fences without holding up other queues, I think
>>> >>>>>we're fine.
>>> >>>>>      >>>>
>>> >>>>>      >>>>   > unblocked to do the bind operation, I don't care if
>>> >>>>>      >>>>there's a bit
>>> >>>>>      >>>>   of
>>> >>>>>      >>>>   > synchronization due to locking.  That's
>>> >>>>>expected.  What
>>> >>>>>      >>>>we can't
>>> >>>>>      >>>>   afford
>>> >>>>>      >>>>   >     to have is an immediate bind operation
>>> >>>>>suddenly blocking
>>> >>>>>      on a
>>> >>>>>      >>>>   sparse
>>> >>>>>      >>>>   > operation which is blocked on a compute job
>>> >>>>>that's going
>>> >>>>>      to run
>>> >>>>>      >>>>   for
>>> >>>>>      >>>>   >     another 5ms.
>>> >>>>>      >>>>
>>> >>>>>      >>>>   As the VM_BIND queue is per VM, VM_BIND on one VM
>>> >>>>>doesn't block
>>> >>>>>      the
>>> >>>>>      >>>>   VM_BIND
>>> >>>>>      >>>>   on other VMs. I am not sure about usecases here, 
>>> but just
>>> >>>>>      wanted to
>>> >>>>>      >>>>   clarify.
>>> >>>>>      >>>>
>>> >>>>>      >>>> Yes, that's what I would expect.
>>> >>>>>      >>>> --Jason
>>> >>>>>      >>>>
>>> >>>>>      >>>>   Niranjana
>>> >>>>>      >>>>
>>> >>>>>      >>>>   >     For reference, Windows solves this by allowing
>>> >>>>>      arbitrarily many
>>> >>>>>      >>>>   paging
>>> >>>>>      >>>>   >     queues (what they call a VM_BIND
>>> >>>>>engine/queue).  That
>>> >>>>>      >>>>design works
>>> >>>>>      >>>>   >     pretty well and solves the problems in
>>> >>>>>question. Again, we could
>>> >>>>>      >>>>   just
>>> >>>>>      >>>>   >     make everything out-of-order and require
>>> >>>>>using syncobjs
>>> >>>>>      >>>>to order
>>> >>>>>      >>>>   things
>>> >>>>>      >>>>   >     as userspace wants. That'd be fine too.
>>> >>>>>      >>>>   >     One more note while I'm here: danvet said
>>> >>>>>something on
>>> >>>>>      >>>>IRC about
>>> >>>>>      >>>>   VM_BIND
>>> >>>>>      >>>>   >     queues waiting for syncobjs to
>>> >>>>>materialize.  We don't
>>> >>>>>      really
>>> >>>>>      >>>>   want/need
>>> >>>>>      >>>>   >     this. We already have all the machinery in
>>> >>>>>userspace to
>>> >>>>>      handle
>>> >>>>>      >>>>   > wait-before-signal and waiting for syncobj
>>> >>>>>fences to
>>> >>>>>      >>>>materialize
>>> >>>>>      >>>>   and
>>> >>>>>      >>>>   >     that machinery is on by default.  It would 
>>> actually
>>> >>>>>      >>>>take MORE work
>>> >>>>>      >>>>   in
>>> >>>>>      >>>>   >     Mesa to turn it off and take advantage of
>>> >>>>>the kernel
>>> >>>>>      >>>>being able to
>>> >>>>>      >>>>   wait
>>> >>>>>      >>>>   >     for syncobjs to materialize. Also, getting
>>> >>>>>that right is
>>> >>>>>      >>>>   ridiculously
>>> >>>>>      >>>>   >     hard and I really don't want to get it
>>> >>>>>wrong in kernel
>>> >>>>>space. When we
>>> >>>>>      >>>>   >     do memory fences, wait-before-signal will
>>> >>>>>be a thing.  We
>>> >>>>>      don't
>>> >>>>>      >>>>   need to
>>> >>>>>      >>>>   >     try and make it a thing for syncobj.
>>> >>>>>      >>>>   >     --Jason
>>> >>>>>      >>>>   >
>>> >>>>>      >>>>   >   Thanks Jason,
>>> >>>>>      >>>>   >
>>> >>>>>      >>>>   >   I missed the bit in the Vulkan spec that
>>> >>>>>we're allowed to
>>> >>>>>      have a
>>> >>>>>      >>>>   sparse
>>> >>>>>      >>>>   >   queue that does not implement either graphics
>>> >>>>>or compute
>>> >>>>>      >>>>operations
>>> >>>>>      >>>>   :
>>> >>>>>      >>>>   >
>>> >>>>>      >>>>   >     "While some implementations may include
>>> >>>>>      >>>> VK_QUEUE_SPARSE_BINDING_BIT
>>> >>>>>      >>>>   >     support in queue families that also include
>>> >>>>>      >>>>   >
>>> >>>>>      >>>>   > graphics and compute support, other
>>> >>>>>implementations may
>>> >>>>>      only
>>> >>>>>      >>>>   expose a
>>> >>>>>      >>>>   > VK_QUEUE_SPARSE_BINDING_BIT-only queue
>>> >>>>>      >>>>   >
>>> >>>>>      >>>>   > family."
>>> >>>>>      >>>>   >
>>> >>>>>      >>>>   >   So it can all be a vm_bind engine that 
>>> just does
>>> >>>>>      bind/unbind
>>> >>>>>      >>>>   > operations.
>>> >>>>>      >>>>   >
>>> >>>>>      >>>>   >   But yes we need another engine for the
>>> >>>>>immediate/non-sparse
>>> >>>>>      >>>>   operations.
>>> >>>>>      >>>>   >
>>> >>>>>      >>>>   >   -Lionel
>>> >>>>>      >>>>   >
>>> >>>>>      >>>>   >         >
>>> >>>>>      >>>>   > Daniel, any thoughts?
>>> >>>>>      >>>>   >
>>> >>>>>      >>>>   > Niranjana
>>> >>>>>      >>>>   >
>>> >>>>>      >>>>   > >Matt
>>> >>>>>      >>>>   >       >
>>> >>>>>      >>>>   > >>
>>> >>>>>      >>>>   > >> Sorry I noticed this late.
>>> >>>>>      >>>>   > >>
>>> >>>>>      >>>>   > >>
>>> >>>>>      >>>>   > >> -Lionel
>>> >>>>>      >>>>   > >>
>>> >>>>>      >>>>   > >>
>>> >>>
>>> >>>
>>> >



^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [Intel-gfx] [RFC v3 1/3] drm/doc/rfc: VM_BIND feature design document
  2022-06-14  7:04                                           ` Lionel Landwerlin
@ 2022-06-14 17:01                                               ` Niranjana Vishwanathapura
  0 siblings, 0 replies; 121+ messages in thread
From: Niranjana Vishwanathapura @ 2022-06-14 17:01 UTC (permalink / raw)
  To: Lionel Landwerlin
  Cc: Wilson, Chris P, Intel GFX, Maling list - DRI developers,
	Hellstrom, Thomas, Zeng, Oak, Vetter, Daniel,
	Christian König

On Tue, Jun 14, 2022 at 10:04:00AM +0300, Lionel Landwerlin wrote:
>On 13/06/2022 21:02, Niranjana Vishwanathapura wrote:
>>On Mon, Jun 13, 2022 at 06:33:07AM -0700, Zeng, Oak wrote:
>>>
>>>
>>>Regards,
>>>Oak
>>>
>>>>-----Original Message-----
>>>>From: Intel-gfx <intel-gfx-bounces@lists.freedesktop.org> On 
>>>>Behalf Of Niranjana
>>>>Vishwanathapura
>>>>Sent: June 10, 2022 1:43 PM
>>>>To: Landwerlin, Lionel G <lionel.g.landwerlin@intel.com>
>>>>Cc: Intel GFX <intel-gfx@lists.freedesktop.org>; Maling list - 
>>>>DRI developers <dri-
>>>>devel@lists.freedesktop.org>; Hellstrom, Thomas 
>>>><thomas.hellstrom@intel.com>;
>>>>Wilson, Chris P <chris.p.wilson@intel.com>; Vetter, Daniel
>>>><daniel.vetter@intel.com>; Christian König <christian.koenig@amd.com>
>>>>Subject: Re: [Intel-gfx] [RFC v3 1/3] drm/doc/rfc: VM_BIND 
>>>>feature design
>>>>document
>>>>
>>>>On Fri, Jun 10, 2022 at 11:18:14AM +0300, Lionel Landwerlin wrote:
>>>>>On 10/06/2022 10:54, Niranjana Vishwanathapura wrote:
>>>>>>On Fri, Jun 10, 2022 at 09:53:24AM +0300, Lionel Landwerlin wrote:
>>>>>>>On 09/06/2022 22:31, Niranjana Vishwanathapura wrote:
>>>>>>>>On Thu, Jun 09, 2022 at 05:49:09PM +0300, Lionel Landwerlin wrote:
>>>>>>>>>  On 09/06/2022 00:55, Jason Ekstrand wrote:
>>>>>>>>>
>>>>>>>>>    On Wed, Jun 8, 2022 at 4:44 PM Niranjana Vishwanathapura
>>>>>>>>> <niranjana.vishwanathapura@intel.com> wrote:
>>>>>>>>>
>>>>>>>>>      On Wed, Jun 08, 2022 at 08:33:25AM +0100, Tvrtko 
>>>>Ursulin wrote:
>>>>>>>>>      >
>>>>>>>>>      >
>>>>>>>>>      >On 07/06/2022 22:32, Niranjana Vishwanathapura wrote:
>>>>>>>>>      >>On Tue, Jun 07, 2022 at 11:18:11AM -0700, Niranjana
>>>>>>>>>Vishwanathapura
>>>>>>>>>      wrote:
>>>>>>>>>      >>>On Tue, Jun 07, 2022 at 12:12:03PM -0500, Jason
>>>>>>>>>Ekstrand wrote:
>>>>>>>>>      >>>> On Fri, Jun 3, 2022 at 6:52 PM Niranjana 
>>>>Vishwanathapura
>>>>>>>>>      >>>> <niranjana.vishwanathapura@intel.com> wrote:
>>>>>>>>>      >>>>
>>>>>>>>>      >>>>   On Fri, Jun 03, 2022 at 10:20:25AM +0300, Lionel
>>>>>>>>>Landwerlin
>>>>>>>>>      wrote:
>>>>>>>>>      >>>>   >   On 02/06/2022 23:35, Jason Ekstrand wrote:
>>>>>>>>>      >>>>   >
>>>>>>>>>      >>>>   >     On Thu, Jun 2, 2022 at 3:11 PM Niranjana
>>>>>>>>>Vishwanathapura
>>>>>>>>>      >>>>   > <niranjana.vishwanathapura@intel.com> wrote:
>>>>>>>>>      >>>>   >
>>>>>>>>>      >>>>   >       On Wed, Jun 01, 2022 at 01:28:36PM 
>>>>-0700, Matthew
>>>>>>>>>      >>>>Brost wrote:
>>>>>>>>>      >>>>   >       >On Wed, Jun 01, 2022 at 05:25:49PM 
>>>>+0300, Lionel
>>>>>>>>>      Landwerlin
>>>>>>>>>      >>>>   wrote:
>>>>>>>>>      >>>>   > >> On 17/05/2022 21:32, Niranjana Vishwanathapura
>>>>>>>>>      wrote:
>>>>>>>>>      >>>>   > >> > +VM_BIND/UNBIND ioctl will immediately start
>>>>>>>>>      >>>>   binding/unbinding
>>>>>>>>>      >>>>   >       the mapping in an
>>>>>>>>>      >>>>   > >> > +async worker. The binding and
>>>>>>>>>unbinding will
>>>>>>>>>      >>>>work like a
>>>>>>>>>      >>>>   special
>>>>>>>>>      >>>>   >       GPU engine.
>>>>>>>>>      >>>>   > >> > +The binding and unbinding operations are
>>>>>>>>>      serialized and
>>>>>>>>>      >>>>   will
>>>>>>>>>      >>>>   >       wait on specified
>>>>>>>>>      >>>>   > >> > +input fences before the operation
>>>>>>>>>and will signal
>>>>>>>>>      the
>>>>>>>>>      >>>>   output
>>>>>>>>>      >>>>   >       fences upon the
>>>>>>>>>      >>>>   > >> > +completion of the operation. Due to
>>>>>>>>>      serialization,
>>>>>>>>>      >>>>   completion of
>>>>>>>>>      >>>>   >       an operation
>>>>>>>>>      >>>>   > >> > +will also indicate that all
>>>>>>>>>previous operations
>>>>>>>>>      >>>>are also
>>>>>>>>>      >>>>   > complete.
>>>>>>>>>      >>>>   > >>
>>>>>>>>>      >>>>   > >> I guess we should avoid saying "will
>>>>>>>>>immediately
>>>>>>>>>      start
>>>>>>>>>      >>>>   > binding/unbinding" if
>>>>>>>>>      >>>>   > >> there are fences involved.
>>>>>>>>>      >>>>   > >>
>>>>>>>>>      >>>>   > >> And the fact that it's happening in an async
>>>>>>>>>      >>>>worker seem to
>>>>>>>>>      >>>>   imply
>>>>>>>>>      >>>>   >       it's not
>>>>>>>>>      >>>>   > >> immediate.
>>>>>>>>>      >>>>   > >>
>>>>>>>>>      >>>>   >
>>>>>>>>>      >>>>   >       Ok, will fix.
>>>>>>>>>      >>>>   >       This was added because in earlier design
>>>>>>>>>binding was
>>>>>>>>>      deferred
>>>>>>>>>      >>>>   until
>>>>>>>>>      >>>>   >       next execbuff.
>>>>>>>>>      >>>>   >       But now it is non-deferred (immediate in
>>>>>>>>>that sense).
>>>>>>>>>      >>>>But yah,
>>>>>>>>>      >>>>   this is
>>>>>>>>>      >>>>   > confusing
>>>>>>>>>      >>>>   >       and will fix it.
>>>>>>>>>      >>>>   >
>>>>>>>>>      >>>>   > >>
>>>>>>>>>      >>>>   > >> I have a question on the behavior of the bind
>>>>>>>>>      >>>>operation when
>>>>>>>>>      >>>>   no
>>>>>>>>>      >>>>   >       input fence
>>>>>>>>>      >>>>   > >> is provided. Let's say I do :
>>>>>>>>>      >>>>   > >>
>>>>>>>>>      >>>>   > >> VM_BIND (out_fence=fence1)
>>>>>>>>>      >>>>   > >>
>>>>>>>>>      >>>>   > >> VM_BIND (out_fence=fence2)
>>>>>>>>>      >>>>   > >>
>>>>>>>>>      >>>>   > >> VM_BIND (out_fence=fence3)
>>>>>>>>>      >>>>   > >>
>>>>>>>>>      >>>>   > >>
>>>>>>>>>      >>>>   > >> In what order are the fences going to
>>>>>>>>>be signaled?
>>>>>>>>>      >>>>   > >>
>>>>>>>>>      >>>>   > >> In the order of VM_BIND ioctls? Or out
>>>>>>>>>of order?
>>>>>>>>>      >>>>   > >>
>>>>>>>>>      >>>>   > >> Because you wrote "serialized I assume
>>>>>>>>>it's : in
>>>>>>>>>      order
>>>>>>>>>      >>>>   > >>
>>>>>>>>>      >>>>   >
>>>>>>>>>      >>>>   >       Yes, in the order of VM_BIND/UNBIND
>>>>>>>>>ioctls. Note that
>>>>>>>>>      >>>>bind and
>>>>>>>>>      >>>>   unbind
>>>>>>>>>      >>>>   >       will use
>>>>>>>>>      >>>>   >       the same queue and hence are ordered.
>>>>>>>>>      >>>>   >
>>>>>>>>>      >>>>   > >>
>>>>>>>>>      >>>>   > >> One thing I didn't realize is that
>>>>>>>>>because we only
>>>>>>>>>      get one
>>>>>>>>>      >>>>   > "VM_BIND" engine,
>>>>>>>>>      >>>>   > >> there is a disconnect from the Vulkan
>>>>>>>>>specification.
>>>>>>>>>      >>>>   > >>
>>>>>>>>>      >>>>   > >> In Vulkan VM_BIND operations are
>>>>>>>>>serialized but
>>>>>>>>>      >>>>per engine.
>>>>>>>>>      >>>>   > >>
>>>>>>>>>      >>>>   > >> So you could have something like this :
>>>>>>>>>      >>>>   > >>
>>>>>>>>>      >>>>   > >> VM_BIND (engine=rcs0, in_fence=fence1,
>>>>>>>>>      out_fence=fence2)
>>>>>>>>>      >>>>   > >>
>>>>>>>>>      >>>>   > >> VM_BIND (engine=ccs0, in_fence=fence3,
>>>>>>>>>      out_fence=fence4)
>>>>>>>>>      >>>>   > >>
>>>>>>>>>      >>>>   > >>
>>>>>>>>>      >>>>   > >> fence1 is not signaled
>>>>>>>>>      >>>>   > >>
>>>>>>>>>      >>>>   > >> fence3 is signaled
>>>>>>>>>      >>>>   > >>
>>>>>>>>>      >>>>   > >> So the second VM_BIND will proceed before the
>>>>>>>>>      >>>>first VM_BIND.
>>>>>>>>>      >>>>   > >>
>>>>>>>>>      >>>>   > >>
>>>>>>>>>      >>>>   > >> I guess we can deal with that scenario in
>>>>>>>>>      >>>>userspace by doing
>>>>>>>>>      >>>>   the
>>>>>>>>>      >>>>   >       wait
>>>>>>>>>      >>>>   > >> ourselves in one thread per engines.
>>>>>>>>>      >>>>   > >>
>>>>>>>>>      >>>>   > >> But then it makes the VM_BIND input
>>>>>>>>>fences useless.
>>>>>>>>>      >>>>   > >>
>>>>>>>>>      >>>>   > >>
>>>>>>>>>      >>>>   > >> Daniel : what do you think? Should be
>>>>>>>>>rework this or
>>>>>>>>>      just
>>>>>>>>>      >>>>   deal with
>>>>>>>>>      >>>>   >       wait
>>>>>>>>>      >>>>   > >> fences in userspace?
>>>>>>>>>      >>>>   > >>
>>>>>>>>>      >>>>   >       >
>>>>>>>>>      >>>>   >       >My opinion is rework this but make the
>>>>>>>>>ordering via
>>>>>>>>>      >>>>an engine
>>>>>>>>>      >>>>   param
>>>>>>>>>      >>>>   > optional.
>>>>>>>>>      >>>>   >       >
>>>>>>>>>      >>>>   > >e.g. A VM can be configured so all binds
>>>>>>>>>are ordered
>>>>>>>>>      >>>>within the
>>>>>>>>>      >>>>   VM
>>>>>>>>>      >>>>   >       >
>>>>>>>>>      >>>>   > >e.g. A VM can be configured so all binds
>>>>>>>>>accept an
>>>>>>>>>      engine
>>>>>>>>>      >>>>   argument
>>>>>>>>>      >>>>   >       (in
>>>>>>>>>      >>>>   > >the case of the i915 likely this is a
>>>>>>>>>gem context
>>>>>>>>>      >>>>handle) and
>>>>>>>>>      >>>>   binds
>>>>>>>>>      >>>>   > >ordered with respect to that engine.
>>>>>>>>>      >>>>   >       >
>>>>>>>>>      >>>>   > >This gives UMDs options as the later
>>>>>>>>>likely consumes
>>>>>>>>>      >>>>more KMD
>>>>>>>>>      >>>>   > resources
>>>>>>>>>      >>>>   >       >so if a different UMD can live with 
>>>>binds being
>>>>>>>>>      >>>>ordered within
>>>>>>>>>      >>>>   the VM
>>>>>>>>>      >>>>   > >they can use a mode consuming less resources.
>>>>>>>>>      >>>>   >       >
>>>>>>>>>      >>>>   >
>>>>>>>>>      >>>>   >       I think we need to be careful here if we
>>>>>>>>>are looking
>>>>>>>>>      for some
>>>>>>>>>      >>>>   out of
>>>>>>>>>      >>>>   > (submission) order completion of vm_bind/unbind.
>>>>>>>>>      >>>>   > In-order completion means, in a batch of
>>>>>>>>>binds and
>>>>>>>>>      >>>>unbinds to be
>>>>>>>>>      >>>>   > completed in-order, user only needs to specify
>>>>>>>>>      >>>>in-fence for the
>>>>>>>>>      >>>>   >       first bind/unbind call and the out-fence
>>>>>>>>>for the last
>>>>>>>>>      >>>>   bind/unbind
>>>>>>>>>      >>>>   >       call. Also, the VA released by an unbind
>>>>>>>>>call can be
>>>>>>>>>      >>>>re-used by
>>>>>>>>>      >>>>   >       any subsequent bind call in that 
>>>>in-order batch.
>>>>>>>>>      >>>>   >
>>>>>>>>>      >>>>   >       These things will break if
>>>>>>>>>binding/unbinding were to
>>>>>>>>>      >>>>be allowed
>>>>>>>>>      >>>>   to
>>>>>>>>>      >>>>   >       go out of order (of submission) and user
>>>>>>>>>need to be
>>>>>>>>>      extra
>>>>>>>>>      >>>>   careful
>>>>>>>>>      >>>>   >       not to run into premature triggering of
>>>>>>>>>out-fence and
>>>>>>>>>      bind
>>>>>>>>>      >>>>   failing
>>>>>>>>>      >>>>   >       as VA is still in use etc.
>>>>>>>>>      >>>>   >
>>>>>>>>>      >>>>   >       Also, VM_BIND binds the provided 
>>>>mapping on the
>>>>>>>>>      specified
>>>>>>>>>      >>>>   address
>>>>>>>>>      >>>>   >       space
>>>>>>>>>      >>>>   >       (VM). So, the uapi is not engine/context
>>>>>>>>>specific.
>>>>>>>>>      >>>>   >
>>>>>>>>>      >>>>   >       We can however add a 'queue' to the uapi
>>>>>>>>>which can be
>>>>>>>>>      >>>>one from
>>>>>>>>>      >>>>   the
>>>>>>>>>      >>>>   > pre-defined queues,
>>>>>>>>>      >>>>   > I915_VM_BIND_QUEUE_0
>>>>>>>>>      >>>>   > I915_VM_BIND_QUEUE_1
>>>>>>>>>      >>>>   >       ...
>>>>>>>>>      >>>>   > I915_VM_BIND_QUEUE_(N-1)
>>>>>>>>>      >>>>   >
>>>>>>>>>      >>>>   >       KMD will spawn an async work queue for
>>>>>>>>>each queue which
>>>>>>>>>      will
>>>>>>>>>      >>>>   only
>>>>>>>>>      >>>>   >       bind the mappings on that queue in the 
>>>>order of
>>>>>>>>>      submission.
>>>>>>>>>      >>>>   >       User can assign the queue to per engine
>>>>>>>>>or anything
>>>>>>>>>      >>>>like that.
>>>>>>>>>      >>>>   >
>>>>>>>>>      >>>>   >       But again here, user need to be 
>>>>careful and not
>>>>>>>>>      >>>>deadlock these
>>>>>>>>>      >>>>   >       queues with circular dependency of fences.
>>>>>>>>>      >>>>   >
>>>>>>>>>      >>>>   >       I prefer adding this later an as
>>>>>>>>>extension based on
>>>>>>>>>      >>>>whether it
>>>>>>>>>      >>>>   >       is really helping with the implementation.
>>>>>>>>>      >>>>   >
>>>>>>>>>      >>>>   >     I can tell you right now that having
>>>>>>>>>everything on a
>>>>>>>>>      single
>>>>>>>>>      >>>>   in-order
>>>>>>>>>      >>>>   >     queue will not get us the perf we want.
>>>>>>>>>What vulkan
>>>>>>>>>      >>>>really wants
>>>>>>>>>      >>>>   is one
>>>>>>>>>      >>>>   >     of two things:
>>>>>>>>>      >>>>   >      1. No implicit ordering of VM_BIND 
>>>>ops.  They just
>>>>>>>>>      happen in
>>>>>>>>>      >>>>   whatever
>>>>>>>>>      >>>>   >     their dependencies are resolved and we
>>>>>>>>>ensure ordering
>>>>>>>>>      >>>>ourselves
>>>>>>>>>      >>>>   by
>>>>>>>>>      >>>>   >     having a syncobj in the VkQueue.
>>>>>>>>>      >>>>   >      2. The ability to create multiple VM_BIND
>>>>>>>>>queues.  We
>>>>>>>>>      need at
>>>>>>>>>      >>>>   least 2
>>>>>>>>>      >>>>   >     but I don't see why there needs to be a
>>>>>>>>>limit besides
>>>>>>>>>      >>>>the limits
>>>>>>>>>      >>>>   the
>>>>>>>>>      >>>>   >     i915 API already has on the number of
>>>>>>>>>engines.  Vulkan
>>>>>>>>>      could
>>>>>>>>>      >>>>   expose
>>>>>>>>>      >>>>   >     multiple sparse binding queues to the
>>>>>>>>>client if it's not
>>>>>>>>>      >>>>   arbitrarily
>>>>>>>>>      >>>>   >     limited.
>>>>>>>>>      >>>>
>>>>>>>>>      >>>>   Thanks Jason, Lionel.
>>>>>>>>>      >>>>
>>>>>>>>>      >>>>   Jason, what are you referring to when you say
>>>>>>>>>"limits the i915
>>>>>>>>>      API
>>>>>>>>>      >>>>   already
>>>>>>>>>      >>>>   has on the number of engines"? I am not sure if
>>>>>>>>>there is such
>>>>>>>>>      an uapi
>>>>>>>>>      >>>>   today.
>>>>>>>>>      >>>>
>>>>>>>>>      >>>> There's a limit of something like 64 total engines
>>>>>>>>>today based on
>>>>>>>>>      the
>>>>>>>>>      >>>> number of bits we can cram into the exec flags in
>>>>>>>>>execbuffer2.  I
>>>>>>>>>      think
>>>>>>>>>      >>>> someone had an extended version that allowed more
>>>>>>>>>but I ripped it
>>>>>>>>>      out
>>>>>>>>>      >>>> because no one was using it.  Of course,
>>>>>>>>>execbuffer3 might not
>>>>>>>>>      >>>>have that
>>>>>>>>>      >>>> problem at all.
>>>>>>>>>      >>>>
>>>>>>>>>      >>>
>>>>>>>>>      >>>Thanks Jason.
>>>>>>>>>      >>>Ok, I am not sure which exec flag is that, but yah,
>>>>>>>>>execbuffer3
>>>>>>>>>      probably
>>>>>>>>>      >>>will not have this limitation. So, we need to define a
>>>>>>>>>      VM_BIND_MAX_QUEUE
>>>>>>>>>      >>>and somehow export it to user (I am thinking of
>>>>>>>>>embedding it in
>>>>>>>>>      >>>I915_PARAM_HAS_VM_BIND. bits[0]->HAS_VM_BIND, 
>>>>bits[1-3]->'n'
>>>>>>>>>      meaning 2^n
>>>>>>>>>      >>>queues.
>>>>>>>>>      >>
>>>>>>>>>      >>Ah, I think you are talking about I915_EXEC_RING_MASK
>>>>>>>>>(0x3f) which
>>>>>>>>>      execbuf3
>>>>>>>>>
>>>>>>>>>    Yup!  That's exactly the limit I was talking about.
>>>>>>>>>
>>>>>>>>>      >>will also have. So, we can simply define in vm_bind/unbind
>>>>>>>>>      structures,
>>>>>>>>>      >>
>>>>>>>>>      >>#define I915_VM_BIND_MAX_QUEUE   64
>>>>>>>>>      >>        __u32 queue;
>>>>>>>>>      >>
>>>>>>>>>      >>I think that will keep things simple.
>>>>>>>>>      >
>>>>>>>>>      >Hmmm? What does the execbuf2 limit have to do with how 
>>>>many engines
>>>>>>>>>      >hardware can have? I suggest not to do that.
>>>>>>>>>      >
>>>>>>>>>      >The change which added this:
>>>>>>>>>      >
>>>>>>>>>      >       if (set.num_engines > I915_EXEC_RING_MASK + 1)
>>>>>>>>>      >               return -EINVAL;
>>>>>>>>>      >
>>>>>>>>>      >to context creation needs to be undone and so let users
>>>>>>>>>create engine
>>>>>>>>>      >maps with all hardware engines, and let execbuf3 access
>>>>>>>>>them all.
>>>>>>>>>      >
>>>>>>>>>
>>>>>>>>>      The earlier plan was to carry I915_EXEC_RING_MASK (0x3f) to
>>>>>>>>>execbuff3 also.
>>>>>>>>>      Hence, I was using the same limit for VM_BIND queues
>>>>>>>>>(64, or 65 if we
>>>>>>>>>      make it N+1).
>>>>>>>>>      But, as discussed in other thread of this RFC series, we
>>>>>>>>>are planning
>>>>>>>>>      to drop this I915_EXEC_RING_MASK in execbuff3. So, 
>>>>there won't be
>>>>>>>>>      any uapi that limits the number of engines (and hence
>>>>>>>>>the vm_bind
>>>>>>>>>      queues
>>>>>>>>>      need to be supported).
>>>>>>>>>
>>>>>>>>>      If we leave the number of vm_bind queues to be 
>>>>arbitrarily large
>>>>>>>>>      (__u32 queue_idx) then, we need to have a hashmap for
>>>>>>>>>queue (a wq,
>>>>>>>>>      work_item and a linked list) lookup from the user
>>>>>>>>>specified queue
>>>>>>>>>      index.
>>>>>>>>>      Other option is to just put some hard limit (say 64 or
>>>>>>>>>65) and use
>>>>>>>>>      an array of queues in VM (each created upon first use).
>>>>>>>>>I prefer this.
>>>>>>>>>
>>>>>>>>>    I don't get why a VM_BIND queue is any different from any
>>>>>>>>>other queue or
>>>>>>>>>    userspace-visible kernel object.  But I'll leave those
>>>>>>>>>details up to
>>>>>>>>>    danvet or whoever else might be reviewing the implementation.
>>>>>>>>>    --Jason
>>>>>>>>>
>>>>>>>>>  I kind of agree here. Wouldn't it be simpler to have the bind
>>>>>>>>>queue created
>>>>>>>>>  like the others when we build the engine map?
>>>>>>>>>
>>>>>>>>>  For userspace it's then just a matter of selecting the right
>>>>>>>>>queue ID when
>>>>>>>>>  submitting.
>>>>>>>>>
>>>>>>>>>  If there is ever a possibility to have this work on the GPU,
>>>>>>>>>it would be
>>>>>>>>>  all ready.
>>>>>>>>>
>>>>>>>>
>>>>>>>>I did sync offline with Matt Brost on this.
>>>>>>>>We can add a VM_BIND engine class and let user create VM_BIND
>>>>>>>>engines (queues).
>>>>>>>>The problem is, in i915 the engine creation interface is bound to
>>>>>>>>gem_context.
>>>>>>>>So, in vm_bind ioctl, we would need both context_id and
>>>>>>>>queue_idx for proper
>>>>>>>>lookup of the user created engine. This is a bit awkward as 
>>>>vm_bind is an
>>>>>>>>interface to VM (address space) and has nothing to do with 
>>>>gem_context.
>>>>>>>
>>>>>>>
>>>>>>>A gem_context has a single vm object right?
>>>>>>>
>>>>>>>Set through I915_CONTEXT_PARAM_VM at creation or given a default
>>>>>>>one if not.
>>>>>>>
>>>>>>>So it's just like picking up the vm like it's done at execbuffer
>>>>>>>time right now : eb->context->vm
>>>>>>>
>>>>>>
>>>>>>Are you suggesting replacing 'vm_id' with 'context_id' in the
>>>>>>VM_BIND/UNBIND
>>>>>>ioctl and probably call it CONTEXT_BIND/UNBIND, because VM can be
>>>>>>obtained
>>>>>>from the context?
>>>>>
>>>>>
>>>>>Yes, because if we go for engines, they're associated with a context
>>>>>and so also associated with the VM bound to the context.
>>>>>
>>>>
>>>>Hmm...context doesn't sound like the right interface. It should be
>>>>VM and engine (independent of context). Engine can be virtual or soft
>>>>engine (kernel thread), each with its own queue. We can add an 
>>>>interface
>>>>to create such engines (independent of context). But we are anyway
>>>>implicitly creating it when user uses a new queue_idx. If in future
>>>>we have hardware engines for VM_BIND operation, we can have that
>>>>explicit interface to create engine instances and the queue_index
>>>>in vm_bind/unbind will point to those engines.
>>>>Anyone has any thoughts? Daniel?
>>>
>>>Exposing gem_context or intel_context to user space is a strange 
>>>concept to me. A context represents some hw resources that are used
>>>to complete a certain task. User space should only care about
>>>allocating some resources (memory, queues) and submitting tasks to
>>>queues. But user space doesn't care how a certain task is mapped to a
>>>HW context - the driver/guc should take care of this.
>>>
>>>So a cleaner interface to me is: user space creates a vm, creates a
>>>gem object and vm_binds it to the vm; allocates queues (internally
>>>representing compute or blitter HW; a queue can be virtual to the
>>>user) for this vm; submits tasks to queues. User can create multiple
>>>queues under one vm. One queue belongs to only one vm.
>>>
>>>The i915 driver/guc manages the hw compute or blitter resources,
>>>which are transparent to user space. When i915 or guc decides to
>>>schedule a queue (run tasks on that queue), a HW engine will be
>>>picked up and set up properly for the vm of that queue (i.e.,
>>>switched to the page tables of that vm) - this is a context switch.
>>>
>>>From the vm_bind perspective, it simply binds a gem_object to a vm.
>>>Engine/queue is not a parameter to vm_bind, as any engine can be
>>>picked up by i915/guc to execute a task using the vm-bound va.
>>>
>>>I didn't completely follow the discussion here. Just sharing some 
>>>thoughts.
>>>
>>
>>Yah, I agree.
>>
>>Lionel,
>>How about we define the queue as
>>union {
>>       __u32 queue_idx;
>>       __u64 rsvd;
>>}
>>
>>If required, we can extend by expanding the 'rsvd' field to <ctx_id, 
>>queue_idx> later
>>with a flag.
>>
>>Niranjana
>
>
>I did not really understand Oak's comment nor what you're suggesting 
>here to be honest.
>
>
>First the GEM context is already exposed to userspace. It's explicitly 
>created by userspace with DRM_IOCTL_I915_GEM_CONTEXT_CREATE.
>
>We give the GEM context id in every execbuffer we do with 
>drm_i915_gem_execbuffer2::rsvd1.
>
>It's still in the new execbuffer3 proposal being discussed.
>
>
>Second, the GEM context is also where we set the VM with 
>I915_CONTEXT_PARAM_VM.
>
>
>Third, the GEM context also has the list of engines with 
>I915_CONTEXT_PARAM_ENGINES.
>

Yes, the execbuf and engine map creation are tied to gem_context
(which is probably not the best interface).

>
>So it makes sense to me to dispatch the vm_bind operation to a GEM 
>context, to a given vm_bind queue, because it's got all the 
>information required :
>
>    - the list of new vm_bind queues
>
>    - the vm that is going to be modified
>

But the operation is performed here on the address space (VM) which
can have multiple gem_contexts referring to it. So, VM is the right
interface here. We need not 'gem_context'ify it.

All we need is multiple queue support for the address space (VM).
Going to gem_context for that just because we have engine creation
support there seems unnecessary and incorrect to me (see the sketch below).
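
Something like this, say (just a sketch to make the shape concrete ;
roughly the struct from the RFC with the queue index this thread is
debating added, not final uapi) :

    struct drm_i915_gem_vm_bind {
        __u32 vm_id;      /* the address space being modified */
        __u32 queue_idx;  /* per-VM bind queue ; binds on one queue are ordered */
        __u32 handle;     /* GEM object to bind */
        __u32 pad;
        __u64 start;      /* GPU virtual address of the mapping */
        __u64 offset;     /* offset into the object */
        __u64 length;     /* length of the mapping */
        __u64 flags;
        __u64 extensions; /* in/out fence extensions etc. */
    };

No gem_context anywhere in it.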

>
>Otherwise where do the vm_bind queues live?
>
>In the i915/drm fd object?
>
>That would mean that all the GEM contexts are sharing the same vm_bind 
>queues.
>

Not all, only the gem contexts that are using the same address space (VM).
But to me the right way to describe it would be that "the VM will be
using those queues" (a rough KMD-side sketch below).

Niranjana

>
>intel_context or GuC are internal details we're not concerned about.
>
>I don't really see the connection with the GEM context.
>
>
>Maybe Oak has a different use case than Vulkan.
>
>
>-Lionel
>
>
>>
>>>Regards,
>>>Oak
>>>
>>>>
>>>>Niranjana
>>>>
>>>>>
>>>>>>I think the interface is clean as an interface to the VM. It is 
>>>>only that we
>>>>>>don't have a clean way to create a raw VM_BIND engine (not
>>>>>>associated with
>>>>>>any context) with i915 uapi.
>>>>>>May be we can add such an interface, but I don't think that is 
>>>>worth it
>>>>>>(we might as well just use a queue_idx in VM_BIND/UNBIND ioctl as I
>>>>>>mentioned
>>>>>>above).
>>>>>>Anyone has any thoughts?
>>>>>>
>>>>>>>
>>>>>>>>Another problem is, if two VMs are binding with the same defined
>>>>>>>>engine,
>>>>>>>>binding on VM1 can get unnecessarily blocked by binding on VM2
>>>>>>>>(which may be
>>>>>>>>waiting on its in_fence).
>>>>>>>
>>>>>>>
>>>>>>>Maybe I'm missing something, but how can you have 2 vm objects
>>>>>>>with a single gem_context right now?
>>>>>>>
>>>>>>
>>>>>>No, we don't have 2 VMs for a gem_context.
>>>>>>Say if ctx1 with vm1 and ctx2 with vm2.
>>>>>>First vm_bind call was for vm1 with q_idx 1 in ctx1 engine map.
>>>>>>Second vm_bind call was for vm2 with q_idx 2 in ctx2 engine map. If
>>>>>>those two queue indices point to the same underlying vm_bind engine,
>>>>>>then the second vm_bind call gets blocked until the first 
>>>>vm_bind call's
>>>>>>'in' fence is triggered and bind completes.
>>>>>>
>>>>>>With per VM queues, this is not a problem as two VMs will not end up
>>>>>>sharing the same queue.
>>>>>>
>>>>>>BTW, I just posted an updated PATCH series.
>>>>>>https://www.spinics.net/lists/dri-devel/msg350483.html
>>>>>>
>>>>>>Niranjana
>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>>So, my preference here is to just add a 'u32 queue' index in
>>>>>>>>vm_bind/unbind
>>>>>>>>ioctl, and the queues are per VM.
>>>>>>>>
>>>>>>>>Niranjana
>>>>>>>>
>>>>>>>>>  Thanks,
>>>>>>>>>
>>>>>>>>>  -Lionel
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>      Niranjana
>>>>>>>>>
>>>>>>>>>      >Regards,
>>>>>>>>>      >
>>>>>>>>>      >Tvrtko
>>>>>>>>>      >
>>>>>>>>>      >>
>>>>>>>>>      >>Niranjana
>>>>>>>>>      >>
>>>>>>>>>      >>>
>>>>>>>>>      >>>>   I am trying to see how many queues we need and
>>>>>>>>>don't want it to
>>>>>>>>>      be
>>>>>>>>>      >>>>   arbitrarily
>>>>>>>>>      >>>>   large and unduly blow up memory usage and
>>>>>>>>>complexity in i915
>>>>>>>>>      driver.
>>>>>>>>>      >>>>
>>>>>>>>>      >>>> I expect a Vulkan driver to use at most 2 in the
>>>>>>>>>vast majority
>>>>>>>>>      >>>>of cases. I
>>>>>>>>>      >>>> could imagine a client wanting to create more 
>>>>than 1 sparse
>>>>>>>>>      >>>>queue in which
>>>>>>>>>      >>>> case, it'll be N+1 but that's unlikely. As far as
>>>>>>>>>complexity
>>>>>>>>>      >>>>goes, once
>>>>>>>>>      >>>> you allow two, I don't think the complexity is 
>>>>going up by
>>>>>>>>>      >>>>allowing N.  As
>>>>>>>>>      >>>> for memory usage, creating more queues means more
>>>>>>>>>memory.  That's
>>>>>>>>>      a
>>>>>>>>>      >>>> trade-off that userspace can make. Again, the
>>>>>>>>>expected number
>>>>>>>>>      >>>>here is 1
>>>>>>>>>      >>>> or 2 in the vast majority of cases so I don't think
>>>>>>>>>you need to
>>>>>>>>>      worry.
>>>>>>>>>      >>>
>>>>>>>>>      >>>Ok, will start with n=3 meaning 8 queues.
>>>>>>>>>      >>>That would require us to create 8 workqueues.
>>>>>>>>>      >>>We can change 'n' later if required.
>>>>>>>>>      >>>
>>>>>>>>>      >>>Niranjana
>>>>>>>>>      >>>
>>>>>>>>>      >>>>
>>>>>>>>>      >>>>   >     Why? Because Vulkan has two basic kind of bind
>>>>>>>>>      >>>>operations and we
>>>>>>>>>      >>>>   don't
>>>>>>>>>      >>>>   >     want any dependencies between them:
>>>>>>>>>      >>>>   >      1. Immediate.  These happen right after BO
>>>>>>>>>creation or
>>>>>>>>>      >>>>maybe as
>>>>>>>>>      >>>>   part of
>>>>>>>>>      >>>>   > vkBindImageMemory() or VkBindBufferMemory().  These
>>>>>>>>>      >>>>don't happen
>>>>>>>>>      >>>>   on a
>>>>>>>>>      >>>>   >     queue and we don't want them serialized
>>>>>>>>>with anything.       To
>>>>>>>>>      >>>>   synchronize
>>>>>>>>>      >>>>   >     with submit, we'll have a syncobj in the
>>>>>>>>>VkDevice which
>>>>>>>>>      is
>>>>>>>>>      >>>>   signaled by
>>>>>>>>>      >>>>   >     all immediate bind operations and make
>>>>>>>>>submits wait on
>>>>>>>>>      it.
>>>>>>>>>      >>>>   >      2. Queued (sparse): These happen on a
>>>>>>>>>VkQueue which may
>>>>>>>>>      be the
>>>>>>>>>      >>>>   same as
>>>>>>>>>      >>>>   >     a render/compute queue or may be its own
>>>>>>>>>queue.  It's up
>>>>>>>>>      to us
>>>>>>>>>      >>>>   what we
>>>>>>>>>      >>>>   >     want to advertise.  From the Vulkan API
>>>>>>>>>PoV, this is like
>>>>>>>>>      any
>>>>>>>>>      >>>>   other
>>>>>>>>>      >>>>   >     queue. Operations on it wait on and signal
>>>>>>>>>semaphores.       If we
>>>>>>>>>      >>>>   have a
>>>>>>>>>      >>>>   >     VM_BIND engine, we'd provide syncobjs to 
>>>>wait and
>>>>>>>>>      >>>>signal just like
>>>>>>>>>      >>>>   we do
>>>>>>>>>      >>>>   >     in execbuf().
>>>>>>>>>      >>>>   >     The important thing is that we don't want
>>>>>>>>>one type of
>>>>>>>>>      >>>>operation to
>>>>>>>>>      >>>>   block
>>>>>>>>>      >>>>   >     on the other.  If immediate binds are
>>>>>>>>>blocking on sparse
>>>>>>>>>      binds,
>>>>>>>>>      >>>>   it's
>>>>>>>>>      >>>>   >     going to cause over-synchronization issues.
>>>>>>>>>      >>>>   >     In terms of the internal implementation, I
>>>>>>>>>know that
>>>>>>>>>      >>>>there's going
>>>>>>>>>      >>>>   to be
>>>>>>>>>      >>>>   >     a lock on the VM and that we can't actually
>>>>>>>>>do these
>>>>>>>>>      things in
>>>>>>>>>      >>>>   > parallel.  That's fine. Once the dma_fences have
>>>>>>>>>      signaled and
>>>>>>>>>      >>>>   we're
>>>>>>>>>      >>>>
>>>>>>>>>      >>>>   Thats correct. It is like a single VM_BIND 
>>>>engine with
>>>>>>>>>      >>>>multiple queues
>>>>>>>>>      >>>>   feeding to it.
>>>>>>>>>      >>>>
>>>>>>>>>      >>>> Right.  As long as the queues themselves are
>>>>>>>>>independent and
>>>>>>>>>      >>>>can block on
>>>>>>>>>      >>>> dma_fences without holding up other queues, I think
>>>>>>>>>we're fine.
>>>>>>>>>      >>>>
>>>>>>>>>      >>>>   > unblocked to do the bind operation, I don't care if
>>>>>>>>>      >>>>there's a bit
>>>>>>>>>      >>>>   of
>>>>>>>>>      >>>>   > synchronization due to locking.  That's
>>>>>>>>>expected.  What
>>>>>>>>>      >>>>we can't
>>>>>>>>>      >>>>   afford
>>>>>>>>>      >>>>   >     to have is an immediate bind operation
>>>>>>>>>suddenly blocking
>>>>>>>>>      on a
>>>>>>>>>      >>>>   sparse
>>>>>>>>>      >>>>   > operation which is blocked on a compute job
>>>>>>>>>that's going
>>>>>>>>>      to run
>>>>>>>>>      >>>>   for
>>>>>>>>>      >>>>   >     another 5ms.
>>>>>>>>>      >>>>
>>>>>>>>>      >>>>   As the VM_BIND queue is per VM, VM_BIND on one VM
>>>>>>>>>doesn't block
>>>>>>>>>      the
>>>>>>>>>      >>>>   VM_BIND
>>>>>>>>>      >>>>   on other VMs. I am not sure about usecases 
>>>>here, but just
>>>>>>>>>      wanted to
>>>>>>>>>      >>>>   clarify.
>>>>>>>>>      >>>>
>>>>>>>>>      >>>> Yes, that's what I would expect.
>>>>>>>>>      >>>> --Jason
>>>>>>>>>      >>>>
>>>>>>>>>      >>>>   Niranjana
>>>>>>>>>      >>>>
>>>>>>>>>      >>>>   >     For reference, Windows solves this by allowing
>>>>>>>>>      arbitrarily many
>>>>>>>>>      >>>>   paging
>>>>>>>>>      >>>>   >     queues (what they call a VM_BIND
>>>>>>>>>engine/queue).  That
>>>>>>>>>      >>>>design works
>>>>>>>>>      >>>>   >     pretty well and solves the problems in
>>>>>>>>>question. Again, we could
>>>>>>>>>      >>>>   just
>>>>>>>>>      >>>>   >     make everything out-of-order and require
>>>>>>>>>using syncobjs
>>>>>>>>>      >>>>to order
>>>>>>>>>      >>>>   things
>>>>>>>>>      >>>>   >     as userspace wants. That'd be fine too.
>>>>>>>>>      >>>>   >     One more note while I'm here: danvet said
>>>>>>>>>something on
>>>>>>>>>      >>>>IRC about
>>>>>>>>>      >>>>   VM_BIND
>>>>>>>>>      >>>>   >     queues waiting for syncobjs to
>>>>>>>>>materialize.  We don't
>>>>>>>>>      really
>>>>>>>>>      >>>>   want/need
>>>>>>>>>      >>>>   >     this. We already have all the machinery in
>>>>>>>>>userspace to
>>>>>>>>>      handle
>>>>>>>>>      >>>>   > wait-before-signal and waiting for syncobj
>>>>>>>>>fences to
>>>>>>>>>      >>>>materialize
>>>>>>>>>      >>>>   and
>>>>>>>>>      >>>>   >     that machinery is on by default.  It 
>>>>would actually
>>>>>>>>>      >>>>take MORE work
>>>>>>>>>      >>>>   in
>>>>>>>>>      >>>>   >     Mesa to turn it off and take advantage of
>>>>>>>>>the kernel
>>>>>>>>>      >>>>being able to
>>>>>>>>>      >>>>   wait
>>>>>>>>>      >>>>   >     for syncobjs to materialize. Also, getting
>>>>>>>>>that right is
>>>>>>>>>      >>>>   ridiculously
>>>>>>>>>      >>>>   >     hard and I really don't want to get it
>>>>>>>>>wrong in kernel
>>>>>>>>>      >>>>space. When we
>>>>>>>>>      >>>>   >     do memory fences, wait-before-signal will
>>>>>>>>>be a thing.  We
>>>>>>>>>      don't
>>>>>>>>>      >>>>   need to
>>>>>>>>>      >>>>   >     try and make it a thing for syncobj.
>>>>>>>>>      >>>>   >     --Jason
>>>>>>>>>      >>>>   >
>>>>>>>>>      >>>>   >   Thanks Jason,
>>>>>>>>>      >>>>   >
>>>>>>>>>      >>>>   >   I missed the bit in the Vulkan spec that
>>>>>>>>>we're allowed to
>>>>>>>>>      have a
>>>>>>>>>      >>>>   sparse
>>>>>>>>>      >>>>   >   queue that does not implement either graphics
>>>>>>>>>or compute
>>>>>>>>>      >>>>operations
>>>>>>>>>      >>>>   :
>>>>>>>>>      >>>>   >
>>>>>>>>>      >>>>   >     "While some implementations may include
>>>>>>>>>      >>>> VK_QUEUE_SPARSE_BINDING_BIT
>>>>>>>>>      >>>>   >     support in queue families that also include
>>>>>>>>>      >>>>   >
>>>>>>>>>      >>>>   > graphics and compute support, other
>>>>>>>>>implementations may
>>>>>>>>>      only
>>>>>>>>>      >>>>   expose a
>>>>>>>>>      >>>>   > VK_QUEUE_SPARSE_BINDING_BIT-only queue
>>>>>>>>>      >>>>   >
>>>>>>>>>      >>>>   > family."
>>>>>>>>>      >>>>   >
>>>>>>>>>      >>>>   >   So it can all be a vm_bind engine that 
>>>>just does
>>>>>>>>>      bind/unbind
>>>>>>>>>      >>>>   > operations.
>>>>>>>>>      >>>>   >
>>>>>>>>>      >>>>   >   But yes we need another engine for the
>>>>>>>>>immediate/non-sparse
>>>>>>>>>      >>>>   operations.
>>>>>>>>>      >>>>   >
>>>>>>>>>      >>>>   >   -Lionel
>>>>>>>>>      >>>>   >
>>>>>>>>>      >>>>   >         >
>>>>>>>>>      >>>>   > Daniel, any thoughts?
>>>>>>>>>      >>>>   >
>>>>>>>>>      >>>>   > Niranjana
>>>>>>>>>      >>>>   >
>>>>>>>>>      >>>>   > >Matt
>>>>>>>>>      >>>>   >       >
>>>>>>>>>      >>>>   > >>
>>>>>>>>>      >>>>   > >> Sorry I noticed this late.
>>>>>>>>>      >>>>   > >>
>>>>>>>>>      >>>>   > >>
>>>>>>>>>      >>>>   > >> -Lionel
>>>>>>>>>      >>>>   > >>
>>>>>>>>>      >>>>   > >>
>>>>>>>
>>>>>>>
>>>>>
>
>

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [Intel-gfx] [RFC v3 1/3] drm/doc/rfc: VM_BIND feature design document
@ 2022-06-14 17:01                                               ` Niranjana Vishwanathapura
  0 siblings, 0 replies; 121+ messages in thread
From: Niranjana Vishwanathapura @ 2022-06-14 17:01 UTC (permalink / raw)
  To: Lionel Landwerlin
  Cc: Wilson, Chris P, Intel GFX, Maling list - DRI developers,
	Hellstrom, Thomas, Vetter, Daniel, Christian König

On Tue, Jun 14, 2022 at 10:04:00AM +0300, Lionel Landwerlin wrote:
>On 13/06/2022 21:02, Niranjana Vishwanathapura wrote:
>>On Mon, Jun 13, 2022 at 06:33:07AM -0700, Zeng, Oak wrote:
>>>
>>>
>>>Regards,
>>>Oak
>>>
>>>>-----Original Message-----
>>>>From: Intel-gfx <intel-gfx-bounces@lists.freedesktop.org> On 
>>>>Behalf Of Niranjana
>>>>Vishwanathapura
>>>>Sent: June 10, 2022 1:43 PM
>>>>To: Landwerlin, Lionel G <lionel.g.landwerlin@intel.com>
>>>>Cc: Intel GFX <intel-gfx@lists.freedesktop.org>; Maling list - 
>>>>DRI developers <dri-
>>>>devel@lists.freedesktop.org>; Hellstrom, Thomas 
>>>><thomas.hellstrom@intel.com>;
>>>>Wilson, Chris P <chris.p.wilson@intel.com>; Vetter, Daniel
>>>><daniel.vetter@intel.com>; Christian König <christian.koenig@amd.com>
>>>>Subject: Re: [Intel-gfx] [RFC v3 1/3] drm/doc/rfc: VM_BIND 
>>>>feature design
>>>>document
>>>>
>>>>On Fri, Jun 10, 2022 at 11:18:14AM +0300, Lionel Landwerlin wrote:
>>>>>On 10/06/2022 10:54, Niranjana Vishwanathapura wrote:
>>>>>>On Fri, Jun 10, 2022 at 09:53:24AM +0300, Lionel Landwerlin wrote:
>>>>>>>On 09/06/2022 22:31, Niranjana Vishwanathapura wrote:
>>>>>>>>On Thu, Jun 09, 2022 at 05:49:09PM +0300, Lionel Landwerlin wrote:
>>>>>>>>>  On 09/06/2022 00:55, Jason Ekstrand wrote:
>>>>>>>>>
>>>>>>>>>    On Wed, Jun 8, 2022 at 4:44 PM Niranjana Vishwanathapura
>>>>>>>>> <niranjana.vishwanathapura@intel.com> wrote:
>>>>>>>>>
>>>>>>>>>      On Wed, Jun 08, 2022 at 08:33:25AM +0100, Tvrtko 
>>>>Ursulin wrote:
>>>>>>>>>      >
>>>>>>>>>      >
>>>>>>>>>      >On 07/06/2022 22:32, Niranjana Vishwanathapura wrote:
>>>>>>>>>      >>On Tue, Jun 07, 2022 at 11:18:11AM -0700, Niranjana
>>>>>>>>>Vishwanathapura
>>>>>>>>>      wrote:
>>>>>>>>>      >>>On Tue, Jun 07, 2022 at 12:12:03PM -0500, Jason
>>>>>>>>>Ekstrand wrote:
>>>>>>>>>      >>>> On Fri, Jun 3, 2022 at 6:52 PM Niranjana 
>>>>Vishwanathapura
>>>>>>>>>      >>>> <niranjana.vishwanathapura@intel.com> wrote:
>>>>>>>>>      >>>>
>>>>>>>>>      >>>>   On Fri, Jun 03, 2022 at 10:20:25AM +0300, Lionel
>>>>>>>>>Landwerlin
>>>>>>>>>      wrote:
>>>>>>>>>      >>>>   >   On 02/06/2022 23:35, Jason Ekstrand wrote:
>>>>>>>>>      >>>>   >
>>>>>>>>>      >>>>   >     On Thu, Jun 2, 2022 at 3:11 PM Niranjana
>>>>>>>>>Vishwanathapura
>>>>>>>>>      >>>>   > <niranjana.vishwanathapura@intel.com> wrote:
>>>>>>>>>      >>>>   >
>>>>>>>>>      >>>>   >       On Wed, Jun 01, 2022 at 01:28:36PM 
>>>>-0700, Matthew
>>>>>>>>>      >>>>Brost wrote:
>>>>>>>>>      >>>>   >       >On Wed, Jun 01, 2022 at 05:25:49PM 
>>>>+0300, Lionel
>>>>>>>>>      Landwerlin
>>>>>>>>>      >>>>   wrote:
>>>>>>>>>      >>>>   > >> On 17/05/2022 21:32, Niranjana Vishwanathapura
>>>>>>>>>      wrote:
>>>>>>>>>      >>>>   > >> > +VM_BIND/UNBIND ioctl will immediately start
>>>>>>>>>      >>>>   binding/unbinding
>>>>>>>>>      >>>>   >       the mapping in an
>>>>>>>>>      >>>>   > >> > +async worker. The binding and
>>>>>>>>>unbinding will
>>>>>>>>>      >>>>work like a
>>>>>>>>>      >>>>   special
>>>>>>>>>      >>>>   >       GPU engine.
>>>>>>>>>      >>>>   > >> > +The binding and unbinding operations are
>>>>>>>>>      serialized and
>>>>>>>>>      >>>>   will
>>>>>>>>>      >>>>   >       wait on specified
>>>>>>>>>      >>>>   > >> > +input fences before the operation
>>>>>>>>>and will signal
>>>>>>>>>      the
>>>>>>>>>      >>>>   output
>>>>>>>>>      >>>>   >       fences upon the
>>>>>>>>>      >>>>   > >> > +completion of the operation. Due to
>>>>>>>>>      serialization,
>>>>>>>>>      >>>>   completion of
>>>>>>>>>      >>>>   >       an operation
>>>>>>>>>      >>>>   > >> > +will also indicate that all
>>>>>>>>>previous operations
>>>>>>>>>      >>>>are also
>>>>>>>>>      >>>>   > complete.
>>>>>>>>>      >>>>   > >>
>>>>>>>>>      >>>>   > >> I guess we should avoid saying "will
>>>>>>>>>immediately
>>>>>>>>>      start
>>>>>>>>>      >>>>   > binding/unbinding" if
>>>>>>>>>      >>>>   > >> there are fences involved.
>>>>>>>>>      >>>>   > >>
>>>>>>>>>      >>>>   > >> And the fact that it's happening in an async
>>>>>>>>>      >>>>worker seem to
>>>>>>>>>      >>>>   imply
>>>>>>>>>      >>>>   >       it's not
>>>>>>>>>      >>>>   > >> immediate.
>>>>>>>>>      >>>>   > >>
>>>>>>>>>      >>>>   >
>>>>>>>>>      >>>>   >       Ok, will fix.
>>>>>>>>>      >>>>   >       This was added because in earlier design
>>>>>>>>>binding was
>>>>>>>>>      deferred
>>>>>>>>>      >>>>   until
>>>>>>>>>      >>>>   >       next execbuff.
>>>>>>>>>      >>>>   >       But now it is non-deferred (immediate in
>>>>>>>>>that sense).
>>>>>>>>>      >>>>But yah,
>>>>>>>>>      >>>>   this is
>>>>>>>>>      >>>>   > confusing
>>>>>>>>>      >>>>   >       and will fix it.
>>>>>>>>>      >>>>   >
>>>>>>>>>      >>>>   > >>
>>>>>>>>>      >>>>   > >> I have a question on the behavior of the bind
>>>>>>>>>      >>>>operation when
>>>>>>>>>      >>>>   no
>>>>>>>>>      >>>>   >       input fence
>>>>>>>>>      >>>>   > >> is provided. Let's say I do :
>>>>>>>>>      >>>>   > >>
>>>>>>>>>      >>>>   > >> VM_BIND (out_fence=fence1)
>>>>>>>>>      >>>>   > >>
>>>>>>>>>      >>>>   > >> VM_BIND (out_fence=fence2)
>>>>>>>>>      >>>>   > >>
>>>>>>>>>      >>>>   > >> VM_BIND (out_fence=fence3)
>>>>>>>>>      >>>>   > >>
>>>>>>>>>      >>>>   > >>
>>>>>>>>>      >>>>   > >> In what order are the fences going to
>>>>>>>>>be signaled?
>>>>>>>>>      >>>>   > >>
>>>>>>>>>      >>>>   > >> In the order of VM_BIND ioctls? Or out
>>>>>>>>>of order?
>>>>>>>>>      >>>>   > >>
>>>>>>>>>      >>>>   > >> Because you wrote "serialized I assume
>>>>>>>>>it's : in
>>>>>>>>>      order
>>>>>>>>>      >>>>   > >>
>>>>>>>>>      >>>>   >
>>>>>>>>>      >>>>   >       Yes, in the order of VM_BIND/UNBIND
>>>>>>>>>ioctls. Note that
>>>>>>>>>      >>>>bind and
>>>>>>>>>      >>>>   unbind
>>>>>>>>>      >>>>   >       will use
>>>>>>>>>      >>>>   >       the same queue and hence are ordered.
>>>>>>>>>      >>>>   >
>>>>>>>>>      >>>>   > >>
>>>>>>>>>      >>>>   > >> One thing I didn't realize is that
>>>>>>>>>because we only
>>>>>>>>>      get one
>>>>>>>>>      >>>>   > "VM_BIND" engine,
>>>>>>>>>      >>>>   > >> there is a disconnect from the Vulkan
>>>>>>>>>specification.
>>>>>>>>>      >>>>   > >>
>>>>>>>>>      >>>>   > >> In Vulkan VM_BIND operations are
>>>>>>>>>serialized but
>>>>>>>>>      >>>>per engine.
>>>>>>>>>      >>>>   > >>
>>>>>>>>>      >>>>   > >> So you could have something like this :
>>>>>>>>>      >>>>   > >>
>>>>>>>>>      >>>>   > >> VM_BIND (engine=rcs0, in_fence=fence1,
>>>>>>>>>      out_fence=fence2)
>>>>>>>>>      >>>>   > >>
>>>>>>>>>      >>>>   > >> VM_BIND (engine=ccs0, in_fence=fence3,
>>>>>>>>>      out_fence=fence4)
>>>>>>>>>      >>>>   > >>
>>>>>>>>>      >>>>   > >>
>>>>>>>>>      >>>>   > >> fence1 is not signaled
>>>>>>>>>      >>>>   > >>
>>>>>>>>>      >>>>   > >> fence3 is signaled
>>>>>>>>>      >>>>   > >>
>>>>>>>>>      >>>>   > >> So the second VM_BIND will proceed before the
>>>>>>>>>      >>>>first VM_BIND.
>>>>>>>>>      >>>>   > >>
>>>>>>>>>      >>>>   > >>
>>>>>>>>>      >>>>   > >> I guess we can deal with that scenario in
>>>>>>>>>      >>>>userspace by doing
>>>>>>>>>      >>>>   the
>>>>>>>>>      >>>>   >       wait
>>>>>>>>>      >>>>   > >> ourselves in one thread per engine.
>>>>>>>>>      >>>>   > >>
>>>>>>>>>      >>>>   > >> But then it makes the VM_BIND input
>>>>>>>>>fences useless.
>>>>>>>>>      >>>>   > >>
>>>>>>>>>      >>>>   > >>
>>>>>>>>>      >>>>   > >> Daniel: what do you think? Should we
>>>>>>>>>rework this or
>>>>>>>>>      just
>>>>>>>>>      >>>>   deal with
>>>>>>>>>      >>>>   >       wait
>>>>>>>>>      >>>>   > >> fences in userspace?
>>>>>>>>>      >>>>   > >>
>>>>>>>>>      >>>>   >       >
>>>>>>>>>      >>>>   >       >My opinion is to rework this but make the
>>>>>>>>>ordering via
>>>>>>>>>      >>>>an engine
>>>>>>>>>      >>>>   param
>>>>>>>>>      >>>>   > optional.
>>>>>>>>>      >>>>   >       >
>>>>>>>>>      >>>>   > >e.g. A VM can be configured so all binds
>>>>>>>>>are ordered
>>>>>>>>>      >>>>within the
>>>>>>>>>      >>>>   VM
>>>>>>>>>      >>>>   >       >
>>>>>>>>>      >>>>   > >e.g. A VM can be configured so all binds
>>>>>>>>>accept an
>>>>>>>>>      engine
>>>>>>>>>      >>>>   argument
>>>>>>>>>      >>>>   >       (in
>>>>>>>>>      >>>>   > >the case of the i915 likely this is a
>>>>>>>>>gem context
>>>>>>>>>      >>>>handle) and
>>>>>>>>>      >>>>   binds
>>>>>>>>>      >>>>   > >are ordered with respect to that engine.
>>>>>>>>>      >>>>   >       >
>>>>>>>>>      >>>>   > >This gives UMDs options as the latter
>>>>>>>>>likely consumes
>>>>>>>>>      >>>>more KMD
>>>>>>>>>      >>>>   > resources
>>>>>>>>>      >>>>   >       >so if a different UMD can live with 
>>>>binds being
>>>>>>>>>      >>>>ordered within
>>>>>>>>>      >>>>   the VM
>>>>>>>>>      >>>>   > >they can use a mode consuming fewer resources.
>>>>>>>>>      >>>>   >       >
>>>>>>>>>      >>>>   >
>>>>>>>>>      >>>>   >       I think we need to be careful here if we
>>>>>>>>>are looking
>>>>>>>>>      for some
>>>>>>>>>      >>>>   out of
>>>>>>>>>      >>>>   > (submission) order completion of vm_bind/unbind.
>>>>>>>>>      >>>>   > In-order completion means, in a batch of
>>>>>>>>>binds and
>>>>>>>>>      >>>>unbinds to be
>>>>>>>>>      >>>>   > completed in-order, the user only needs to specify
>>>>>>>>>      >>>>the in-fence for the
>>>>>>>>>      >>>>   >       first bind/unbind call and the out-fence
>>>>>>>>>for the last
>>>>>>>>>      >>>>   bind/unbind
>>>>>>>>>      >>>>   >       call. Also, the VA released by an unbind
>>>>>>>>>call can be
>>>>>>>>>      >>>>re-used by
>>>>>>>>>      >>>>   >       any subsequent bind call in that 
>>>>in-order batch.
>>>>>>>>>      >>>>   >
>>>>>>>>>      >>>>   >       These things will break if
>>>>>>>>>binding/unbinding were to
>>>>>>>>>      >>>>be allowed
>>>>>>>>>      >>>>   to
>>>>>>>>>      >>>>   >       go out of order (of submission) and users
>>>>>>>>>need to be
>>>>>>>>>      extra
>>>>>>>>>      >>>>   careful
>>>>>>>>>      >>>>   >       not to run into premature triggering of
>>>>>>>>>out-fence and
>>>>>>>>>      bind
>>>>>>>>>      >>>>   failing
>>>>>>>>>      >>>>   >       as VA is still in use etc.
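
A minimal sketch of the in-order batch described above (every struct,
field and function name below is an illustrative assumption, not the
proposed uapi):

/*
 * Sketch: only the first op in the batch carries an in-fence and only
 * the last carries an out-fence; the in-order queue guarantees the rest.
 */
#include <stdint.h>

struct vm_bind_op {
	uint32_t vm_id;      /* target address space */
	uint64_t start;      /* GPU VA */
	uint64_t length;     /* mapping size */
	uint32_t in_fence;   /* syncobj to wait on; 0 = none */
	uint32_t out_fence;  /* syncobj to signal; 0 = none */
};

int vm_bind(int drm_fd, const struct vm_bind_op *op);  /* hypothetical wrapper */

static void bind_batch(int drm_fd, uint32_t vm, uint32_t wait, uint32_t done)
{
	struct vm_bind_op first = { .vm_id = vm, .in_fence = wait };
	struct vm_bind_op mid   = { .vm_id = vm };
	struct vm_bind_op last  = { .vm_id = vm, .out_fence = done };

	vm_bind(drm_fd, &first);  /* gates the whole batch on 'wait' */
	vm_bind(drm_fd, &mid);    /* ordered after 'first' */
	vm_bind(drm_fd, &last);   /* 'done' signaling => all three complete */
}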
>>>>>>>>>      >>>>   >
>>>>>>>>>      >>>>   >       Also, VM_BIND binds the provided 
>>>>mapping on the
>>>>>>>>>      specified
>>>>>>>>>      >>>>   address
>>>>>>>>>      >>>>   >       space
>>>>>>>>>      >>>>   >       (VM). So, the uapi is not engine/context
>>>>>>>>>specific.
>>>>>>>>>      >>>>   >
>>>>>>>>>      >>>>   >       We can however add a 'queue' to the uapi
>>>>>>>>>which can be
>>>>>>>>>      >>>>one from
>>>>>>>>>      >>>>   the
>>>>>>>>>      >>>>   > pre-defined queues,
>>>>>>>>>      >>>>   > I915_VM_BIND_QUEUE_0
>>>>>>>>>      >>>>   > I915_VM_BIND_QUEUE_1
>>>>>>>>>      >>>>   >       ...
>>>>>>>>>      >>>>   > I915_VM_BIND_QUEUE_(N-1)
>>>>>>>>>      >>>>   >
>>>>>>>>>      >>>>   >       KMD will spawn an async work queue for
>>>>>>>>>each queue which
>>>>>>>>>      will
>>>>>>>>>      >>>>   only
>>>>>>>>>      >>>>   >       bind the mappings on that queue in the 
>>>>order of
>>>>>>>>>      submission.
>>>>>>>>>      >>>>   >       The user can assign a queue per engine
>>>>>>>>>or anything
>>>>>>>>>      >>>>like that.
>>>>>>>>>      >>>>   >
>>>>>>>>>      >>>>   >       But again here, the user needs to be
>>>>careful not to
>>>>>>>>>      >>>>deadlock these
>>>>>>>>>      >>>>   >       queues with a circular dependency of fences.
>>>>>>>>>      >>>>   >
>>>>>>>>>      >>>>   >       I prefer adding this later as an
>>>>>>>>>extension based on
>>>>>>>>>      >>>>whether it
>>>>>>>>>      >>>>   >       is really helping with the implementation.
>>>>>>>>>      >>>>   >
>>>>>>>>>      >>>>   >     I can tell you right now that having
>>>>>>>>>everything on a
>>>>>>>>>      single
>>>>>>>>>      >>>>   in-order
>>>>>>>>>      >>>>   >     queue will not get us the perf we want.
>>>>>>>>>What vulkan
>>>>>>>>>      >>>>really wants
>>>>>>>>>      >>>>   is one
>>>>>>>>>      >>>>   >     of two things:
>>>>>>>>>      >>>>   >      1. No implicit ordering of VM_BIND 
>>>>ops.  They just
>>>>>>>>>      happen in
>>>>>>>>>      >>>>   whatever
>>>>>>>>>      >>>>   >     order their dependencies resolve, and we
>>>>>>>>>ensure ordering
>>>>>>>>>      >>>>ourselves
>>>>>>>>>      >>>>   by
>>>>>>>>>      >>>>   >     having a syncobj in the VkQueue.
>>>>>>>>>      >>>>   >      2. The ability to create multiple VM_BIND
>>>>>>>>>queues.  We
>>>>>>>>>      need at
>>>>>>>>>      >>>>   least 2
>>>>>>>>>      >>>>   >     but I don't see why there needs to be a
>>>>>>>>>limit besides
>>>>>>>>>      >>>>the limits
>>>>>>>>>      >>>>   the
>>>>>>>>>      >>>>   >     i915 API already has on the number of
>>>>>>>>>engines.  Vulkan
>>>>>>>>>      could
>>>>>>>>>      >>>>   expose
>>>>>>>>>      >>>>   >     multiple sparse binding queues to the
>>>>>>>>>client if it's not
>>>>>>>>>      >>>>   arbitrarily
>>>>>>>>>      >>>>   >     limited.
>>>>>>>>>      >>>>
>>>>>>>>>      >>>>   Thanks Jason, Lionel.
>>>>>>>>>      >>>>
>>>>>>>>>      >>>>   Jason, what are you referring to when you say
>>>>>>>>>"limits the i915
>>>>>>>>>      API
>>>>>>>>>      >>>>   already
>>>>>>>>>      >>>>   has on the number of engines"? I am not sure if
>>>>>>>>>there is such
>>>>>>>>>      an uapi
>>>>>>>>>      >>>>   today.
>>>>>>>>>      >>>>
>>>>>>>>>      >>>> There's a limit of something like 64 total engines
>>>>>>>>>today based on
>>>>>>>>>      the
>>>>>>>>>      >>>> number of bits we can cram into the exec flags in
>>>>>>>>>execbuffer2.  I
>>>>>>>>>      think
>>>>>>>>>      >>>> someone had an extended version that allowed more
>>>>>>>>>but I ripped it
>>>>>>>>>      out
>>>>>>>>>      >>>> because no one was using it.  Of course,
>>>>>>>>>execbuffer3 might not
>>>>>>>>>      >>>>have that
>>>>>>>>>      >>>> problem at all.
>>>>>>>>>      >>>>
>>>>>>>>>      >>>
>>>>>>>>>      >>>Thanks Jason.
>>>>>>>>>      >>>Ok, I am not sure which exec flag that is, but yah,
>>>>>>>>>execbuffer3
>>>>>>>>>      probably
>>>>>>>>>      >>>will not have this limitation. So, we need to define a
>>>>>>>>>      VM_BIND_MAX_QUEUE
>>>>>>>>>      >>>and somehow export it to user (I am thinking of
>>>>>>>>>embedding it in
>>>>>>>>>      >>>I915_PARAM_HAS_VM_BIND. bits[0]->HAS_VM_BIND, 
>>>>bits[1-3]->'n'
>>>>>>>>>      meaning 2^n
>>>>>>>>>      >>>queues).
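
A sketch of how userspace could decode a param packed that way (the bit
layout is only a proposal at this point; the helper names are made up):

#include <stdint.h>

/* bit 0: VM_BIND supported; bits [3:1]: 'n', advertising 2^n queues */
static inline int has_vm_bind(uint64_t param)
{
	return param & 0x1;
}

static inline uint32_t vm_bind_num_queues(uint64_t param)
{
	uint32_t n = (param >> 1) & 0x7;  /* bits[1-3] */

	return 1u << n;                   /* 2^n queues */
}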
>>>>>>>>>      >>
>>>>>>>>>      >>Ah, I think you are talking about I915_EXEC_RING_MASK
>>>>>>>>>(0x3f) which
>>>>>>>>>      execbuf3
>>>>>>>>>
>>>>>>>>>    Yup!  That's exactly the limit I was talking about.
>>>>>>>>>
>>>>>>>>>      >>will also have. So, we can simply define in vm_bind/unbind
>>>>>>>>>      structures,
>>>>>>>>>      >>
>>>>>>>>>      >>#define I915_VM_BIND_MAX_QUEUE   64
>>>>>>>>>      >>        __u32 queue;
>>>>>>>>>      >>
>>>>>>>>>      >>I think that will keep things simple.
>>>>>>>>>      >
>>>>>>>>>      >Hmmm? What does the execbuf2 limit have to do with how
>>>>many engines
>>>>>>>>>      >the hardware can have? I suggest not to do that.
>>>>>>>>>      >
>>>>>>>>>      >The change which added this:
>>>>>>>>>      >
>>>>>>>>>      >       if (set.num_engines > I915_EXEC_RING_MASK + 1)
>>>>>>>>>      >               return -EINVAL;
>>>>>>>>>      >
>>>>>>>>>      >to context creation needs to be undone so as to let users
>>>>>>>>>create engine
>>>>>>>>>      >maps with all hardware engines, and let execbuf3 access
>>>>>>>>>them all.
>>>>>>>>>      >
>>>>>>>>>
>>>>>>>>>      The earlier plan was to carry I915_EXEC_RING_MASK (0x3f) to
>>>>>>>>>execbuf3 also.
>>>>>>>>>      Hence, I was using the same limit for VM_BIND queues
>>>>>>>>>(64, or 65 if we
>>>>>>>>>      make it N+1).
>>>>>>>>>      But, as discussed in other thread of this RFC series, we
>>>>>>>>>are planning
>>>>>>>>>      to drop this I915_EXEC_RING_MASK in execbuf3. So, 
>>>>there won't be
>>>>>>>>>      any uapi that limits the number of engines (and hence
>>>>>>>>>the number of vm_bind
>>>>>>>>>      queues
>>>>>>>>>      that need to be supported).
>>>>>>>>>
>>>>>>>>>      If we leave the number of vm_bind queues to be 
>>>>arbitrarily large
>>>>>>>>>      (__u32 queue_idx) then we need to have a hashmap for
>>>>>>>>>queue (a wq,
>>>>>>>>>      work_item and a linked list) lookup from the user
>>>>>>>>>specified queue
>>>>>>>>>      index.
>>>>>>>>>      The other option is to just put some hard limit (say 64 or
>>>>>>>>>65) and use
>>>>>>>>>      an array of queues in VM (each created upon first use).
>>>>>>>>>I prefer this.
>>>>>>>>>
>>>>>>>>>    I don't get why a VM_BIND queue is any different from any
>>>>>>>>>other queue or
>>>>>>>>>    userspace-visible kernel object.  But I'll leave those
>>>>>>>>>details up to
>>>>>>>>>    danvet or whoever else might be reviewing the implementation.
>>>>>>>>>    --Jason
>>>>>>>>>
>>>>>>>>>  I kind of agree here. Wouldn't it be simpler to have the bind
>>>>>>>>>queue created
>>>>>>>>>  like the others when we build the engine map?
>>>>>>>>>
>>>>>>>>>  For userspace it's then just a matter of selecting the right
>>>>>>>>>queue ID when
>>>>>>>>>  submitting.
>>>>>>>>>
>>>>>>>>>  If there is ever a possibility to have this work on the GPU,
>>>>>>>>>it would
>>>>>>>>>  all be ready.
>>>>>>>>>
>>>>>>>>
>>>>>>>>I did sync offline with Matt Brost on this.
>>>>>>>>We can add a VM_BIND engine class and let the user create VM_BIND
>>>>>>>>engines (queues).
>>>>>>>>The problem is, in i915 the engine creation interface is bound to
>>>>>>>>gem_context.
>>>>>>>>So, in vm_bind ioctl, we would need both context_id and
>>>>>>>>queue_idx for proper
>>>>>>>>lookup of the user-created engine. This is a bit awkward as 
>>>>vm_bind is an
>>>>>>>>interface to VM (address space) and has nothing to do with 
>>>>gem_context.
>>>>>>>
>>>>>>>
>>>>>>>A gem_context has a single vm object right?
>>>>>>>
>>>>>>>Set through I915_CONTEXT_PARAM_VM at creation or given a default
>>>>>>>one if not.
>>>>>>>
>>>>>>>So it's just like picking up the vm like it's done at execbuffer
>>>>>>>time right now : eb->context->vm
>>>>>>>
>>>>>>
>>>>>>Are you suggesting replacing 'vm_id' with 'context_id' in the
>>>>>>VM_BIND/UNBIND
>>>>>>ioctl and probably calling it CONTEXT_BIND/UNBIND, because the VM can be
>>>>>>obtained
>>>>>>from the context?
>>>>>
>>>>>
>>>>>Yes, because if we go for engines, they're associated with a context
>>>>>and so also associated with the VM bound to the context.
>>>>>
>>>>
>>>>Hmm...context doesn't sound like the right interface. It should be
>>>>VM and engine (independent of context). Engine can be virtual or soft
>>>>engine (kernel thread), each with its own queue. We can add an 
>>>>interface
>>>>to create such engines (independent of context). But we are anyway
>>>>implicitly creating one when the user uses a new queue_idx. If in the future
>>>>we have hardware engines for VM_BIND operation, we can have that
>>>>explicit interface to create engine instances and the queue_index
>>>>in vm_bind/unbind will point to those engines.
>>>>Does anyone have any thoughts? Daniel?
>>>
>>>Exposing gem_context or intel_context to user space is a strange
>>>concept to me. A context represents some HW resources that are used
>>>to complete a certain task. User space should only care about allocating
>>>some resources (memory, queues) and submitting tasks to queues. But user
>>>space doesn't care how a certain task is mapped to a HW context -
>>>the driver/GuC should take care of this.
>>>
>>>So a cleaner interface to me is: user space creates a vm, creates a
>>>gem object, vm_binds it to the vm; allocates queues (which internally
>>>represent compute or blitter HW; a queue can be virtual to the user) for
>>>this vm; and submits tasks to the queues. The user can create multiple
>>>queues under one vm. One queue is only for one vm.
>>>
>>>The i915 driver/GuC manages the HW compute or blitter resources, which
>>>are transparent to user space. When i915 or the GuC decides to schedule
>>>a queue (run tasks on that queue), a HW engine will be picked up and
>>>set up properly for the vm of that queue (i.e., switched to the page
>>>tables of that vm) - this is a context switch.
>>>
>>>From the vm_bind perspective, it simply binds a gem_object to a vm.
>>>Engine/queue is not a parameter to vm_bind, as any engine can be
>>>picked up by i915/GuC to execute a task using the vm-bound va.
>>>
>>>I didn't completely follow the discussion here. Just sharing some
>>>thoughts.
>>>
>>
>>Yah, I agree.
>>
>>Lionel,
>>How about we define the queue as
>>union {
>>       __u32 queue_idx;
>>       __u64 rsvd;
>>}
>>
>>If required, we can extend by expanding the 'rsvd' field to <ctx_id, 
>>queue_idx> later
>>with a flag.
>>
>>Niranjana
>
>
>I did not really understand Oak's comment nor what you're suggesting 
>here to be honest.
>
>
>First the GEM context is already exposed to userspace. It's explicitly 
>created by userspace with DRM_IOCTL_I915_GEM_CONTEXT_CREATE.
>
>We give the GEM context id in every execbuffer we do with 
>drm_i915_gem_execbuffer2::rsvd1.
>
>It's still in the new execbuffer3 proposal being discussed.
>
>
>Second, the GEM context is also where we set the VM with 
>I915_CONTEXT_PARAM_VM.
>
>
>Third, the GEM context also has the list of engines with 
>I915_CONTEXT_PARAM_ENGINES.
>

Yes, the execbuf and engine map creation are tied to gem_context.
(which probably is not the best interface.)

>
>So it makes sense to me to dispatch the vm_bind operation to a GEM 
>context, to a given vm_bind queue, because it's got all the 
>information required :
>
>    - the list of new vm_bind queues
>
>    - the vm that is going to be modified
>

But the operation is performed here on the address space (VM) which
can have multiple gem_contexts referring to it. So, VM is the right
interface here. We need not 'gem_context'ify it.

All we need is multiple queue support for the address space (VM).
Going to gem_context for that just because we have engine creation
support there seems unnecessary and not correct to me.
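
As a rough sketch of that per-VM direction, combining the vm_id interface
with the queue union and the 64-queue cap proposed earlier in this thread
(field names and layout are illustrative only, not the final uapi):

#include <linux/types.h>

#define I915_VM_BIND_MAX_QUEUE 64   /* cap under discussion, not final */

struct drm_i915_gem_vm_bind_sketch {
	__u32 vm_id;             /* the VM is the interface, not gem_context */
	__u32 handle;            /* BO to bind */
	__u64 start;             /* GPU virtual address */
	__u64 offset;            /* offset into the BO */
	__u64 length;            /* mapping size */
	union {
		__u32 queue_idx; /* per-VM bind queue, < MAX_QUEUE */
		__u64 rsvd;      /* could later grow to <ctx_id, queue_idx> */
	};
	__u64 flags;
};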

>
>Otherwise where do the vm_bind queues live?
>
>In the i915/drm fd object?
>
>That would mean that all the GEM contexts are sharing the same vm_bind 
>queues.
>

Not all, only the gem contexts that are using the same address space (VM).
But to me the right way to describe it would be that "the VM will be using
those queues".

Niranjana
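
For reference, a kernel-side sketch of the "array of per-VM queues, each
created upon first use" option discussed above (all structures and names
invented for illustration; error unwinding trimmed):

#include <linux/err.h>
#include <linux/mutex.h>
#include <linux/slab.h>
#include <linux/types.h>
#include <linux/workqueue.h>

#define VM_BIND_MAX_QUEUE 64

struct vm_bind_queue {
	struct workqueue_struct *wq;  /* ordered wq => in-order completion */
};

struct sketch_vm {
	struct mutex lock;
	struct vm_bind_queue *queues[VM_BIND_MAX_QUEUE];
};

static struct vm_bind_queue *get_bind_queue(struct sketch_vm *vm, u32 idx)
{
	struct vm_bind_queue *q;

	if (idx >= VM_BIND_MAX_QUEUE)
		return ERR_PTR(-EINVAL);

	mutex_lock(&vm->lock);
	q = vm->queues[idx];
	if (!q) {  /* created upon first use */
		q = kzalloc(sizeof(*q), GFP_KERNEL);
		if (q) {
			q->wq = alloc_ordered_workqueue("vm_bind_%u", 0, idx);
			vm->queues[idx] = q;
		}
	}
	mutex_unlock(&vm->lock);

	return q ?: ERR_PTR(-ENOMEM);
}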

>
>intel_context or GuC are internal details we're not concerned about.
>
>I don't really see the connection with the GEM context.
>
>
>Maybe Oak has a different use case than Vulkan.
>
>
>-Lionel
>
>
>>
>>>Regards,
>>>Oak
>>>
>>>>
>>>>Niranjana
>>>>
>>>>>
>>>>>>I think the interface is clean as an interface to the VM. It is 
>>>>only that we
>>>>>>don't have a clean way to create a raw VM_BIND engine (not
>>>>>>associated with
>>>>>>any context) with i915 uapi.
>>>>>>Maybe we can add such an interface, but I don't think that is 
>>>>worth it
>>>>>>(we might as well just use a queue_idx in VM_BIND/UNBIND ioctl as I
>>>>>>mentioned
>>>>>>above).
>>>>>>Does anyone have any thoughts?
>>>>>>
>>>>>>>
>>>>>>>>Another problem is, if two VMs are binding with the same defined
>>>>>>>>engine,
>>>>>>>>binding on VM1 can get unnecessarily blocked by binding on VM2
>>>>>>>>(which may be
>>>>>>>>waiting on its in_fence).
>>>>>>>
>>>>>>>
>>>>>>>Maybe I'm missing something, but how can you have 2 vm objects
>>>>>>>with a single gem_context right now?
>>>>>>>
>>>>>>
>>>>>>No, we don't have 2 VMs for a gem_context.
>>>>>>Say ctx1 is with vm1 and ctx2 is with vm2.
>>>>>>First vm_bind call was for vm1 with q_idx 1 in ctx1 engine map.
>>>>>>Second vm_bind call was for vm2 with q_idx 2 in ctx2 engine map. If
>>>>>>those two queue indices point to the same underlying vm_bind engine,
>>>>>>then the second vm_bind call gets blocked until the first 
>>>>vm_bind call's
>>>>>>'in' fence is triggered and the bind completes.
>>>>>>
>>>>>>With per-VM queues, this is not a problem as two VMs will not end up
>>>>>>sharing the same queue.
>>>>>>
>>>>>>BTW, I just posted an updated PATCH series.
>>>>>>https://www.spinics.net/lists/dri-devel/msg350483.html
>>>>>>
>>>>>>Niranjana
>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>>So, my preference here is to just add a 'u32 queue' index in
>>>>>>>>vm_bind/unbind
>>>>>>>>ioctl, and the queues are per VM.
>>>>>>>>
>>>>>>>>Niranjana
>>>>>>>>
>>>>>>>>>  Thanks,
>>>>>>>>>
>>>>>>>>>  -Lionel
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>      Niranjana
>>>>>>>>>
>>>>>>>>>      >Regards,
>>>>>>>>>      >
>>>>>>>>>      >Tvrtko
>>>>>>>>>      >
>>>>>>>>>      >>
>>>>>>>>>      >>Niranjana
>>>>>>>>>      >>
>>>>>>>>>      >>>
>>>>>>>>>      >>>>   I am trying to see how many queues we need and
>>>>>>>>>don't want it to
>>>>>>>>>      be
>>>>>>>>>      >>>>   arbitrarily
>>>>>>>>>      >>>>   large and unduly blow up memory usage and
>>>>>>>>>complexity in the i915
>>>>>>>>>      driver.
>>>>>>>>>      >>>>
>>>>>>>>>      >>>> I expect a Vulkan driver to use at most 2 in the
>>>>>>>>>vast majority
>>>>>>>>>      >>>>of cases. I
>>>>>>>>>      >>>> could imagine a client wanting to create more 
>>>>than 1 sparse
>>>>>>>>>      >>>>queue in which
>>>>>>>>>      >>>> case, it'll be N+1 but that's unlikely. As far as
>>>>>>>>>complexity
>>>>>>>>>      >>>>goes, once
>>>>>>>>>      >>>> you allow two, I don't think the complexity is 
>>>>going up by
>>>>>>>>>      >>>>allowing N.  As
>>>>>>>>>      >>>> for memory usage, creating more queues means more
>>>>>>>>>memory.  That's
>>>>>>>>>      a
>>>>>>>>>      >>>> trade-off that userspace can make. Again, the
>>>>>>>>>expected number
>>>>>>>>>      >>>>here is 1
>>>>>>>>>      >>>> or 2 in the vast majority of cases so I don't think
>>>>>>>>>you need to
>>>>>>>>>      worry.
>>>>>>>>>      >>>
>>>>>>>>>      >>>Ok, will start with n=3 meaning 8 queues.
>>>>>>>>>      >>>That would require us to create 8 workqueues.
>>>>>>>>>      >>>We can change 'n' later if required.
>>>>>>>>>      >>>
>>>>>>>>>      >>>Niranjana
>>>>>>>>>      >>>
>>>>>>>>>      >>>>
>>>>>>>>>      >>>>   >     Why? Because Vulkan has two basic kinds of bind
>>>>>>>>>      >>>>operations and we
>>>>>>>>>      >>>>   don't
>>>>>>>>>      >>>>   >     want any dependencies between them:
>>>>>>>>>      >>>>   >      1. Immediate.  These happen right after BO
>>>>>>>>>creation or
>>>>>>>>>      >>>>maybe as
>>>>>>>>>      >>>>   part of
>>>>>>>>>      >>>>   > vkBindImageMemory() or vkBindBufferMemory().  These
>>>>>>>>>      >>>>don't happen
>>>>>>>>>      >>>>   on a
>>>>>>>>>      >>>>   >     queue and we don't want them serialized
>>>>>>>>>with anything.  To
>>>>>>>>>      >>>>   synchronize
>>>>>>>>>      >>>>   >     with submit, we'll have a syncobj in the
>>>>>>>>>VkDevice which
>>>>>>>>>      is
>>>>>>>>>      >>>>   signaled by
>>>>>>>>>      >>>>   >     all immediate bind operations and make
>>>>>>>>>submits wait on
>>>>>>>>>      it.
>>>>>>>>>      >>>>   >      2. Queued (sparse): These happen on a
>>>>>>>>>VkQueue which may
>>>>>>>>>      be the
>>>>>>>>>      >>>>   same as
>>>>>>>>>      >>>>   >     a render/compute queue or may be its own
>>>>>>>>>queue.  It's up
>>>>>>>>>      to us
>>>>>>>>>      >>>>   what we
>>>>>>>>>      >>>>   >     want to advertise.  From the Vulkan API
>>>>>>>>>PoV, this is like
>>>>>>>>>      any
>>>>>>>>>      >>>>   other
>>>>>>>>>      >>>>   >     queue. Operations on it wait on and signal
>>>>>>>>>semaphores.  If we
>>>>>>>>>      >>>>   have a
>>>>>>>>>      >>>>   >     VM_BIND engine, we'd provide syncobjs to 
>>>>wait and
>>>>>>>>>      >>>>signal just like
>>>>>>>>>      >>>>   we do
>>>>>>>>>      >>>>   >     in execbuf().
>>>>>>>>>      >>>>   >     The important thing is that we don't want
>>>>>>>>>one type of
>>>>>>>>>      >>>>operation to
>>>>>>>>>      >>>>   block
>>>>>>>>>      >>>>   >     on the other.  If immediate binds are
>>>>>>>>>blocking on sparse
>>>>>>>>>      binds,
>>>>>>>>>      >>>>   it's
>>>>>>>>>      >>>>   >     going to cause over-synchronization issues.
>>>>>>>>>      >>>>   >     In terms of the internal implementation, I
>>>>>>>>>know that
>>>>>>>>>      >>>>there's going
>>>>>>>>>      >>>>   to be
>>>>>>>>>      >>>>   >     a lock on the VM and that we can't actually
>>>>>>>>>do these
>>>>>>>>>      things in
>>>>>>>>>      >>>>   > parallel.  That's fine. Once the dma_fences have
>>>>>>>>>      signaled and
>>>>>>>>>      >>>>   we're
>>>>>>>>>      >>>>
>>>>>>>>>      >>>>   That's correct. It is like a single VM_BIND 
>>>>engine with
>>>>>>>>>      >>>>multiple queues
>>>>>>>>>      >>>>   feeding to it.
>>>>>>>>>      >>>>
>>>>>>>>>      >>>> Right.  As long as the queues themselves are
>>>>>>>>>independent and
>>>>>>>>>      >>>>can block on
>>>>>>>>>      >>>> dma_fences without holding up other queues, I think
>>>>>>>>>we're fine.
>>>>>>>>>      >>>>
>>>>>>>>>      >>>>   > unblocked to do the bind operation, I don't care if
>>>>>>>>>      >>>>there's a bit
>>>>>>>>>      >>>>   of
>>>>>>>>>      >>>>   > synchronization due to locking.  That's
>>>>>>>>>expected.  What
>>>>>>>>>      >>>>we can't
>>>>>>>>>      >>>>   afford
>>>>>>>>>      >>>>   >     to have is an immediate bind operation
>>>>>>>>>suddenly blocking
>>>>>>>>>      on a
>>>>>>>>>      >>>>   sparse
>>>>>>>>>      >>>>   > operation which is blocked on a compute job
>>>>>>>>>that's going
>>>>>>>>>      to run
>>>>>>>>>      >>>>   for
>>>>>>>>>      >>>>   >     another 5ms.
>>>>>>>>>      >>>>
>>>>>>>>>      >>>>   As the VM_BIND queue is per VM, VM_BIND on one VM
>>>>>>>>>doesn't block
>>>>>>>>>      the
>>>>>>>>>      >>>>   VM_BIND
>>>>>>>>>      >>>>   on other VMs. I am not sure about usecases 
>>>>here, but just
>>>>>>>>>      wanted to
>>>>>>>>>      >>>>   clarify.
>>>>>>>>>      >>>>
>>>>>>>>>      >>>> Yes, that's what I would expect.
>>>>>>>>>      >>>> --Jason
>>>>>>>>>      >>>>
>>>>>>>>>      >>>>   Niranjana
>>>>>>>>>      >>>>
>>>>>>>>>      >>>>   >     For reference, Windows solves this by allowing
>>>>>>>>>      arbitrarily many
>>>>>>>>>      >>>>   paging
>>>>>>>>>      >>>>   >     queues (what they call a VM_BIND
>>>>>>>>>engine/queue).  That
>>>>>>>>>      >>>>design works
>>>>>>>>>      >>>>   >     pretty well and solves the problems in
>>>>>>>>>question.  Again, we could
>>>>>>>>>      >>>>   just
>>>>>>>>>      >>>>   >     make everything out-of-order and require
>>>>>>>>>using syncobjs
>>>>>>>>>      >>>>to order
>>>>>>>>>      >>>>   things
>>>>>>>>>      >>>>   >     as userspace wants. That'd be fine too.
>>>>>>>>>      >>>>   >     One more note while I'm here: danvet said
>>>>>>>>>something on
>>>>>>>>>      >>>>IRC about
>>>>>>>>>      >>>>   VM_BIND
>>>>>>>>>      >>>>   >     queues waiting for syncobjs to
>>>>>>>>>materialize.  We don't
>>>>>>>>>      really
>>>>>>>>>      >>>>   want/need
>>>>>>>>>      >>>>   >     this. We already have all the machinery in
>>>>>>>>>userspace to
>>>>>>>>>      handle
>>>>>>>>>      >>>>   > wait-before-signal and waiting for syncobj
>>>>>>>>>fences to
>>>>>>>>>      >>>>materialize
>>>>>>>>>      >>>>   and
>>>>>>>>>      >>>>   >     that machinery is on by default.  It 
>>>>would actually
>>>>>>>>>      >>>>take MORE work
>>>>>>>>>      >>>>   in
>>>>>>>>>      >>>>   >     Mesa to turn it off and take advantage of
>>>>>>>>>the kernel
>>>>>>>>>      >>>>being able to
>>>>>>>>>      >>>>   wait
>>>>>>>>>      >>>>   >     for syncobjs to materialize. Also, getting
>>>>>>>>>that right is
>>>>>>>>>      >>>>   ridiculously
>>>>>>>>>      >>>>   >     hard and I really don't want to get it
>>>>>>>>>wrong in kernel
>>>>>>>>>      >>>>space.  When we
>>>>>>>>>      >>>>   >     do memory fences, wait-before-signal will
>>>>>>>>>be a thing.  We
>>>>>>>>>      don't
>>>>>>>>>      >>>>   need to
>>>>>>>>>      >>>>   >     try and make it a thing for syncobj.
>>>>>>>>>      >>>>   >     --Jason
>>>>>>>>>      >>>>   >
>>>>>>>>>      >>>>   >   Thanks Jason,
>>>>>>>>>      >>>>   >
>>>>>>>>>      >>>>   >   I missed the bit in the Vulkan spec that
>>>>>>>>>we're allowed to
>>>>>>>>>      have a
>>>>>>>>>      >>>>   sparse
>>>>>>>>>      >>>>   >   queue that does not implement either graphics
>>>>>>>>>or compute
>>>>>>>>>      >>>>operations
>>>>>>>>>      >>>>   :
>>>>>>>>>      >>>>   >
>>>>>>>>>      >>>>   >     "While some implementations may include
>>>>>>>>>      >>>> VK_QUEUE_SPARSE_BINDING_BIT
>>>>>>>>>      >>>>   >     support in queue families that also include
>>>>>>>>>      >>>>   >
>>>>>>>>>      >>>>   > graphics and compute support, other
>>>>>>>>>implementations may
>>>>>>>>>      only
>>>>>>>>>      >>>>   expose a
>>>>>>>>>      >>>>   > VK_QUEUE_SPARSE_BINDING_BIT-only queue
>>>>>>>>>      >>>>   >
>>>>>>>>>      >>>>   > family."
>>>>>>>>>      >>>>   >
>>>>>>>>>      >>>>   >   So it can all be a vm_bind engine that 
>>>>just does
>>>>>>>>>      bind/unbind
>>>>>>>>>      >>>>   > operations.
>>>>>>>>>      >>>>   >
>>>>>>>>>      >>>>   >   But yes we need another engine for the
>>>>>>>>>immediate/non-sparse
>>>>>>>>>      >>>>   operations.
>>>>>>>>>      >>>>   >
>>>>>>>>>      >>>>   >   -Lionel
>>>>>>>>>      >>>>   >
>>>>>>>>>      >>>>   >         >
>>>>>>>>>      >>>>   > Daniel, any thoughts?
>>>>>>>>>      >>>>   >
>>>>>>>>>      >>>>   > Niranjana
>>>>>>>>>      >>>>   >
>>>>>>>>>      >>>>   > >Matt
>>>>>>>>>      >>>>   >       >
>>>>>>>>>      >>>>   > >>
>>>>>>>>>      >>>>   > >> Sorry I noticed this late.
>>>>>>>>>      >>>>   > >>
>>>>>>>>>      >>>>   > >>
>>>>>>>>>      >>>>   > >> -Lionel
>>>>>>>>>      >>>>   > >>
>>>>>>>>>      >>>>   > >>
>>>>>>>
>>>>>>>
>>>>>
>
>

^ permalink raw reply	[flat|nested] 121+ messages in thread

* RE: [Intel-gfx] [RFC v3 1/3] drm/doc/rfc: VM_BIND feature design document
  2022-06-14 17:01                                               ` Niranjana Vishwanathapura
@ 2022-06-14 21:12                                                 ` Zeng, Oak
  -1 siblings, 0 replies; 121+ messages in thread
From: Zeng, Oak @ 2022-06-14 21:12 UTC (permalink / raw)
  To: Vishwanathapura, Niranjana, Landwerlin, Lionel G
  Cc: Intel GFX, Wilson, Chris P, Hellstrom, Thomas,
	Maling list - DRI developers, Vetter, Daniel,
	Christian König



Thanks,
Oak

> -----Original Message-----
> From: Vishwanathapura, Niranjana <niranjana.vishwanathapura@intel.com>
> Sent: June 14, 2022 1:02 PM
> To: Landwerlin, Lionel G <lionel.g.landwerlin@intel.com>
> Cc: Zeng, Oak <oak.zeng@intel.com>; Intel GFX <intel-
> gfx@lists.freedesktop.org>; Maling list - DRI developers <dri-
> devel@lists.freedesktop.org>; Hellstrom, Thomas
> <thomas.hellstrom@intel.com>; Wilson, Chris P <chris.p.wilson@intel.com>;
> Vetter, Daniel <daniel.vetter@intel.com>; Christian König
> <christian.koenig@amd.com>
> Subject: Re: [Intel-gfx] [RFC v3 1/3] drm/doc/rfc: VM_BIND feature design
> document
> 
> On Tue, Jun 14, 2022 at 10:04:00AM +0300, Lionel Landwerlin wrote:
> >On 13/06/2022 21:02, Niranjana Vishwanathapura wrote:
> >>On Mon, Jun 13, 2022 at 06:33:07AM -0700, Zeng, Oak wrote:
> [snip]
> Yes, the execbuf and engine map creation are tied to gem_context.
> (which probably is not the best interface.)
> 
> >
> >So it makes sense to me to dispatch the vm_bind operation to a GEM
> >context, to a given vm_bind queue, because it's got all the
> >information required :
> >
> >    - the list of new vm_bind queues
> >
> >    - the vm that is going to be modified
> >
> 
> But the operation is performed here on the address space (VM) which
> can have multiple gem_contexts referring to it. So, VM is the right
> interface here. We need not 'gem_context'ify it.
> 
> All we need is multiple queue support for the address space (VM).
> Going to gem_context for that just because we have engine creation
> support there seems unnecessary and not correct to me.
> 
> >
> >Otherwise where do the vm_bind queues live?
> >
> >In the i915/drm fd object?
> >
> >That would mean that all the GEM contexts are sharing the same vm_bind
> >queues.
> >
> 
> Not all, only the gem contexts that are using the same address space (VM).
> But to me the right way to describe it would be that "the VM will be using
> those queues".


I hope by "queue" here you mean a HW resource that will later be used to execute the job, for example a ccs compute engine. Of course, a queue can be virtual, so the user can create more queues than what the HW physically has.

To express the concept of "the VM will be using those queues", I think it makes sense to have a create_queue(vm) function taking a vm parameter. This means the queue is created for the purpose of submitting jobs under this VM. Later on, we can submit jobs (referring to objects vm_bound to the same vm) to the queue. The vm_bind ioctl doesn't need to have a queue parameter, just vm_bind(object, va, vm).

I hope the "queue" here is not the engine used to perform the vm_bind operation itself. But if you meant a queue/engine to perform vm_bind itself (vs. a queue/engine for later job submission), then we can discuss more. I know the xe driver has a similar concept and I think aligning the design early can benefit the migration to the xe driver.

Regards,
Oak
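
A sketch of the flow described above, where queues are created against a
VM and vm_bind itself takes no queue (every call below is a hypothetical
wrapper, not an existing ioctl):

#include <stdint.h>

uint32_t create_vm(int fd);                          /* address space */
uint32_t create_gem_object(int fd, uint64_t size);
int      vm_bind(int fd, uint32_t vm, uint32_t obj, uint64_t va);
uint32_t create_queue(int fd, uint32_t vm);          /* queue is per-VM */
int      submit(int fd, uint32_t queue, void *batch);

static void example(int fd, void *batch)
{
	uint32_t vm  = create_vm(fd);
	uint32_t obj = create_gem_object(fd, 4096);

	vm_bind(fd, vm, obj, 0x100000);    /* no queue parameter */

	uint32_t q = create_queue(fd, vm); /* created for this vm */

	submit(fd, q, batch);              /* job uses vm-bound VAs */
}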

> [snip]
> >>>>>>>>>      >>>> --Jason
> >>>>>>>>>      >>>>
> >>>>>>>>>      >>>>   Niranjana
> >>>>>>>>>      >>>>
> >>>>>>>>>      >>>>   >     For reference, Windows solves this by allowing
> >>>>>>>>>      arbitrarily many
> >>>>>>>>>      >>>>   paging
> >>>>>>>>>      >>>>   >     queues (what they call a VM_BIND
> >>>>>>>>>engine/queue).  That
> >>>>>>>>>      >>>>design works
> >>>>>>>>>      >>>>   >     pretty well and solves the problems in
> >>>>>>>>>question. Again, we could
> >>>>>>>>>      >>>>   just
> >>>>>>>>>      >>>>   >     make everything out-of-order and require
> >>>>>>>>>using syncobjs
> >>>>>>>>>      >>>>to order
> >>>>>>>>>      >>>>   things
> >>>>>>>>>      >>>>   >     as userspace wants. That'd be fine too.
> >>>>>>>>>      >>>>   >     One more note while I'm here: danvet said
> >>>>>>>>>something on
> >>>>>>>>>      >>>>IRC about
> >>>>>>>>>      >>>>   VM_BIND
> >>>>>>>>>      >>>>   >     queues waiting for syncobjs to
> >>>>>>>>>materialize.  We don't
> >>>>>>>>>      really
> >>>>>>>>>      >>>>   want/need
> >>>>>>>>>      >>>>   >     this. We already have all the machinery in
> >>>>>>>>>userspace to
> >>>>>>>>>      handle
> >>>>>>>>>      >>>>   > wait-before-signal and waiting for syncobj
> >>>>>>>>>fences to
> >>>>>>>>>      >>>>materialize
> >>>>>>>>>      >>>>   and
> >>>>>>>>>      >>>>   >     that machinery is on by default.  It
> >>>>would actually
> >>>>>>>>>      >>>>take MORE work
> >>>>>>>>>      >>>>   in
> >>>>>>>>>      >>>>   >     Mesa to turn it off and take advantage of
> >>>>>>>>>the kernel
> >>>>>>>>>      >>>>being able to
> >>>>>>>>>      >>>>   wait
> >>>>>>>>>      >>>>   >     for syncobjs to materialize. Also, getting
> >>>>>>>>>that right is
> >>>>>>>>>      >>>>   ridiculously
> >>>>>>>>>      >>>>   >     hard and I really don't want to get it
> >>>>>>>>>wrong in kernel
> >>>>>>>>>      >>>>space. When we
> >>>>>>>>>      >>>>   >     do memory fences, wait-before-signal will
> >>>>>>>>>be a thing.  We
> >>>>>>>>>      don't
> >>>>>>>>>      >>>>   need to
> >>>>>>>>>      >>>>   >     try and make it a thing for syncobj.
> >>>>>>>>>      >>>>   >     --Jason
> >>>>>>>>>      >>>>   >
> >>>>>>>>>      >>>>   >   Thanks Jason,
> >>>>>>>>>      >>>>   >
> >>>>>>>>>      >>>>   >   I missed the bit in the Vulkan spec that
> >>>>>>>>>we're allowed to
> >>>>>>>>>      have a
> >>>>>>>>>      >>>>   sparse
> >>>>>>>>>      >>>>   >   queue that does not implement either graphics
> >>>>>>>>>or compute
> >>>>>>>>>      >>>>operations
> >>>>>>>>>      >>>>   :
> >>>>>>>>>      >>>>   >
> >>>>>>>>>      >>>>   >     "While some implementations may include
> >>>>>>>>>      >>>> VK_QUEUE_SPARSE_BINDING_BIT
> >>>>>>>>>      >>>>   >     support in queue families that also include
> >>>>>>>>>      >>>>   >
> >>>>>>>>>      >>>>   > graphics and compute support, other
> >>>>>>>>>implementations may
> >>>>>>>>>      only
> >>>>>>>>>      >>>>   expose a
> >>>>>>>>>      >>>>   > VK_QUEUE_SPARSE_BINDING_BIT-only queue
> >>>>>>>>>      >>>>   >
> >>>>>>>>>      >>>>   > family."
> >>>>>>>>>      >>>>   >
> >>>>>>>>>      >>>>   >   So it can all be all a vm_bind engine that
> >>>>just does
> >>>>>>>>>      bind/unbind
> >>>>>>>>>      >>>>   > operations.
> >>>>>>>>>      >>>>   >
> >>>>>>>>>      >>>>   >   But yes we need another engine for the
> >>>>>>>>>immediate/non-sparse
> >>>>>>>>>      >>>>   operations.
> >>>>>>>>>      >>>>   >
> >>>>>>>>>      >>>>   >   -Lionel
> >>>>>>>>>      >>>>   >
> >>>>>>>>>      >>>>   >         >
> >>>>>>>>>      >>>>   > Daniel, any thoughts?
> >>>>>>>>>      >>>>   >
> >>>>>>>>>      >>>>   > Niranjana
> >>>>>>>>>      >>>>   >
> >>>>>>>>>      >>>>   > >Matt
> >>>>>>>>>      >>>>   >       >
> >>>>>>>>>      >>>>   > >>
> >>>>>>>>>      >>>>   > >> Sorry I noticed this late.
> >>>>>>>>>      >>>>   > >>
> >>>>>>>>>      >>>>   > >>
> >>>>>>>>>      >>>>   > >> -Lionel
> >>>>>>>>>      >>>>   > >>
> >>>>>>>>>      >>>>   > >>
> >>>>>>>
> >>>>>>>
> >>>>>
> >
> >

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [Intel-gfx] [RFC v3 1/3] drm/doc/rfc: VM_BIND feature design document
@ 2022-06-14 21:12                                                 ` Zeng, Oak
  0 siblings, 0 replies; 121+ messages in thread
From: Zeng, Oak @ 2022-06-14 21:12 UTC (permalink / raw)
  To: Vishwanathapura, Niranjana, Landwerlin, Lionel G
  Cc: Intel GFX, Wilson, Chris P, Hellstrom, Thomas,
	Maling list - DRI developers, Vetter, Daniel,
	Christian König



Thanks,
Oak

> -----Original Message-----
> From: Vishwanathapura, Niranjana <niranjana.vishwanathapura@intel.com>
> Sent: June 14, 2022 1:02 PM
> To: Landwerlin, Lionel G <lionel.g.landwerlin@intel.com>
> Cc: Zeng, Oak <oak.zeng@intel.com>; Intel GFX <intel-
> gfx@lists.freedesktop.org>; Maling list - DRI developers <dri-
> devel@lists.freedesktop.org>; Hellstrom, Thomas
> <thomas.hellstrom@intel.com>; Wilson, Chris P <chris.p.wilson@intel.com>;
> Vetter, Daniel <daniel.vetter@intel.com>; Christian König
> <christian.koenig@amd.com>
> Subject: Re: [Intel-gfx] [RFC v3 1/3] drm/doc/rfc: VM_BIND feature design
> document
> 
> On Tue, Jun 14, 2022 at 10:04:00AM +0300, Lionel Landwerlin wrote:
> >On 13/06/2022 21:02, Niranjana Vishwanathapura wrote:
> >>On Mon, Jun 13, 2022 at 06:33:07AM -0700, Zeng, Oak wrote:
> >>>
> >>>
> >>>Regards,
> >>>Oak
> >>>
> >>>>-----Original Message-----
> >>>>From: Intel-gfx <intel-gfx-bounces@lists.freedesktop.org> On
> >>>>Behalf Of Niranjana
> >>>>Vishwanathapura
> >>>>Sent: June 10, 2022 1:43 PM
> >>>>To: Landwerlin, Lionel G <lionel.g.landwerlin@intel.com>
> >>>>Cc: Intel GFX <intel-gfx@lists.freedesktop.org>; Maling list -
> >>>>DRI developers <dri-
> >>>>devel@lists.freedesktop.org>; Hellstrom, Thomas
> >>>><thomas.hellstrom@intel.com>;
> >>>>Wilson, Chris P <chris.p.wilson@intel.com>; Vetter, Daniel
> >>>><daniel.vetter@intel.com>; Christian König
> <christian.koenig@amd.com>
> >>>>Subject: Re: [Intel-gfx] [RFC v3 1/3] drm/doc/rfc: VM_BIND
> >>>>feature design
> >>>>document
> >>>>
> >>>>On Fri, Jun 10, 2022 at 11:18:14AM +0300, Lionel Landwerlin wrote:
> >>>>>On 10/06/2022 10:54, Niranjana Vishwanathapura wrote:
> >>>>>>On Fri, Jun 10, 2022 at 09:53:24AM +0300, Lionel Landwerlin wrote:
> >>>>>>>On 09/06/2022 22:31, Niranjana Vishwanathapura wrote:
> >>>>>>>>On Thu, Jun 09, 2022 at 05:49:09PM +0300, Lionel Landwerlin wrote:
> >>>>>>>>>  On 09/06/2022 00:55, Jason Ekstrand wrote:
> >>>>>>>>>
> >>>>>>>>>    On Wed, Jun 8, 2022 at 4:44 PM Niranjana Vishwanathapura
> >>>>>>>>> <niranjana.vishwanathapura@intel.com> wrote:
> >>>>>>>>>
> >>>>>>>>>      On Wed, Jun 08, 2022 at 08:33:25AM +0100, Tvrtko
> >>>>Ursulin wrote:
> >>>>>>>>>      >
> >>>>>>>>>      >
> >>>>>>>>>      >On 07/06/2022 22:32, Niranjana Vishwanathapura wrote:
> >>>>>>>>>      >>On Tue, Jun 07, 2022 at 11:18:11AM -0700, Niranjana
> >>>>>>>>>Vishwanathapura
> >>>>>>>>>      wrote:
> >>>>>>>>>      >>>On Tue, Jun 07, 2022 at 12:12:03PM -0500, Jason
> >>>>>>>>>Ekstrand wrote:
> >>>>>>>>>      >>>> On Fri, Jun 3, 2022 at 6:52 PM Niranjana
> >>>>Vishwanathapura
> >>>>>>>>>      >>>> <niranjana.vishwanathapura@intel.com> wrote:
> >>>>>>>>>      >>>>
> >>>>>>>>>      >>>>   On Fri, Jun 03, 2022 at 10:20:25AM +0300, Lionel
> >>>>>>>>>Landwerlin
> >>>>>>>>>      wrote:
> >>>>>>>>>      >>>>   >   On 02/06/2022 23:35, Jason Ekstrand wrote:
> >>>>>>>>>      >>>>   >
> >>>>>>>>>      >>>>   >     On Thu, Jun 2, 2022 at 3:11 PM Niranjana
> >>>>>>>>>Vishwanathapura
> >>>>>>>>>      >>>>   > <niranjana.vishwanathapura@intel.com> wrote:
> >>>>>>>>>      >>>>   >
> >>>>>>>>>      >>>>   >       On Wed, Jun 01, 2022 at 01:28:36PM
> >>>>-0700, Matthew
> >>>>>>>>>      >>>>Brost wrote:
> >>>>>>>>>      >>>>   >       >On Wed, Jun 01, 2022 at 05:25:49PM
> >>>>+0300, Lionel
> >>>>>>>>>      Landwerlin
> >>>>>>>>>      >>>>   wrote:
> >>>>>>>>>      >>>>   > >> On 17/05/2022 21:32, Niranjana Vishwanathapura
> >>>>>>>>>      wrote:
> >>>>>>>>>      >>>>   > >> > +VM_BIND/UNBIND ioctl will immediately start
> >>>>>>>>>      >>>>   binding/unbinding
> >>>>>>>>>      >>>>   >       the mapping in an
> >>>>>>>>>      >>>>   > >> > +async worker. The binding and
> >>>>>>>>>unbinding will
> >>>>>>>>>      >>>>work like a
> >>>>>>>>>      >>>>   special
> >>>>>>>>>      >>>>   >       GPU engine.
> >>>>>>>>>      >>>>   > >> > +The binding and unbinding operations are
> >>>>>>>>>      serialized and
> >>>>>>>>>      >>>>   will
> >>>>>>>>>      >>>>   >       wait on specified
> >>>>>>>>>      >>>>   > >> > +input fences before the operation
> >>>>>>>>>and will signal
> >>>>>>>>>      the
> >>>>>>>>>      >>>>   output
> >>>>>>>>>      >>>>   >       fences upon the
> >>>>>>>>>      >>>>   > >> > +completion of the operation. Due to
> >>>>>>>>>      serialization,
> >>>>>>>>>      >>>>   completion of
> >>>>>>>>>      >>>>   >       an operation
> >>>>>>>>>      >>>>   > >> > +will also indicate that all
> >>>>>>>>>previous operations
> >>>>>>>>>      >>>>are also
> >>>>>>>>>      >>>>   > complete.
> >>>>>>>>>      >>>>   > >>
> >>>>>>>>>      >>>>   > >> I guess we should avoid saying "will
> >>>>>>>>>immediately
> >>>>>>>>>      start
> >>>>>>>>>      >>>>   > binding/unbinding" if
> >>>>>>>>>      >>>>   > >> there are fences involved.
> >>>>>>>>>      >>>>   > >>
> >>>>>>>>>      >>>>   > >> And the fact that it's happening in an async
> >>>>>>>>>      >>>>worker seem to
> >>>>>>>>>      >>>>   imply
> >>>>>>>>>      >>>>   >       it's not
> >>>>>>>>>      >>>>   > >> immediate.
> >>>>>>>>>      >>>>   > >>
> >>>>>>>>>      >>>>   >
> >>>>>>>>>      >>>>   >       Ok, will fix.
> >>>>>>>>>      >>>>   >       This was added because in earlier design
> >>>>>>>>>binding was
> >>>>>>>>>      deferred
> >>>>>>>>>      >>>>   until
> >>>>>>>>>      >>>>   >       next execbuff.
> >>>>>>>>>      >>>>   >       But now it is non-deferred (immediate in
> >>>>>>>>>that sense).
> >>>>>>>>>      >>>>But yah,
> >>>>>>>>>      >>>>   this is
> >>>>>>>>>      >>>>   > confusing
> >>>>>>>>>      >>>>   >       and will fix it.
> >>>>>>>>>      >>>>   >
> >>>>>>>>>      >>>>   > >>
> >>>>>>>>>      >>>>   > >> I have a question on the behavior of the bind
> >>>>>>>>>      >>>>operation when
> >>>>>>>>>      >>>>   no
> >>>>>>>>>      >>>>   >       input fence
> >>>>>>>>>      >>>>   > >> is provided. Let say I do :
> >>>>>>>>>      >>>>   > >>
> >>>>>>>>>      >>>>   > >> VM_BIND (out_fence=fence1)
> >>>>>>>>>      >>>>   > >>
> >>>>>>>>>      >>>>   > >> VM_BIND (out_fence=fence2)
> >>>>>>>>>      >>>>   > >>
> >>>>>>>>>      >>>>   > >> VM_BIND (out_fence=fence3)
> >>>>>>>>>      >>>>   > >>
> >>>>>>>>>      >>>>   > >>
> >>>>>>>>>      >>>>   > >> In what order are the fences going to
> >>>>>>>>>be signaled?
> >>>>>>>>>      >>>>   > >>
> >>>>>>>>>      >>>>   > >> In the order of VM_BIND ioctls? Or out
> >>>>>>>>>of order?
> >>>>>>>>>      >>>>   > >>
> >>>>>>>>>      >>>>   > >> Because you wrote "serialized" I assume
> >>>>>>>>>it's: in
> >>>>>>>>>      order
> >>>>>>>>>      >>>>   > >>
> >>>>>>>>>      >>>>   >
> >>>>>>>>>      >>>>   >       Yes, in the order of VM_BIND/UNBIND
> >>>>>>>>>ioctls. Note that
> >>>>>>>>>      >>>>bind and
> >>>>>>>>>      >>>>   unbind
> >>>>>>>>>      >>>>   >       will use
> >>>>>>>>>      >>>>   >       the same queue and hence are ordered.
> >>>>>>>>>      >>>>   >
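A sketch of the per-queue worker semantics implied by that answer
(helper names are hypothetical):

    /* Binds on one queue complete in submission order, so the
     * out-fences above signal as fence1, then fence2, then fence3. */
    static void vm_bind_worker(struct vm_bind_queue *q)
    {
            struct vm_bind_op *op;

            while ((op = next_pending_op(q))) {     /* FIFO */
                    if (op->in_fence)
                            dma_fence_wait(op->in_fence, false);
                    perform_bind(op);               /* update page tables */
                    dma_fence_signal(op->out_fence);
            }
    }
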
> >>>>>>>>>      >>>>   > >>
> >>>>>>>>>      >>>>   > >> One thing I didn't realize is that
> >>>>>>>>>because we only
> >>>>>>>>>      get one
> >>>>>>>>>      >>>>   > "VM_BIND" engine,
> >>>>>>>>>      >>>>   > >> there is a disconnect from the Vulkan
> >>>>>>>>>specification.
> >>>>>>>>>      >>>>   > >>
> >>>>>>>>>      >>>>   > >> In Vulkan VM_BIND operations are
> >>>>>>>>>serialized but
> >>>>>>>>>      >>>>per engine.
> >>>>>>>>>      >>>>   > >>
> >>>>>>>>>      >>>>   > >> So you could have something like this :
> >>>>>>>>>      >>>>   > >>
> >>>>>>>>>      >>>>   > >> VM_BIND (engine=rcs0, in_fence=fence1,
> >>>>>>>>>      out_fence=fence2)
> >>>>>>>>>      >>>>   > >>
> >>>>>>>>>      >>>>   > >> VM_BIND (engine=ccs0, in_fence=fence3,
> >>>>>>>>>      out_fence=fence4)
> >>>>>>>>>      >>>>   > >>
> >>>>>>>>>      >>>>   > >>
> >>>>>>>>>      >>>>   > >> fence1 is not signaled
> >>>>>>>>>      >>>>   > >>
> >>>>>>>>>      >>>>   > >> fence3 is signaled
> >>>>>>>>>      >>>>   > >>
> >>>>>>>>>      >>>>   > >> So the second VM_BIND will proceed before the
> >>>>>>>>>      >>>>first VM_BIND.
> >>>>>>>>>      >>>>   > >>
> >>>>>>>>>      >>>>   > >>
> >>>>>>>>>      >>>>   > >> I guess we can deal with that scenario in
> >>>>>>>>>      >>>>userspace by doing
> >>>>>>>>>      >>>>   the
> >>>>>>>>>      >>>>   >       wait
> >>>>>>>>>      >>>>   > >> ourselves in one thread per engine.
> >>>>>>>>>      >>>>   > >>
> >>>>>>>>>      >>>>   > >> But then it makes the VM_BIND input
> >>>>>>>>>fences useless.
> >>>>>>>>>      >>>>   > >>
> >>>>>>>>>      >>>>   > >>
> >>>>>>>>>      >>>>   > >> Daniel : what do you think? Should be
> >>>>>>>>>rework this or
> >>>>>>>>>      just
> >>>>>>>>>      >>>>   deal with
> >>>>>>>>>      >>>>   >       wait
> >>>>>>>>>      >>>>   > >> fences in userspace?
> >>>>>>>>>      >>>>   > >>
> >>>>>>>>>      >>>>   >       >
> >>>>>>>>>      >>>>   >       >My opinion is rework this but make the
> >>>>>>>>>ordering via
> >>>>>>>>>      >>>>an engine
> >>>>>>>>>      >>>>   param
> >>>>>>>>>      >>>>   > optional.
> >>>>>>>>>      >>>>   >       >
> >>>>>>>>>      >>>>   > >e.g. A VM can be configured so all binds
> >>>>>>>>>are ordered
> >>>>>>>>>      >>>>within the
> >>>>>>>>>      >>>>   VM
> >>>>>>>>>      >>>>   >       >
> >>>>>>>>>      >>>>   > >e.g. A VM can be configured so all binds
> >>>>>>>>>accept an
> >>>>>>>>>      engine
> >>>>>>>>>      >>>>   argument
> >>>>>>>>>      >>>>   >       (in
> >>>>>>>>>      >>>>   > >the case of the i915 likely this is a
> >>>>>>>>>gem context
> >>>>>>>>>      >>>>handle) and
> >>>>>>>>>      >>>>   binds
> >>>>>>>>>      >>>>   > >ordered with respect to that engine.
> >>>>>>>>>      >>>>   >       >
> >>>>>>>>>      >>>>   > >This gives UMDs options as the latter
> >>>>>>>>>likely consumes
> >>>>>>>>>      >>>>more KMD
> >>>>>>>>>      >>>>   > resources
> >>>>>>>>>      >>>>   >       >so if a different UMD can live with
> >>>>binds being
> >>>>>>>>>      >>>>ordered within
> >>>>>>>>>      >>>>   the VM
> >>>>>>>>>      >>>>   > >they can use a mode consuming less resources.
> >>>>>>>>>      >>>>   >       >
> >>>>>>>>>      >>>>   >
> >>>>>>>>>      >>>>   >       I think we need to be careful here if we
> >>>>>>>>>are looking
> >>>>>>>>>      for some
> >>>>>>>>>      >>>>   out of
> >>>>>>>>>      >>>>   > (submission) order completion of vm_bind/unbind.
> >>>>>>>>>      >>>>   > In-order completion means, in a batch of
> >>>>>>>>>binds and
> >>>>>>>>>      >>>>unbinds to be
> >>>>>>>>>      >>>>   > completed in-order, user only needs to specify
> >>>>>>>>>      >>>>in-fence for the
> >>>>>>>>>      >>>>   >       first bind/unbind call and the out-fence
> >>>>>>>>>for the last
> >>>>>>>>>      >>>>   bind/unbind
> >>>>>>>>>      >>>>   >       call. Also, the VA released by an unbind
> >>>>>>>>>call can be
> >>>>>>>>>      >>>>re-used by
> >>>>>>>>>      >>>>   >       any subsequent bind call in that
> >>>>in-order batch.
> >>>>>>>>>      >>>>   >
> >>>>>>>>>      >>>>   >       These things will break if
> >>>>>>>>>binding/unbinding were to
> >>>>>>>>>      >>>>be allowed
> >>>>>>>>>      >>>>   to
> >>>>>>>>>      >>>>   >       go out of order (of submission) and user
> >>>>>>>>>need to be
> >>>>>>>>>      extra
> >>>>>>>>>      >>>>   careful
> >>>>>>>>>      >>>>   >       not to run into premature triggering of
> >>>>>>>>>out-fence and
> >>>>>>>>>      bind
> >>>>>>>>>      >>>>   failing
> >>>>>>>>>      >>>>   >       as VA is still in use etc.
> >>>>>>>>>      >>>>   >
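A usage sketch of that in-order model, in the same notation as the
examples above (fence names are illustrative):

    VM_BIND   (queue=0, in_fence=fenceA)   /* waits for fenceA */
    VM_UNBIND (queue=0)                    /* ordered after the bind above */
    VM_BIND   (queue=0, out_fence=fenceB)  /* may reuse the VA just unbound */

    /* fenceB signaling implies all three operations have completed */
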
> >>>>>>>>>      >>>>   >       Also, VM_BIND binds the provided
> >>>>mapping on the
> >>>>>>>>>      specified
> >>>>>>>>>      >>>>   address
> >>>>>>>>>      >>>>   >       space
> >>>>>>>>>      >>>>   >       (VM). So, the uapi is not engine/context
> >>>>>>>>>specific.
> >>>>>>>>>      >>>>   >
> >>>>>>>>>      >>>>   >       We can however add a 'queue' to the uapi
> >>>>>>>>>which can be
> >>>>>>>>>      >>>>one from
> >>>>>>>>>      >>>>   the
> >>>>>>>>>      >>>>   > pre-defined queues,
> >>>>>>>>>      >>>>   > I915_VM_BIND_QUEUE_0
> >>>>>>>>>      >>>>   > I915_VM_BIND_QUEUE_1
> >>>>>>>>>      >>>>   >       ...
> >>>>>>>>>      >>>>   > I915_VM_BIND_QUEUE_(N-1)
> >>>>>>>>>      >>>>   >
> >>>>>>>>>      >>>>   >       KMD will spawn an async work queue for
> >>>>>>>>>each queue which
> >>>>>>>>>      will
> >>>>>>>>>      >>>>   only
> >>>>>>>>>      >>>>   >       bind the mappings on that queue in the
> >>>>order of
> >>>>>>>>>      submission.
> >>>>>>>>>      >>>>   >       User can assign the queue to per engine
> >>>>>>>>>or anything
> >>>>>>>>>      >>>>like that.
> >>>>>>>>>      >>>>   >
> >>>>>>>>>      >>>>   >       But again here, user need to be
> >>>>careful and not
> >>>>>>>>>      >>>>deadlock these
> >>>>>>>>>      >>>>   >       queues with circular dependency of fences.
> >>>>>>>>>      >>>>   >
> >>>>>>>>>      >>>>   >       I prefer adding this later as an
> >>>>>>>>>extension based on
> >>>>>>>>>      >>>>whether it
> >>>>>>>>>      >>>>   >       is really helping with the implementation.
> >>>>>>>>>      >>>>   >
> >>>>>>>>>      >>>>   >     I can tell you right now that having
> >>>>>>>>>everything on a
> >>>>>>>>>      single
> >>>>>>>>>      >>>>   in-order
> >>>>>>>>>      >>>>   >     queue will not get us the perf we want.
> >>>>>>>>>What vulkan
> >>>>>>>>>      >>>>really wants
> >>>>>>>>>      >>>>   is one
> >>>>>>>>>      >>>>   >     of two things:
> >>>>>>>>>      >>>>   >      1. No implicit ordering of VM_BIND
> >>>>ops.  They just
> >>>>>>>>>      happen in
> >>>>>>>>>      >>>>   whatever
> >>>>>>>>>      >>>>   >     their dependencies are resolved and we
> >>>>>>>>>ensure ordering
> >>>>>>>>>      >>>>ourselves
> >>>>>>>>>      >>>>   by
> >>>>>>>>>      >>>>   >     having a syncobj in the VkQueue.
> >>>>>>>>>      >>>>   >      2. The ability to create multiple VM_BIND
> >>>>>>>>>queues.  We
> >>>>>>>>>      need at
> >>>>>>>>>      >>>>   least 2
> >>>>>>>>>      >>>>   >     but I don't see why there needs to be a
> >>>>>>>>>limit besides
> >>>>>>>>>      >>>>the limits
> >>>>>>>>>      >>>>   the
> >>>>>>>>>      >>>>   >     i915 API already has on the number of
> >>>>>>>>>engines.  Vulkan
> >>>>>>>>>      could
> >>>>>>>>>      >>>>   expose
> >>>>>>>>>      >>>>   >     multiple sparse binding queues to the
> >>>>>>>>>client if it's not
> >>>>>>>>>      >>>>   arbitrarily
> >>>>>>>>>      >>>>   >     limited.
> >>>>>>>>>      >>>>
> >>>>>>>>>      >>>>   Thanks Jason, Lionel.
> >>>>>>>>>      >>>>
> >>>>>>>>>      >>>>   Jason, what are you referring to when you say
> >>>>>>>>>"limits the i915
> >>>>>>>>>      API
> >>>>>>>>>      >>>>   already
> >>>>>>>>>      >>>>   has on the number of engines"? I am not sure if
> >>>>>>>>>there is such
> >>>>>>>>>      an uapi
> >>>>>>>>>      >>>>   today.
> >>>>>>>>>      >>>>
> >>>>>>>>>      >>>> There's a limit of something like 64 total engines
> >>>>>>>>>today based on
> >>>>>>>>>      the
> >>>>>>>>>      >>>> number of bits we can cram into the exec flags in
> >>>>>>>>>execbuffer2.  I
> >>>>>>>>>      think
> >>>>>>>>>      >>>> someone had an extended version that allowed more
> >>>>>>>>>but I ripped it
> >>>>>>>>>      out
> >>>>>>>>>      >>>> because no one was using it.  Of course,
> >>>>>>>>>execbuffer3 might not
> >>>>>>>>>      >>>>have that
> >>>>>>>>>      >>>> problem at all.
> >>>>>>>>>      >>>>
> >>>>>>>>>      >>>
> >>>>>>>>>      >>>Thanks Jason.
> >>>>>>>>>      >>>Ok, I am not sure which exec flag is that, but yah,
> >>>>>>>>>execbuffer3
> >>>>>>>>>      probably
> >>>>>>>>>      >>>will not have this limitation. So, we need to define a
> >>>>>>>>>      VM_BIND_MAX_QUEUE
> >>>>>>>>>      >>>and somehow export it to user (I am thinking of
> >>>>>>>>>embedding it in
> >>>>>>>>>      >>>I915_PARAM_HAS_VM_BIND. bits[0]->HAS_VM_BIND,
> >>>>bits[1-3]->'n'
> >>>>>>>>>      meaning 2^n
> >>>>>>>>>      >>>queues.
> >>>>>>>>>      >>
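A sketch of how userspace could decode that proposed encoding (the
encoding itself is only a proposal at this point):

    /* value obtained via DRM_IOCTL_I915_GETPARAM with
     * param = I915_PARAM_HAS_VM_BIND (proposed) */
    bool has_vm_bind = value & 0x1;                       /* bits[0] */
    unsigned int n = (value >> 1) & 0x7;                  /* bits[1-3] */
    unsigned int num_queues = has_vm_bind ? 1u << n : 0;  /* 2^n queues */
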
> >>>>>>>>>      >>Ah, I think you are talking about I915_EXEC_RING_MASK
> >>>>>>>>>(0x3f) which
> >>>>>>>>>      execbuf3
> >>>>>>>>>
> >>>>>>>>>    Yup!  That's exactly the limit I was talking about.
> >>>>>>>>>
> >>>>>>>>>      >>will also have. So, we can simply define in vm_bind/unbind
> >>>>>>>>>      structures,
> >>>>>>>>>      >>
> >>>>>>>>>      >>#define I915_VM_BIND_MAX_QUEUE   64
> >>>>>>>>>      >>        __u32 queue;
> >>>>>>>>>      >>
> >>>>>>>>>      >>I think that will keep things simple.
> >>>>>>>>>      >
> >>>>>>>>>      >Hmmm? What does execbuf2 limit has to do with how
> >>>>many engines
> >>>>>>>>>      >hardware can have? I suggest not to do that.
> >>>>>>>>>      >
> >>>>>>>>>      >Change which added this:
> >>>>>>>>>      >
> >>>>>>>>>      >       if (set.num_engines > I915_EXEC_RING_MASK + 1)
> >>>>>>>>>      >               return -EINVAL;
> >>>>>>>>>      >
> >>>>>>>>>      >to context creation needs to be undone, and so let users
> >>>>>>>>>create engine
> >>>>>>>>>      >maps with all hardware engines, and let execbuf3 access
> >>>>>>>>>them all.
> >>>>>>>>>      >
> >>>>>>>>>
> >>>>>>>>>      Earlier plan was to carry I915_EXEC_RING_MASK (0x3f) to
> >>>>>>>>>execbuff3 also.
> >>>>>>>>>      Hence, I was using the same limit for VM_BIND queues
> >>>>>>>>>(64, or 65 if we
> >>>>>>>>>      make it N+1).
> >>>>>>>>>      But, as discussed in other thread of this RFC series, we
> >>>>>>>>>are planning
> >>>>>>>>>      to drop this I915_EXEC_RING_MASK in execbuff3. So,
> >>>>there won't be
> >>>>>>>>>      any uapi that limits the number of engines (and hence
> >>>>>>>>>the vm_bind
> >>>>>>>>>      queues
> >>>>>>>>>      need to be supported).
> >>>>>>>>>
> >>>>>>>>>      If we leave the number of vm_bind queues to be
> >>>>arbitrarily large
> >>>>>>>>>      (__u32 queue_idx) then we need to have a hashmap for
> >>>>>>>>>queue (a wq,
> >>>>>>>>>      work_item and a linked list) lookup from the user
> >>>>>>>>>specified queue
> >>>>>>>>>      index.
> >>>>>>>>>      Other option is to just put some hard limit (say 64 or
> >>>>>>>>>65) and use
> >>>>>>>>>      an array of queues in VM (each created upon first use).
> >>>>>>>>>I prefer this.
> >>>>>>>>>
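The fixed-array option could look roughly like this on the kernel side
(hypothetical names, locking omitted):

    #define I915_VM_BIND_MAX_QUEUE 64

    static struct vm_bind_queue *
    vm_get_bind_queue(struct i915_address_space *vm, u32 queue_idx)
    {
            if (queue_idx >= I915_VM_BIND_MAX_QUEUE)
                    return ERR_PTR(-EINVAL);

            /* created upon first use, no hashmap lookup needed */
            if (!vm->bind_queues[queue_idx])
                    vm->bind_queues[queue_idx] = create_bind_queue(vm);

            return vm->bind_queues[queue_idx];
    }
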
> >>>>>>>>>    I don't get why a VM_BIND queue is any different from any
> >>>>>>>>>other queue or
> >>>>>>>>>    userspace-visible kernel object.  But I'll leave those
> >>>>>>>>>details up to
> >>>>>>>>>    danvet or whoever else might be reviewing the
> implementation.
> >>>>>>>>>    --Jason
> >>>>>>>>>
> >>>>>>>>>  I kind of agree here. Wouldn't it be simpler to have the bind
> >>>>>>>>>queue created
> >>>>>>>>>  like the others when we build the engine map?
> >>>>>>>>>
> >>>>>>>>>  For userspace it's then just matter of selecting the right
> >>>>>>>>>queue ID when
> >>>>>>>>>  submitting.
> >>>>>>>>>
> >>>>>>>>>  If there is ever a possibility to have this work on the GPU,
> >>>>>>>>>it would be
> >>>>>>>>>  all ready.
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>>I did sync offline with Matt Brost on this.
> >>>>>>>>We can add a VM_BIND engine class and let user create VM_BIND
> >>>>>>>>engines (queues).
> >>>>>>>>The problem is, in i915 the engine creation interface is bound to
> >>>>>>>>gem_context.
> >>>>>>>>So, in vm_bind ioctl, we would need both context_id and
> >>>>>>>>queue_idx for proper
> >>>>>>>>lookup of the user created engine. This is a bit awkward as
> >>>>vm_bind is an
> >>>>>>>>interface to VM (address space) and has nothing to do with
> >>>>gem_context.
> >>>>>>>
> >>>>>>>
> >>>>>>>A gem_context has a single vm object right?
> >>>>>>>
> >>>>>>>Set through I915_CONTEXT_PARAM_VM at creation or given a
> default
> >>>>>>>one if not.
> >>>>>>>
> >>>>>>>So it's just like picking up the vm like it's done at execbuffer
> >>>>>>>time right now : eb->context->vm
> >>>>>>>
> >>>>>>
> >>>>>>Are you suggesting replacing 'vm_id' with 'context_id' in the
> >>>>>>VM_BIND/UNBIND
> >>>>>>ioctl and probably call it CONTEXT_BIND/UNBIND, because VM can
> be
> >>>>>>obtained
> >>>>>>from the context?
> >>>>>
> >>>>>
> >>>>>Yes, because if we go for engines, they're associated with a context
> >>>>>and so also associated with the VM bound to the context.
> >>>>>
> >>>>
> >>>>Hmm...context doesn't sound like the right interface. It should be
> >>>>VM and engine (independent of context). Engine can be virtual or soft
> >>>>engine (kernel thread), each with its own queue. We can add an
> >>>>interface
> >>>>to create such engines (independent of context). But we are anyway
> >>>>implicitly creating it when user uses a new queue_idx. If in future
> >>>>we have hardware engines for VM_BIND operation, we can have that
> >>>>explicit inteface to create engine instances and the queue_index
> >>>>in vm_bind/unbind will point to those engines.
> >>>>Anyone have any thoughts? Daniel?
> >>>
> >>>Exposing gem_context or intel_context to user space is a strange
> >>>concept to me. A context represents some HW resources that are used
> >>>to complete a certain task. User space should only care about allocating
> >>>some resources (memory, queues) and submitting tasks to queues. But user
> >>>space doesn't care how a certain task is mapped to a HW context -
> >>>driver/guc should take care of this.
> >>>
> >>>So a cleaner interface to me is: user space creates a vm, creates a
> >>>gem object, vm_binds it to a vm; allocates queues (which internally
> >>>represent compute or blitter HW; a queue can be virtual to the user)
> >>>for this vm; submits tasks to queues. A user can create multiple
> >>>queues under one vm. One queue is only for one vm.
> >>>
> >>>The i915 driver/guc manages the HW compute or blitter resources, which
> >>>are transparent to user space. When i915 or guc decides to schedule
> >>>a queue (run tasks on that queue), a HW engine will be picked up and
> >>>set up properly for the vm of that queue (i.e., switch to the page
> >>>tables of that vm) - this is a context switch.
> >>>
> >>>From the vm_bind perspective, it simply binds a gem_object to a vm.
> >>>Engine/queue is not a parameter to vm_bind, as any engine can be
> >>>picked up by i915/guc to execute a task using the vm-bound va.
> >>>
> >>>I didn't completely follow the discussion here. Just share some
> >>>thoughts.
> >>>
> >>
> >>Yah, I agree.
> >>
> >>Lionel,
> >>How about we define the queue as
> >>union {
> >>       __u32 queue_idx;
> >>       __u64 rsvd;
> >>}
> >>
> >>If required, we can extend by expanding the 'rsvd' field to <ctx_id,
> >>queue_idx> later
> >>with a flag.
> >>
> >>Niranjana
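Spelled out as a uapi field, that could look like (hypothetical, not a
final definition):

    struct drm_i915_gem_vm_bind {
            /* ... */
            union {
                    __u32 queue_idx;  /* per-VM bind queue index */
                    __u64 rsvd;       /* room to grow into a
                                       * <ctx_id, queue_idx> pair,
                                       * selected by a future flag */
            };
            /* ... */
    };
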
> >
> >
> >I did not really understand Oak's comment nor what you're suggesting
> >here to be honest.
> >
> >
> >First the GEM context is already exposed to userspace. It's explicitly
> >created by userspace with DRM_IOCTL_I915_GEM_CONTEXT_CREATE.
> >
> >We give the GEM context id in every execbuffer we do with
> >drm_i915_gem_execbuffer2::rsvd1.
> >
> >It's still in the new execbuffer3 proposal being discussed.
> >
> >
> >Second, the GEM context is also where we set the VM with
> >I915_CONTEXT_PARAM_VM.
> >
> >
> >Third, the GEM context also has the list of engines with
> >I915_CONTEXT_PARAM_ENGINES.
> >
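For reference, that context/VM/engines association in today's uapi is
roughly (error handling omitted):

    struct drm_i915_gem_context_create create = {};
    ioctl(fd, DRM_IOCTL_I915_GEM_CONTEXT_CREATE, &create);

    struct drm_i915_gem_context_param p = {
            .ctx_id = create.ctx_id,
            .param  = I915_CONTEXT_PARAM_VM,
            .value  = vm_id,  /* from DRM_IOCTL_I915_GEM_VM_CREATE */
    };
    ioctl(fd, DRM_IOCTL_I915_GEM_CONTEXT_SETPARAM, &p);

    /* I915_CONTEXT_PARAM_ENGINES is set the same way, passing a
     * struct i915_context_param_engines via .value and .size. */
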
> 
> Yes, the execbuf and engine map creation are tied to gem_context.
> (which probably is not the best interface.)
> 
> >
> >So it makes sense to me to dispatch the vm_bind operation to a GEM
> >context, to a given vm_bind queue, because it's got all the
> >information required :
> >
> >    - the list of new vm_bind queues
> >
> >    - the vm that is going to be modified
> >
> 
> But the operation is performed here on the address space (VM) which
> can have multiple gem_contexts referring to it. So, VM is the right
> interface here. We need not 'gem_context'ify it.
> 
> All we need is multiple queue support for the address space (VM).
> Going to gem_context for that just because we have engine creation
> support there seems unnecessary and not correct to me.
> 
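In other words, the ioctl stays VM-centric; a sketch of the shape being
argued for here (not a final uapi definition):

    struct drm_i915_gem_vm_bind {
            __u32 vm_id;       /* the address space, not a gem_context */
            __u32 queue_idx;   /* per-VM bind queue */
            __u32 handle;      /* gem object handle */
            __u64 start;       /* GPU virtual address */
            __u64 offset;      /* offset into the object */
            __u64 length;
            __u64 flags;
            __u64 extensions;
    };
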
> >
> >Otherwise where do the vm_bind queues live?
> >
> >In the i915/drm fd object?
> >
> >That would mean that all the GEM contexts are sharing the same vm_bind
> >queues.
> >
> 
> Not all, only the gem contexts that are using the same address space (VM).
> But to me the right way to describe it would be that "the VM will be using those
> queues".


I hope by "queue" here you mean a HW resource that will later be used to execute the job, for example a ccs compute engine. Of course, a queue can be virtual, so a user can create more queues than the HW physically has.

To express the concept of "the VM will be using those queues", I think it makes sense to have a create_queue(vm) function taking a vm parameter. This means the queue is created for the purpose of submitting jobs under this VM. Later on, we can submit jobs (referring to objects vm_bound to the same vm) to the queue. The vm_bind ioctl doesn't need to have a queue parameter, just vm_bind(object, va, vm).
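A rough sketch of that alternative (entirely hypothetical names):

    /* The queue is created against a vm; any HW engine may service it. */
    struct drm_i915_queue_create {
            __u32 vm_id;     /* the vm this queue submits jobs under */
            __u32 flags;
            __u32 queue_id;  /* out */
    };

    /* vm_bind itself then carries no queue/engine parameter: */
    vm_bind(object, va, vm);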

I hope the "queue" here is not the engine used to perform the vm_bind operation itself. But if you meant a queue/engine to perform vm_bind itself (vs a queue/engine for later job submission), then we can discuss more. I know the xe driver has a similar concept, and I think aligning the design early can benefit the migration to the xe driver.

Regards,
Oak

> 
> Niranjana
> 
> >
> >intel_context or GuC are internal details we're not concerned about.
> >
> >I don't really see the connection with the GEM context.
> >
> >
> >Maybe Oak has a different use case than Vulkan.
> >
> >
> >-Lionel
> >
> >
> >>
> >>>Regards,
> >>>Oak
> >>>
> >>>>
> >>>>Niranjana
> >>>>
> >>>>>
> >>>>>>I think the interface is clean as an interface to VM. It is
> >>>>only that we
> >>>>>>don't have a clean way to create a raw VM_BIND engine (not
> >>>>>>associated with
> >>>>>>any context) with i915 uapi.
> >>>>>>May be we can add such an interface, but I don't think that is
> >>>>worth it
> >>>>>>(we might as well just use a queue_idx in VM_BIND/UNBIND ioctl as I
> >>>>>>mentioned
> >>>>>>above).
> >>>>>>Anyone have any thoughts?
> >>>>>>
> >>>>>>>
> >>>>>>>>Another problem is, if two VMs are binding with the same defined
> >>>>>>>>engine,
> >>>>>>>>binding on VM1 can get unnecessary blocked by binding on VM2
> >>>>>>>>(which may be
> >>>>>>>>waiting on its in_fence).
> >>>>>>>
> >>>>>>>
> >>>>>>>Maybe I'm missing something, but how can you have 2 vm objects
> >>>>>>>with a single gem_context right now?
> >>>>>>>
> >>>>>>
> >>>>>>No, we don't have 2 VMs for a gem_context.
> >>>>>>Say if ctx1 with vm1 and ctx2 with vm2.
> >>>>>>First vm_bind call was for vm1 with q_idx 1 in ctx1 engine map.
> >>>>>>Second vm_bind call was for vm2 with q_idx 2 in ctx2 engine map. If
> >>>>>>those two queue indices point to the same underlying vm_bind engine,
> >>>>>>then the second vm_bind call gets blocked until the first
> >>>>vm_bind call's
> >>>>>>'in' fence is triggered and bind completes.
> >>>>>>
> >>>>>>With per VM queues, this is not a problem as two VMs will not end up
> >>>>>>sharing the same queue.
> >>>>>>
> >>>>>>BTW, I just posted an updated PATCH series.
> >>>>>>https://www.spinics.net/lists/dri-devel/msg350483.html
> >>>>>>
> >>>>>>Niranjana
> >>>>>>
> >>>>>>>
> >>>>>>>>
> >>>>>>>>So, my preference here is to just add a 'u32 queue' index in
> >>>>>>>>vm_bind/unbind
> >>>>>>>>ioctl, and the queues are per VM.
> >>>>>>>>
> >>>>>>>>Niranjana
> >>>>>>>>
> >>>>>>>>>  Thanks,
> >>>>>>>>>
> >>>>>>>>>  -Lionel
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>      Niranjana
> >>>>>>>>>
> >>>>>>>>>      >Regards,
> >>>>>>>>>      >
> >>>>>>>>>      >Tvrtko
> >>>>>>>>>      >
> >>>>>>>>>      >>
> >>>>>>>>>      >>Niranjana
> >>>>>>>>>      >>
> >>>>>>>>>      >>>
> >>>>>>>>>      >>>>   I am trying to see how many queues we need and
> >>>>>>>>>don't want it to
> >>>>>>>>>      be
> >>>>>>>>>      >>>>   arbitrarily
> >>>>>>>>>      >>>>   large and unduly blow up memory usage and
> >>>>>>>>>complexity in i915
> >>>>>>>>>      driver.
> >>>>>>>>>      >>>>
> >>>>>>>>>      >>>> I expect a Vulkan driver to use at most 2 in the
> >>>>>>>>>vast majority
> >>>>>>>>>      >>>>of cases. I
> >>>>>>>>>      >>>> could imagine a client wanting to create more
> >>>>than 1 sparse
> >>>>>>>>>      >>>>queue in which
> >>>>>>>>>      >>>> case, it'll be N+1 but that's unlikely. As far as
> >>>>>>>>>complexity
> >>>>>>>>>      >>>>goes, once
> >>>>>>>>>      >>>> you allow two, I don't think the complexity is
> >>>>going up by
> >>>>>>>>>      >>>>allowing N.  As
> >>>>>>>>>      >>>> for memory usage, creating more queues means more
> >>>>>>>>>memory.  That's
> >>>>>>>>>      a
> >>>>>>>>>      >>>> trade-off that userspace can make. Again, the
> >>>>>>>>>expected number
> >>>>>>>>>      >>>>here is 1
> >>>>>>>>>      >>>> or 2 in the vast majority of cases so I don't think
> >>>>>>>>>you need to
> >>>>>>>>>      worry.
> >>>>>>>>>      >>>
> >>>>>>>>>      >>>Ok, will start with n=3 meaning 8 queues.
> >>>>>>>>>      >>>That would require us to create 8 workqueues.
> >>>>>>>>>      >>>We can change 'n' later if required.
> >>>>>>>>>      >>>
> >>>>>>>>>      >>>Niranjana
> >>>>>>>>>      >>>
> >>>>>>>>>      >>>>
> >>>>>>>>>      >>>>   >     Why? Because Vulkan has two basic kind of bind
> >>>>>>>>>      >>>>operations and we
> >>>>>>>>>      >>>>   don't
> >>>>>>>>>      >>>>   >     want any dependencies between them:
> >>>>>>>>>      >>>>   >      1. Immediate.  These happen right after BO
> >>>>>>>>>creation or
> >>>>>>>>>      >>>>maybe as
> >>>>>>>>>      >>>>   part of
> >>>>>>>>>      >>>>   > vkBindImageMemory() or
> VkBindBufferMemory().  These
> >>>>>>>>>      >>>>don't happen
> >>>>>>>>>      >>>>   on a
> >>>>>>>>>      >>>>   >     queue and we don't want them serialized
> >>>>>>>>>with anything.       To
> >>>>>>>>>      >>>>   synchronize
> >>>>>>>>>      >>>>   >     with submit, we'll have a syncobj in the
> >>>>>>>>>VkDevice which
> >>>>>>>>>      is
> >>>>>>>>>      >>>>   signaled by
> >>>>>>>>>      >>>>   >     all immediate bind operations and make
> >>>>>>>>>submits wait on
> >>>>>>>>>      it.
> >>>>>>>>>      >>>>   >      2. Queued (sparse): These happen on a
> >>>>>>>>>VkQueue which may
> >>>>>>>>>      be the
> >>>>>>>>>      >>>>   same as
> >>>>>>>>>      >>>>   >     a render/compute queue or may be its own
> >>>>>>>>>queue.  It's up
> >>>>>>>>>      to us
> >>>>>>>>>      >>>>   what we
> >>>>>>>>>      >>>>   >     want to advertise.  From the Vulkan API
> >>>>>>>>>PoV, this is like
> >>>>>>>>>      any
> >>>>>>>>>      >>>>   other
> >>>>>>>>>      >>>>   >     queue. Operations on it wait on and signal
> >>>>>>>>>semaphores.       If we
> >>>>>>>>>      >>>>   have a
> >>>>>>>>>      >>>>   >     VM_BIND engine, we'd provide syncobjs to
> >>>>wait and
> >>>>>>>>>      >>>>signal just like
> >>>>>>>>>      >>>>   we do
> >>>>>>>>>      >>>>   >     in execbuf().
> >>>>>>>>>      >>>>   >     The important thing is that we don't want
> >>>>>>>>>one type of
> >>>>>>>>>      >>>>operation to
> >>>>>>>>>      >>>>   block
> >>>>>>>>>      >>>>   >     on the other.  If immediate binds are
> >>>>>>>>>blocking on sparse
> >>>>>>>>>      binds,
> >>>>>>>>>      >>>>   it's
> >>>>>>>>>      >>>>   >     going to cause over-synchronization issues.
> >>>>>>>>>      >>>>   >     In terms of the internal implementation, I
> >>>>>>>>>know that
> >>>>>>>>>      >>>>there's going
> >>>>>>>>>      >>>>   to be
> >>>>>>>>>      >>>>   >     a lock on the VM and that we can't actually
> >>>>>>>>>do these
> >>>>>>>>>      things in
> >>>>>>>>>      >>>>   > parallel.  That's fine. Once the dma_fences have
> >>>>>>>>>      signaled and
> >>>>>>>>>      >>>>   we're
> >>>>>>>>>      >>>>
> >>>>>>>>>      >>>>   Thats correct. It is like a single VM_BIND
> >>>>engine with
> >>>>>>>>>      >>>>multiple queues
> >>>>>>>>>      >>>>   feeding to it.
> >>>>>>>>>      >>>>
> >>>>>>>>>      >>>> Right.  As long as the queues themselves are
> >>>>>>>>>independent and
> >>>>>>>>>      >>>>can block on
> >>>>>>>>>      >>>> dma_fences without holding up other queues, I think
> >>>>>>>>>we're fine.
> >>>>>>>>>      >>>>
> >>>>>>>>>      >>>>   > unblocked to do the bind operation, I don't care if
> >>>>>>>>>      >>>>there's a bit
> >>>>>>>>>      >>>>   of
> >>>>>>>>>      >>>>   > synchronization due to locking.  That's
> >>>>>>>>>expected.  What
> >>>>>>>>>      >>>>we can't
> >>>>>>>>>      >>>>   afford
> >>>>>>>>>      >>>>   >     to have is an immediate bind operation
> >>>>>>>>>suddenly blocking
> >>>>>>>>>      on a
> >>>>>>>>>      >>>>   sparse
> >>>>>>>>>      >>>>   > operation which is blocked on a compute job
> >>>>>>>>>that's going
> >>>>>>>>>      to run
> >>>>>>>>>      >>>>   for
> >>>>>>>>>      >>>>   >     another 5ms.
> >>>>>>>>>      >>>>
> >>>>>>>>>      >>>>   As the VM_BIND queue is per VM, VM_BIND on one
> VM
> >>>>>>>>>doesn't block
> >>>>>>>>>      the
> >>>>>>>>>      >>>>   VM_BIND
> >>>>>>>>>      >>>>   on other VMs. I am not sure about usecases
> >>>>here, but just
> >>>>>>>>>      wanted to
> >>>>>>>>>      >>>>   clarify.
> >>>>>>>>>      >>>>
> >>>>>>>>>      >>>> Yes, that's what I would expect.
> >>>>>>>>>      >>>> --Jason
> >>>>>>>>>      >>>>
> >>>>>>>>>      >>>>   Niranjana
> >>>>>>>>>      >>>>
> >>>>>>>>>      >>>>   >     For reference, Windows solves this by allowing
> >>>>>>>>>      arbitrarily many
> >>>>>>>>>      >>>>   paging
> >>>>>>>>>      >>>>   >     queues (what they call a VM_BIND
> >>>>>>>>>engine/queue).  That
> >>>>>>>>>      >>>>design works
> >>>>>>>>>      >>>>   >     pretty well and solves the problems in
> >>>>>>>>>question. Again, we could
> >>>>>>>>>      >>>>   just
> >>>>>>>>>      >>>>   >     make everything out-of-order and require
> >>>>>>>>>using syncobjs
> >>>>>>>>>      >>>>to order
> >>>>>>>>>      >>>>   things
> >>>>>>>>>      >>>>   >     as userspace wants. That'd be fine too.
> >>>>>>>>>      >>>>   >     One more note while I'm here: danvet said
> >>>>>>>>>something on
> >>>>>>>>>      >>>>IRC about
> >>>>>>>>>      >>>>   VM_BIND
> >>>>>>>>>      >>>>   >     queues waiting for syncobjs to
> >>>>>>>>>materialize.  We don't
> >>>>>>>>>      really
> >>>>>>>>>      >>>>   want/need
> >>>>>>>>>      >>>>   >     this. We already have all the machinery in
> >>>>>>>>>userspace to
> >>>>>>>>>      handle
> >>>>>>>>>      >>>>   > wait-before-signal and waiting for syncobj
> >>>>>>>>>fences to
> >>>>>>>>>      >>>>materialize
> >>>>>>>>>      >>>>   and
> >>>>>>>>>      >>>>   >     that machinery is on by default.  It
> >>>>would actually
> >>>>>>>>>      >>>>take MORE work
> >>>>>>>>>      >>>>   in
> >>>>>>>>>      >>>>   >     Mesa to turn it off and take advantage of
> >>>>>>>>>the kernel
> >>>>>>>>>      >>>>being able to
> >>>>>>>>>      >>>>   wait
> >>>>>>>>>      >>>>   >     for syncobjs to materialize. Also, getting
> >>>>>>>>>that right is
> >>>>>>>>>      >>>>   ridiculously
> >>>>>>>>>      >>>>   >     hard and I really don't want to get it
> >>>>>>>>>wrong in kernel
> >>>>>>>>>      >>>>space. When we
> >>>>>>>>>      >>>>   >     do memory fences, wait-before-signal will
> >>>>>>>>>be a thing.  We
> >>>>>>>>>      don't
> >>>>>>>>>      >>>>   need to
> >>>>>>>>>      >>>>   >     try and make it a thing for syncobj.
> >>>>>>>>>      >>>>   >     --Jason
> >>>>>>>>>      >>>>   >
> >>>>>>>>>      >>>>   >   Thanks Jason,
> >>>>>>>>>      >>>>   >
> >>>>>>>>>      >>>>   >   I missed the bit in the Vulkan spec that
> >>>>>>>>>we're allowed to
> >>>>>>>>>      have a
> >>>>>>>>>      >>>>   sparse
> >>>>>>>>>      >>>>   >   queue that does not implement either graphics
> >>>>>>>>>or compute
> >>>>>>>>>      >>>>operations
> >>>>>>>>>      >>>>   :
> >>>>>>>>>      >>>>   >
> >>>>>>>>>      >>>>   >     "While some implementations may include
> >>>>>>>>>      >>>> VK_QUEUE_SPARSE_BINDING_BIT
> >>>>>>>>>      >>>>   >     support in queue families that also include
> >>>>>>>>>      >>>>   >
> >>>>>>>>>      >>>>   > graphics and compute support, other
> >>>>>>>>>implementations may
> >>>>>>>>>      only
> >>>>>>>>>      >>>>   expose a
> >>>>>>>>>      >>>>   > VK_QUEUE_SPARSE_BINDING_BIT-only queue
> >>>>>>>>>      >>>>   >
> >>>>>>>>>      >>>>   > family."
> >>>>>>>>>      >>>>   >
> >>>>>>>>>      >>>>   >   So it can all be all a vm_bind engine that
> >>>>just does
> >>>>>>>>>      bind/unbind
> >>>>>>>>>      >>>>   > operations.
> >>>>>>>>>      >>>>   >
> >>>>>>>>>      >>>>   >   But yes we need another engine for the
> >>>>>>>>>immediate/non-sparse
> >>>>>>>>>      >>>>   operations.
> >>>>>>>>>      >>>>   >
> >>>>>>>>>      >>>>   >   -Lionel
> >>>>>>>>>      >>>>   >
> >>>>>>>>>      >>>>   >         >
> >>>>>>>>>      >>>>   > Daniel, any thoughts?
> >>>>>>>>>      >>>>   >
> >>>>>>>>>      >>>>   > Niranjana
> >>>>>>>>>      >>>>   >
> >>>>>>>>>      >>>>   > >Matt
> >>>>>>>>>      >>>>   >       >
> >>>>>>>>>      >>>>   > >>
> >>>>>>>>>      >>>>   > >> Sorry I noticed this late.
> >>>>>>>>>      >>>>   > >>
> >>>>>>>>>      >>>>   > >>
> >>>>>>>>>      >>>>   > >> -Lionel
> >>>>>>>>>      >>>>   > >>
> >>>>>>>>>      >>>>   > >>
> >>>>>>>
> >>>>>>>
> >>>>>
> >
> >

^ permalink raw reply	[flat|nested] 121+ messages in thread

* RE: [Intel-gfx] [RFC v3 1/3] drm/doc/rfc: VM_BIND feature design document
  2022-06-14 21:12                                                 ` Zeng, Oak
@ 2022-06-14 21:47                                                   ` Zeng, Oak
  -1 siblings, 0 replies; 121+ messages in thread
From: Zeng, Oak @ 2022-06-14 21:47 UTC (permalink / raw)
  To: Zeng, Oak, Vishwanathapura, Niranjana, Landwerlin, Lionel G
  Cc: Intel GFX, Maling list - DRI developers, Hellstrom, Thomas,
	Wilson, Chris P, Vetter,  Daniel, Christian König



Thanks,
Oak

> -----Original Message-----
> From: dri-devel <dri-devel-bounces@lists.freedesktop.org> On Behalf Of
> Zeng, Oak
> Sent: June 14, 2022 5:13 PM
> To: Vishwanathapura, Niranjana <niranjana.vishwanathapura@intel.com>;
> Landwerlin, Lionel G <lionel.g.landwerlin@intel.com>
> Cc: Intel GFX <intel-gfx@lists.freedesktop.org>; Wilson, Chris P
> <chris.p.wilson@intel.com>; Hellstrom, Thomas
> <thomas.hellstrom@intel.com>; Maling list - DRI developers <dri-
> devel@lists.freedesktop.org>; Vetter, Daniel <daniel.vetter@intel.com>;
> Christian König <christian.koenig@amd.com>
> Subject: RE: [Intel-gfx] [RFC v3 1/3] drm/doc/rfc: VM_BIND feature design
> document
> 
> 
> 
> Thanks,
> Oak
> 
> > -----Original Message-----
> > From: Vishwanathapura, Niranjana <niranjana.vishwanathapura@intel.com>
> > Sent: June 14, 2022 1:02 PM
> > To: Landwerlin, Lionel G <lionel.g.landwerlin@intel.com>
> > Cc: Zeng, Oak <oak.zeng@intel.com>; Intel GFX <intel-
> > gfx@lists.freedesktop.org>; Maling list - DRI developers <dri-
> > devel@lists.freedesktop.org>; Hellstrom, Thomas
> > <thomas.hellstrom@intel.com>; Wilson, Chris P
> <chris.p.wilson@intel.com>;
> > Vetter, Daniel <daniel.vetter@intel.com>; Christian König
> > <christian.koenig@amd.com>
> > Subject: Re: [Intel-gfx] [RFC v3 1/3] drm/doc/rfc: VM_BIND feature design
> > document
> >
> > On Tue, Jun 14, 2022 at 10:04:00AM +0300, Lionel Landwerlin wrote:
> > >On 13/06/2022 21:02, Niranjana Vishwanathapura wrote:
> > >>On Mon, Jun 13, 2022 at 06:33:07AM -0700, Zeng, Oak wrote:
> > >>>
> > >>>
> > >>>Regards,
> > >>>Oak
> > >>>
> > >>>>-----Original Message-----
> > >>>>From: Intel-gfx <intel-gfx-bounces@lists.freedesktop.org> On
> > >>>>Behalf Of Niranjana
> > >>>>Vishwanathapura
> > >>>>Sent: June 10, 2022 1:43 PM
> > >>>>To: Landwerlin, Lionel G <lionel.g.landwerlin@intel.com>
> > >>>>Cc: Intel GFX <intel-gfx@lists.freedesktop.org>; Maling list -
> > >>>>DRI developers <dri-
> > >>>>devel@lists.freedesktop.org>; Hellstrom, Thomas
> > >>>><thomas.hellstrom@intel.com>;
> > >>>>Wilson, Chris P <chris.p.wilson@intel.com>; Vetter, Daniel
> > >>>><daniel.vetter@intel.com>; Christian König
> > <christian.koenig@amd.com>
> > >>>>Subject: Re: [Intel-gfx] [RFC v3 1/3] drm/doc/rfc: VM_BIND
> > >>>>feature design
> > >>>>document
> > >>>>
> > >>>>On Fri, Jun 10, 2022 at 11:18:14AM +0300, Lionel Landwerlin wrote:
> > >>>>>On 10/06/2022 10:54, Niranjana Vishwanathapura wrote:
> > >>>>>>On Fri, Jun 10, 2022 at 09:53:24AM +0300, Lionel Landwerlin wrote:
> > >>>>>>>On 09/06/2022 22:31, Niranjana Vishwanathapura wrote:
> > >>>>>>>>On Thu, Jun 09, 2022 at 05:49:09PM +0300, Lionel Landwerlin
> wrote:
> > >>>>>>>>>  On 09/06/2022 00:55, Jason Ekstrand wrote:
> > >>>>>>>>>
> > >>>>>>>>>    On Wed, Jun 8, 2022 at 4:44 PM Niranjana Vishwanathapura
> > >>>>>>>>> <niranjana.vishwanathapura@intel.com> wrote:
> > >>>>>>>>>
> > >>>>>>>>>      On Wed, Jun 08, 2022 at 08:33:25AM +0100, Tvrtko
> > >>>>Ursulin wrote:
> > >>>>>>>>>      >
> > >>>>>>>>>      >
> > >>>>>>>>>      >On 07/06/2022 22:32, Niranjana Vishwanathapura wrote:
> > >>>>>>>>>      >>On Tue, Jun 07, 2022 at 11:18:11AM -0700, Niranjana
> > >>>>>>>>>Vishwanathapura
> > >>>>>>>>>      wrote:
> > >>>>>>>>>      >>>On Tue, Jun 07, 2022 at 12:12:03PM -0500, Jason
> > >>>>>>>>>Ekstrand wrote:
> > >>>>>>>>>      >>>> On Fri, Jun 3, 2022 at 6:52 PM Niranjana
> > >>>>Vishwanathapura
> > >>>>>>>>>      >>>> <niranjana.vishwanathapura@intel.com> wrote:
> > >>>>>>>>>      >>>>
> > >>>>>>>>>      >>>>   On Fri, Jun 03, 2022 at 10:20:25AM +0300, Lionel
> > >>>>>>>>>Landwerlin
> > >>>>>>>>>      wrote:
> > >>>>>>>>>      >>>>   >   On 02/06/2022 23:35, Jason Ekstrand wrote:
> > >>>>>>>>>      >>>>   >
> > >>>>>>>>>      >>>>   >     On Thu, Jun 2, 2022 at 3:11 PM Niranjana
> > >>>>>>>>>Vishwanathapura
> > >>>>>>>>>      >>>>   > <niranjana.vishwanathapura@intel.com> wrote:
> > >>>>>>>>>      >>>>   >
> > >>>>>>>>>      >>>>   >       On Wed, Jun 01, 2022 at 01:28:36PM
> > >>>>-0700, Matthew
> > >>>>>>>>>      >>>>Brost wrote:
> > >>>>>>>>>      >>>>   >       >On Wed, Jun 01, 2022 at 05:25:49PM
> > >>>>+0300, Lionel
> > >>>>>>>>>      Landwerlin
> > >>>>>>>>>      >>>>   wrote:
> > >>>>>>>>>      >>>>   > >> On 17/05/2022 21:32, Niranjana
> Vishwanathapura
> > >>>>>>>>>      wrote:
> > >>>>>>>>>      >>>>   > >> > +VM_BIND/UNBIND ioctl will immediately start
> > >>>>>>>>>      >>>>   binding/unbinding
> > >>>>>>>>>      >>>>   >       the mapping in an
> > >>>>>>>>>      >>>>   > >> > +async worker. The binding and
> > >>>>>>>>>unbinding will
> > >>>>>>>>>      >>>>work like a
> > >>>>>>>>>      >>>>   special
> > >>>>>>>>>      >>>>   >       GPU engine.
> > >>>>>>>>>      >>>>   > >> > +The binding and unbinding operations are
> > >>>>>>>>>      serialized and
> > >>>>>>>>>      >>>>   will
> > >>>>>>>>>      >>>>   >       wait on specified
> > >>>>>>>>>      >>>>   > >> > +input fences before the operation
> > >>>>>>>>>and will signal
> > >>>>>>>>>      the
> > >>>>>>>>>      >>>>   output
> > >>>>>>>>>      >>>>   >       fences upon the
> > >>>>>>>>>      >>>>   > >> > +completion of the operation. Due to
> > >>>>>>>>>      serialization,
> > >>>>>>>>>      >>>>   completion of
> > >>>>>>>>>      >>>>   >       an operation
> > >>>>>>>>>      >>>>   > >> > +will also indicate that all
> > >>>>>>>>>previous operations
> > >>>>>>>>>      >>>>are also
> > >>>>>>>>>      >>>>   > complete.
> > >>>>>>>>>      >>>>   > >>
> > >>>>>>>>>      >>>>   > >> I guess we should avoid saying "will
> > >>>>>>>>>immediately
> > >>>>>>>>>      start
> > >>>>>>>>>      >>>>   > binding/unbinding" if
> > >>>>>>>>>      >>>>   > >> there are fences involved.
> > >>>>>>>>>      >>>>   > >>
> > >>>>>>>>>      >>>>   > >> And the fact that it's happening in an async
> > >>>>>>>>>      >>>>worker seem to
> > >>>>>>>>>      >>>>   imply
> > >>>>>>>>>      >>>>   >       it's not
> > >>>>>>>>>      >>>>   > >> immediate.
> > >>>>>>>>>      >>>>   > >>
> > >>>>>>>>>      >>>>   >
> > >>>>>>>>>      >>>>   >       Ok, will fix.
> > >>>>>>>>>      >>>>   >       This was added because in earlier design
> > >>>>>>>>>binding was
> > >>>>>>>>>      deferred
> > >>>>>>>>>      >>>>   until
> > >>>>>>>>>      >>>>   >       next execbuff.
> > >>>>>>>>>      >>>>   >       But now it is non-deferred (immediate in
> > >>>>>>>>>that sense).
> > >>>>>>>>>      >>>>But yah,
> > >>>>>>>>>      >>>>   this is
> > >>>>>>>>>      >>>>   > confusing
> > >>>>>>>>>      >>>>   >       and will fix it.
> > >>>>>>>>>      >>>>   >
> > >>>>>>>>>      >>>>   > >>
> > >>>>>>>>>      >>>>   > >> I have a question on the behavior of the bind
> > >>>>>>>>>      >>>>operation when
> > >>>>>>>>>      >>>>   no
> > >>>>>>>>>      >>>>   >       input fence
> > >>>>>>>>>      >>>>   > >> is provided. Let say I do :
> > >>>>>>>>>      >>>>   > >>
> > >>>>>>>>>      >>>>   > >> VM_BIND (out_fence=fence1)
> > >>>>>>>>>      >>>>   > >>
> > >>>>>>>>>      >>>>   > >> VM_BIND (out_fence=fence2)
> > >>>>>>>>>      >>>>   > >>
> > >>>>>>>>>      >>>>   > >> VM_BIND (out_fence=fence3)
> > >>>>>>>>>      >>>>   > >>
> > >>>>>>>>>      >>>>   > >>
> > >>>>>>>>>      >>>>   > >> In what order are the fences going to
> > >>>>>>>>>be signaled?
> > >>>>>>>>>      >>>>   > >>
> > >>>>>>>>>      >>>>   > >> In the order of VM_BIND ioctls? Or out
> > >>>>>>>>>of order?
> > >>>>>>>>>      >>>>   > >>
> > >>>>>>>>>      >>>>   > >> Because you wrote "serialized" I assume
> > >>>>>>>>>it's: in
> > >>>>>>>>>      order
> > >>>>>>>>>      >>>>   > >>
> > >>>>>>>>>      >>>>   >
> > >>>>>>>>>      >>>>   >       Yes, in the order of VM_BIND/UNBIND
> > >>>>>>>>>ioctls. Note that
> > >>>>>>>>>      >>>>bind and
> > >>>>>>>>>      >>>>   unbind
> > >>>>>>>>>      >>>>   >       will use
> > >>>>>>>>>      >>>>   >       the same queue and hence are ordered.
> > >>>>>>>>>      >>>>   >
> > >>>>>>>>>      >>>>   > >>
> > >>>>>>>>>      >>>>   > >> One thing I didn't realize is that
> > >>>>>>>>>because we only
> > >>>>>>>>>      get one
> > >>>>>>>>>      >>>>   > "VM_BIND" engine,
> > >>>>>>>>>      >>>>   > >> there is a disconnect from the Vulkan
> > >>>>>>>>>specification.
> > >>>>>>>>>      >>>>   > >>
> > >>>>>>>>>      >>>>   > >> In Vulkan VM_BIND operations are
> > >>>>>>>>>serialized but
> > >>>>>>>>>      >>>>per engine.
> > >>>>>>>>>      >>>>   > >>
> > >>>>>>>>>      >>>>   > >> So you could have something like this :
> > >>>>>>>>>      >>>>   > >>
> > >>>>>>>>>      >>>>   > >> VM_BIND (engine=rcs0, in_fence=fence1,
> > >>>>>>>>>      out_fence=fence2)
> > >>>>>>>>>      >>>>   > >>
> > >>>>>>>>>      >>>>   > >> VM_BIND (engine=ccs0, in_fence=fence3,
> > >>>>>>>>>      out_fence=fence4)
> > >>>>>>>>>      >>>>   > >>
> > >>>>>>>>>      >>>>   > >>
> > >>>>>>>>>      >>>>   > >> fence1 is not signaled
> > >>>>>>>>>      >>>>   > >>
> > >>>>>>>>>      >>>>   > >> fence3 is signaled
> > >>>>>>>>>      >>>>   > >>
> > >>>>>>>>>      >>>>   > >> So the second VM_BIND will proceed before the
> > >>>>>>>>>      >>>>first VM_BIND.
> > >>>>>>>>>      >>>>   > >>
> > >>>>>>>>>      >>>>   > >>
> > >>>>>>>>>      >>>>   > >> I guess we can deal with that scenario in
> > >>>>>>>>>      >>>>userspace by doing
> > >>>>>>>>>      >>>>   the
> > >>>>>>>>>      >>>>   >       wait
> > >>>>>>>>>      >>>>   > >> ourselves in one thread per engine.
> > >>>>>>>>>      >>>>   > >>
> > >>>>>>>>>      >>>>   > >> But then it makes the VM_BIND input
> > >>>>>>>>>fences useless.
> > >>>>>>>>>      >>>>   > >>
> > >>>>>>>>>      >>>>   > >>
> > >>>>>>>>>      >>>>   > >> Daniel : what do you think? Should be
> > >>>>>>>>>rework this or
> > >>>>>>>>>      just
> > >>>>>>>>>      >>>>   deal with
> > >>>>>>>>>      >>>>   >       wait
> > >>>>>>>>>      >>>>   > >> fences in userspace?
> > >>>>>>>>>      >>>>   > >>
> > >>>>>>>>>      >>>>   >       >
> > >>>>>>>>>      >>>>   >       >My opinion is rework this but make the
> > >>>>>>>>>ordering via
> > >>>>>>>>>      >>>>an engine
> > >>>>>>>>>      >>>>   param
> > >>>>>>>>>      >>>>   > optional.
> > >>>>>>>>>      >>>>   >       >
> > >>>>>>>>>      >>>>   > >e.g. A VM can be configured so all binds
> > >>>>>>>>>are ordered
> > >>>>>>>>>      >>>>within the
> > >>>>>>>>>      >>>>   VM
> > >>>>>>>>>      >>>>   >       >
> > >>>>>>>>>      >>>>   > >e.g. A VM can be configured so all binds
> > >>>>>>>>>accept an
> > >>>>>>>>>      engine
> > >>>>>>>>>      >>>>   argument
> > >>>>>>>>>      >>>>   >       (in
> > >>>>>>>>>      >>>>   > >the case of the i915 likely this is a
> > >>>>>>>>>gem context
> > >>>>>>>>>      >>>>handle) and
> > >>>>>>>>>      >>>>   binds
> > >>>>>>>>>      >>>>   > >ordered with respect to that engine.
> > >>>>>>>>>      >>>>   >       >
> > >>>>>>>>>      >>>>   > >This gives UMDs options as the latter
> > >>>>>>>>>likely consumes
> > >>>>>>>>>      >>>>more KMD
> > >>>>>>>>>      >>>>   > resources
> > >>>>>>>>>      >>>>   >       >so if a different UMD can live with
> > >>>>binds being
> > >>>>>>>>>      >>>>ordered within
> > >>>>>>>>>      >>>>   the VM
> > >>>>>>>>>      >>>>   > >they can use a mode consuming less resources.
> > >>>>>>>>>      >>>>   >       >
> > >>>>>>>>>      >>>>   >
> > >>>>>>>>>      >>>>   >       I think we need to be careful here if we
> > >>>>>>>>>are looking
> > >>>>>>>>>      for some
> > >>>>>>>>>      >>>>   out of
> > >>>>>>>>>      >>>>   > (submission) order completion of vm_bind/unbind.
> > >>>>>>>>>      >>>>   > In-order completion means, in a batch of
> > >>>>>>>>>binds and
> > >>>>>>>>>      >>>>unbinds to be
> > >>>>>>>>>      >>>>   > completed in-order, the user only needs to specify
> > >>>>>>>>>      >>>>in-fence for the
> > >>>>>>>>>      >>>>   >       first bind/unbind call and the out-fence
> > >>>>>>>>>for the last
> > >>>>>>>>>      >>>>   bind/unbind
> > >>>>>>>>>      >>>>   >       call. Also, the VA released by an unbind
> > >>>>>>>>>call can be
> > >>>>>>>>>      >>>>re-used by
> > >>>>>>>>>      >>>>   >       any subsequent bind call in that
> > >>>>in-order batch.
> > >>>>>>>>>      >>>>   >
> > >>>>>>>>>      >>>>   >       These things will break if
> > >>>>>>>>>binding/unbinding were to
> > >>>>>>>>>      >>>>be allowed
> > >>>>>>>>>      >>>>   to
> > >>>>>>>>>      >>>>   >       go out of order (of submission) and the user
> > >>>>>>>>>needs to be
> > >>>>>>>>>      extra
> > >>>>>>>>>      >>>>   careful
> > >>>>>>>>>      >>>>   >       not to run into premature triggering of
> > >>>>>>>>>out-fence and
> > >>>>>>>>>      bind
> > >>>>>>>>>      >>>>   failing
> > >>>>>>>>>      >>>>   >       as VA is still in use etc.
> > >>>>>>>>>      >>>>   >
> > >>>>>>>>>      >>>>   >       Also, VM_BIND binds the provided
> > >>>>mapping on the
> > >>>>>>>>>      specified
> > >>>>>>>>>      >>>>   address
> > >>>>>>>>>      >>>>   >       space
> > >>>>>>>>>      >>>>   >       (VM). So, the uapi is not engine/context
> > >>>>>>>>>specific.
> > >>>>>>>>>      >>>>   >
> > >>>>>>>>>      >>>>   >       We can however add a 'queue' to the uapi
> > >>>>>>>>>which can be
> > >>>>>>>>>      >>>>one from
> > >>>>>>>>>      >>>>   the
> > >>>>>>>>>      >>>>   > pre-defined queues,
> > >>>>>>>>>      >>>>   > I915_VM_BIND_QUEUE_0
> > >>>>>>>>>      >>>>   > I915_VM_BIND_QUEUE_1
> > >>>>>>>>>      >>>>   >       ...
> > >>>>>>>>>      >>>>   > I915_VM_BIND_QUEUE_(N-1)
> > >>>>>>>>>      >>>>   >
> > >>>>>>>>>      >>>>   >       KMD will spawn an async work queue for
> > >>>>>>>>>each queue which
> > >>>>>>>>>      will
> > >>>>>>>>>      >>>>   only
> > >>>>>>>>>      >>>>   >       bind the mappings on that queue in the
> > >>>>order of
> > >>>>>>>>>      submission.
> > >>>>>>>>>      >>>>   >       User can assign a queue per engine
> > >>>>>>>>>or anything
> > >>>>>>>>>      >>>>like that.
> > >>>>>>>>>      >>>>   >
> > >>>>>>>>>      >>>>   >       But again here, the user needs to be
> > >>>>careful not to
> > >>>>>>>>>      >>>>deadlock these
> > >>>>>>>>>      >>>>   >       queues with circular dependency of fences.
> > >>>>>>>>>      >>>>   >
> > >>>>>>>>>      >>>>   >       I prefer adding this later as an
> > >>>>>>>>>extension based on
> > >>>>>>>>>      >>>>whether it
> > >>>>>>>>>      >>>>   >       is really helping with the implementation.
> > >>>>>>>>>      >>>>   >
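
To make the pre-defined queue idea above concrete, a rough sketch of what a
bind submission could look like (struct, field and ioctl names follow the RFC
drafts and are illustrative, not final uapi):

   struct drm_i915_gem_vm_bind bind = {
           .vm_id     = vm_id,                /* VM (address space) id */
           .handle    = bo_handle,            /* GEM object to map */
           .start     = va,                   /* GPU virtual address */
           .offset    = 0,                    /* offset into the object */
           .length    = bo_size,              /* bytes to map */
           .queue_idx = I915_VM_BIND_QUEUE_0, /* binds on one queue are in-order */
   };

   ioctl(fd, DRM_IOCTL_I915_GEM_VM_BIND, &bind);
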
> > >>>>>>>>>      >>>>   >     I can tell you right now that having
> > >>>>>>>>>everything on a
> > >>>>>>>>>      single
> > >>>>>>>>>      >>>>   in-order
> > >>>>>>>>>      >>>>   >     queue will not get us the perf we want.
> > >>>>>>>>>What vulkan
> > >>>>>>>>>      >>>>really wants
> > >>>>>>>>>      >>>>   is one
> > >>>>>>>>>      >>>>   >     of two things:
> > >>>>>>>>>      >>>>   >      1. No implicit ordering of VM_BIND
> > >>>>ops.  They just
> > >>>>>>>>>      happen in
> > >>>>>>>>>      >>>>   whatever
> > >>>>>>>>>      >>>>   >     their dependencies are resolved and we
> > >>>>>>>>>ensure ordering
> > >>>>>>>>>      >>>>ourselves
> > >>>>>>>>>      >>>>   by
> > >>>>>>>>>      >>>>   >     having a syncobj in the VkQueue.
> > >>>>>>>>>      >>>>   >      2. The ability to create multiple VM_BIND
> > >>>>>>>>>queues.  We
> > >>>>>>>>>      need at
> > >>>>>>>>>      >>>>   least 2
> > >>>>>>>>>      >>>>   >     but I don't see why there needs to be a
> > >>>>>>>>>limit besides
> > >>>>>>>>>      >>>>the limits
> > >>>>>>>>>      >>>>   the
> > >>>>>>>>>      >>>>   >     i915 API already has on the number of
> > >>>>>>>>>engines.  Vulkan
> > >>>>>>>>>      could
> > >>>>>>>>>      >>>>   expose
> > >>>>>>>>>      >>>>   >     multiple sparse binding queues to the
> > >>>>>>>>>client if it's not
> > >>>>>>>>>      >>>>   arbitrarily
> > >>>>>>>>>      >>>>   >     limited.
> > >>>>>>>>>      >>>>
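
Option 1 above amounts to userspace doing the ordering itself; a minimal
sketch with a timeline syncobj backing the VkQueue (bind_async() is an
illustrative stand-in for a VM_BIND call carrying an out-fence):

   #include <stdint.h>
   #include <xf86drm.h>

   uint32_t syncobj;
   uint64_t point = 0;

   drmSyncobjCreate(fd, 0, &syncobj);  /* timeline syncobj for the VkQueue */

   /* each bind signals the next timeline point when the mapping is live */
   bind_async(fd, vm_id, bo, va, syncobj, ++point);

   /* a submit that needs the mapping waits for the latest point first */
   drmSyncobjTimelineWait(fd, &syncobj, &point, 1,
                          INT64_MAX, 0, NULL);
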
> > >>>>>>>>>      >>>>   Thanks Jason, Lionel.
> > >>>>>>>>>      >>>>
> > >>>>>>>>>      >>>>   Jason, what are you referring to when you say
> > >>>>>>>>>"limits the i915
> > >>>>>>>>>      API
> > >>>>>>>>>      >>>>   already
> > >>>>>>>>>      >>>>   has on the number of engines"? I am not sure if
> > >>>>>>>>>there is such
> > >>>>>>>>>      an uapi
> > >>>>>>>>>      >>>>   today.
> > >>>>>>>>>      >>>>
> > >>>>>>>>>      >>>> There's a limit of something like 64 total engines
> > >>>>>>>>>today based on
> > >>>>>>>>>      the
> > >>>>>>>>>      >>>> number of bits we can cram into the exec flags in
> > >>>>>>>>>execbuffer2.  I
> > >>>>>>>>>      think
> > >>>>>>>>>      >>>> someone had an extended version that allowed more
> > >>>>>>>>>but I ripped it
> > >>>>>>>>>      out
> > >>>>>>>>>      >>>> because no one was using it.  Of course,
> > >>>>>>>>>execbuffer3 might not
> > >>>>>>>>>      >>>>have that
> > >>>>>>>>>      >>>> problem at all.
> > >>>>>>>>>      >>>>
> > >>>>>>>>>      >>>
> > >>>>>>>>>      >>>Thanks Jason.
> > >>>>>>>>>      >>>Ok, I am not sure which exec flag is that, but yah,
> > >>>>>>>>>execbuffer3
> > >>>>>>>>>      probably
> > >>>>>>>>>      >>>will not have this limitation. So, we need to define a
> > >>>>>>>>>      VM_BIND_MAX_QUEUE
> > >>>>>>>>>      >>>and somehow export it to user (I am thinking of
> > >>>>>>>>>embedding it in
> > >>>>>>>>>      >>>I915_PARAM_HAS_VM_BIND. bits[0]->HAS_VM_BIND,
> > >>>>bits[1-3]->'n'
> > >>>>>>>>>      meaning 2^n
> > >>>>>>>>>      >>>queues.)
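
A sketch of how userspace could decode the getparam encoding proposed above
(the param name is from this proposal, not merged uapi):

   #include <sys/ioctl.h>
   #include <drm/i915_drm.h>

   /* bits[0] -> HAS_VM_BIND, bits[1-3] -> n, meaning 2^n queues */
   static int i915_vm_bind_num_queues(int fd)
   {
           int value = 0;
           struct drm_i915_getparam gp = {
                   .param = I915_PARAM_HAS_VM_BIND, /* proposed param */
                   .value = &value,
           };

           if (ioctl(fd, DRM_IOCTL_I915_GETPARAM, &gp) || !(value & 1))
                   return 0;                 /* no VM_BIND support */

           return 1 << ((value >> 1) & 0x7); /* number of vm_bind queues */
   }
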
> > >>>>>>>>>      >>
> > >>>>>>>>>      >>Ah, I think you are talking about I915_EXEC_RING_MASK
> > >>>>>>>>>(0x3f) which
> > >>>>>>>>>      execbuf3
> > >>>>>>>>>
> > >>>>>>>>>    Yup!  That's exactly the limit I was talking about.
> > >>>>>>>>>
> > >>>>>>>>>      >>will also have. So, we can simply define in
> vm_bind/unbind
> > >>>>>>>>>      structures,
> > >>>>>>>>>      >>
> > >>>>>>>>>      >>#define I915_VM_BIND_MAX_QUEUE   64
> > >>>>>>>>>      >>        __u32 queue;
> > >>>>>>>>>      >>
> > >>>>>>>>>      >>I think that will keep things simple.
> > >>>>>>>>>      >
> > >>>>>>>>>      >Hmmm? What does the execbuf2 limit have to do with how
> > >>>>many engines
> > >>>>>>>>>      >the hardware can have? I suggest not to do that.
> > >>>>>>>>>      >
> > >>>>>>>>>      >The change which added this:
> > >>>>>>>>>      >
> > >>>>>>>>>      >       if (set.num_engines > I915_EXEC_RING_MASK + 1)
> > >>>>>>>>>      >               return -EINVAL;
> > >>>>>>>>>      >
> > >>>>>>>>>      >to context creation needs to be undone, so let users
> > >>>>>>>>>create engine
> > >>>>>>>>>      >maps with all hardware engines, and let execbuf3 access
> > >>>>>>>>>them all.
> > >>>>>>>>>      >
> > >>>>>>>>>
> > >>>>>>>>>      The earlier plan was to carry I915_EXEC_RING_MASK (0x3f) to
> > >>>>>>>>>execbuff3 also.
> > >>>>>>>>>      Hence, I was using the same limit for VM_BIND queues
> > >>>>>>>>>(64, or 65 if we
> > >>>>>>>>>      make it N+1).
> > >>>>>>>>>      But, as discussed in other thread of this RFC series, we
> > >>>>>>>>>are planning
> > >>>>>>>>>      to drop this I915_EXEC_RING_MASK in execbuff3. So,
> > >>>>there won't be
> > >>>>>>>>>      any uapi that limits the number of engines (and hence
> > >>>>>>>>>the vm_bind
> > >>>>>>>>>      queues
> > >>>>>>>>>      need to be supported).
> > >>>>>>>>>
> > >>>>>>>>>      If we leave the number of vm_bind queues to be
> > >>>>arbitrarily large
> > >>>>>>>>>      (__u32 queue_idx), then we need to have a hashmap for
> > >>>>>>>>>queue (a wq,
> > >>>>>>>>>      work_item and a linked list) lookup from the user
> > >>>>>>>>>specified queue
> > >>>>>>>>>      index.
> > >>>>>>>>>      Other option is to just put some hard limit (say 64 or
> > >>>>>>>>>65) and use
> > >>>>>>>>>      an array of queues in VM (each created upon first use).
> > >>>>>>>>>I prefer this.
> > >>>>>>>>>
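
For reference, the fixed-array option could look roughly like this on the
kernel side (types here are illustrative, not the actual i915 structures):

   #define I915_VM_BIND_MAX_QUEUE 64

   /* One ordered workqueue per vm_bind queue: items on a queue run in
    * submission order, while separate queues proceed independently. */
   struct i915_vm_bind_queue {
           struct workqueue_struct *wq;      /* alloc_ordered_workqueue() */
           struct list_head         pending; /* submitted bind/unbind work */
   };

   struct i915_address_space {
           /* ... */
           /* entries created lazily on first use of a given queue_idx */
           struct i915_vm_bind_queue *queues[I915_VM_BIND_MAX_QUEUE];
   };
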
> > >>>>>>>>>    I don't get why a VM_BIND queue is any different from any
> > >>>>>>>>>other queue or
> > >>>>>>>>>    userspace-visible kernel object.  But I'll leave those
> > >>>>>>>>>details up to
> > >>>>>>>>>    danvet or whoever else might be reviewing the
> > implementation.
> > >>>>>>>>>    --Jason
> > >>>>>>>>>
> > >>>>>>>>>  I kind of agree here. Wouldn't it be simpler to have the bind
> > >>>>>>>>>queue created
> > >>>>>>>>>  like the others when we build the engine map?
> > >>>>>>>>>
> > >>>>>>>>>  For userspace it's then just a matter of selecting the right
> > >>>>>>>>>queue ID when
> > >>>>>>>>>  submitting.
> > >>>>>>>>>
> > >>>>>>>>>  If there is ever a possibility to have this work on the GPU,
> > >>>>>>>>>it would be
> > >>>>>>>>>  all ready.
> > >>>>>>>>>
> > >>>>>>>>
> > >>>>>>>>I did sync offline with Matt Brost on this.
> > >>>>>>>>We can add a VM_BIND engine class and let user create
> VM_BIND
> > >>>>>>>>engines (queues).
> > >>>>>>>>The problem is, in i915 the engine creation interface is bound to
> > >>>>>>>>gem_context.
> > >>>>>>>>So, in vm_bind ioctl, we would need both context_id and
> > >>>>>>>>queue_idx for proper
> > >>>>>>>>lookup of the user created engine. This is a bit awkward as
> > >>>>vm_bind is an
> > >>>>>>>>interface to VM (address space) and has nothing to do with
> > >>>>gem_context.
> > >>>>>>>
> > >>>>>>>
> > >>>>>>>A gem_context has a single vm object right?
> > >>>>>>>
> > >>>>>>>Set through I915_CONTEXT_PARAM_VM at creation or given a
> > default
> > >>>>>>>one if not.
> > >>>>>>>
> > >>>>>>>So it's just like picking up the vm like it's done at execbuffer
> > >>>>>>>time right now : eb->context->vm
> > >>>>>>>
> > >>>>>>
> > >>>>>>Are you suggesting replacing 'vm_id' with 'context_id' in the
> > >>>>>>VM_BIND/UNBIND
> > >>>>>>ioctl and probably call it CONTEXT_BIND/UNBIND, because VM can
> > be
> > >>>>>>obtained
> > >>>>>>from the context?
> > >>>>>
> > >>>>>
> > >>>>>Yes, because if we go for engines, they're associated with a context
> > >>>>>and so also associated with the VM bound to the context.
> > >>>>>
> > >>>>
> > >>>>Hmm...context doesn't sound like the right interface. It should be
> > >>>>VM and engine (independent of context). Engine can be virtual or soft
> > >>>>engine (kernel thread), each with its own queue. We can add an
> > >>>>interface
> > >>>>to create such engines (independent of context). But we are anyway
> > >>>>implicitly creating it when user uses a new queue_idx. If in future
> > >>>>we have hardware engines for VM_BIND operation, we can have that
> > >>>>explicit interface to create engine instances and the queue_index
> > >>>>in vm_bind/unbind will point to those engines.
> > >>>>Anyone has any thoughts? Daniel?
> > >>>
> > >>>Exposing gem_context or intel_context to user space is a strange
> > >>>concept to me. A context represents some hw resources that are used
> > >>>to complete a certain task. User space should only care about allocating
> > >>>some resources (memory, queues) and submitting tasks to queues. But user
> > >>>space doesn't care how a certain task is mapped to a HW context -
> > >>>driver/guc should take care of this.
> > >>>
> > >>>So a cleaner interface to me is: user space creates a vm, creates a
> > >>>gem object, vm_binds it to the vm; allocates queues (these internally
> > >>>represent compute or blitter HW; a queue can be virtual to the user) for
> > >>>this vm; submits tasks to queues. User can create multiple queues
> > >>>under one vm. One queue is only for one vm.
> > >>>
> > >>>The i915 driver/guc manages the hw compute or blitter resources, which
> > >>>is transparent to user space. When i915 or guc decides to schedule
> > >>>a queue (run tasks on that queue), a HW engine will be picked up and
> > >>>set up properly for the vm of that queue (i.e., switch to the page
> > >>>tables of that vm) - this is a context switch.
> > >>>
> > >>>From the vm_bind perspective, it simply binds a gem_object to a vm.
> > >>>Engine/queue is not a parameter to vm_bind, as any engine can be
> > >>>picked up by i915/guc to execute a task using the vm-bound va.
> > >>>
> > >>>I didn't completely follow the discussion here. Just share some
> > >>>thoughts.
> > >>>
> > >>
> > >>Yah, I agree.
> > >>
> > >>Lionel,
> > >>How about we define the queue as
> > >>union {
> > >>       __u32 queue_idx;
> > >>       __u64 rsvd;
> > >>}
> > >>
> > >>If required, we can extend by expanding the 'rsvd' field to <ctx_id,
> > >>queue_idx> later
> > >>with a flag.
> > >>
> > >>Niranjana
> > >
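
Spelled out, that extensible layout would be something like (the surrounding
fields follow the RFC drafts; names are illustrative, not final uapi):

   struct drm_i915_gem_vm_bind {
           __u32 vm_id;              /* VM to bind into */
           __u32 handle;             /* GEM object handle */
           __u64 start;              /* GPU virtual address */
           __u64 offset;             /* offset into the object */
           __u64 length;             /* mapping size */
           union {
                   __u32 queue_idx;  /* per-VM bind queue, for now */
                   __u64 rsvd;       /* room to grow into a <ctx_id,
                                      * queue_idx> pair, gated by a flag */
           };
           __u64 flags;
           __u64 extensions;
   };
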
> > >
> > >I did not really understand Oak's comment nor what you're suggesting
> > >here to be honest.
> > >
> > >
> > >First the GEM context is already exposed to userspace. It's explicitly
> > >created by userspace with DRM_IOCTL_I915_GEM_CONTEXT_CREATE.
> > >
> > >We give the GEM context id in every execbuffer we do with
> > >drm_i915_gem_execbuffer2::rsvd1.
> > >
> > >It's still in the new execbuffer3 proposal being discussed.
> > >
> > >
> > >Second, the GEM context is also where we set the VM with
> > >I915_CONTEXT_PARAM_VM.
> > >
> > >
> > >Third, the GEM context also has the list of engines with
> > >I915_CONTEXT_PARAM_ENGINES.
> > >
> >
> > Yes, the execbuf and engine map creation are tied to gem_context.
> > (which probably is not the best interface.)
> >
> > >
> > >So it makes sense to me to dispatch the vm_bind operation to a GEM
> > >context, to a given vm_bind queue, because it's got all the
> > >information required :
> > >
> > >    - the list of new vm_bind queues
> > >
> > >    - the vm that is going to be modified
> > >
> >
> > But the operation is performed here on the address space (VM) which
> > can have multiple gem_contexts referring to it. So, VM is the right
> > interface here. We need not 'gem_context'ify it.
> >
> > All we need is multiple queue support for the address space (VM).
> > Going to gem_context for that just because we have engine creation
> > support there seems unnecessary and not correct to me.
> >
> > >
> > >Otherwise where do the vm_bind queues live?
> > >
> > >In the i915/drm fd object?
> > >
> > >That would mean that all the GEM contexts are sharing the same vm_bind
> > >queues.
> > >
> >
> > Not all, only the gem contexts that are using the same address space (VM).
> > But to me the right way to describe it would be that "VM will be using those
> > queues".
> 
> 
> I hope by "queue" here you mean a HW resource that will later be used to
> execute the job, for example a ccs compute engine. Of course queue can be
> virtual so user can create more queues than what hw physically has.
> 
> To express the concept of "VM will be using those queues", I think it makes
> sense to have a create_queue(vm) function taking a vm parameter. This
> means this queue is created for the purpose of submitting jobs under this VM.
> Later on, we can submit job (referring to objects vm_bound to the same vm)
> to the queue. The vm_bind ioctl doesn’t need to have a queue parameter, just
> vm_bind (object, va, vm).
> 
> I hope the "queue" here is not the engine used to perform the vm_bind
> operation itself. But if you meant a queue/engine to perform vm_bind itself
> (vs a queue/engine for later job submission), then we can discuss more. I
> know the xe driver has a similar concept and I think aligning the design
> early can benefit the migration to the xe driver.

Oops, I read more on this thread and it turned out the vm_bind queue here is actually used to perform vm bind/unbind operations. The XE driver has a similar concept (except it is called engine_id there). So having a queue_idx parameter is closer to the xe design.

That said, I still feel having a queue_idx parameter to vm_bind is a bit awkward. Vm_bind can be performed without any GPU engines, i.e., the CPU itself can complete a vm_bind as long as the CPU has access to the gpu's local memory. So the queue here has to be a virtual concept - it doesn't have a hard mapping to a GPU blitter engine.

Can someone summarize what the benefit of the queue_idx parameter is? Is it for the purpose of ordering vm_bind against later gpu jobs?

> 
> Regards,
> Oak
> 
> >
> > Niranjana
> >
> > >
> > >intel_context or GuC are internal details we're not concerned about.
> > >
> > >I don't really see the connection with the GEM context.
> > >
> > >
> > >Maybe Oak has a different use case than Vulkan.
> > >
> > >
> > >-Lionel
> > >
> > >
> > >>
> > >>>Regards,
> > >>>Oak
> > >>>
> > >>>>
> > >>>>Niranjana
> > >>>>
> > >>>>>
> > >>>>>>I think the interface is clean as an interface to the VM. It is
> > >>>>only that we
> > >>>>>>don't have a clean way to create a raw VM_BIND engine (not
> > >>>>>>associated with
> > >>>>>>any context) with i915 uapi.
> > >>>>>>May be we can add such an interface, but I don't think that is
> > >>>>worth it
> > >>>>>>(we might as well just use a queue_idx in VM_BIND/UNBIND ioctl
> as I
> > >>>>>>mentioned
> > >>>>>>above).
> > >>>>>>Anyone has any thoughts?
> > >>>>>>
> > >>>>>>>
> > >>>>>>>>Another problem is, if two VMs are binding with the same
> defined
> > >>>>>>>>engine,
> > >>>>>>>>binding on VM1 can get unnecessarily blocked by binding on VM2
> > >>>>>>>>(which may be
> > >>>>>>>>waiting on its in_fence).
> > >>>>>>>
> > >>>>>>>
> > >>>>>>>Maybe I'm missing something, but how can you have 2 vm objects
> > >>>>>>>with a single gem_context right now?
> > >>>>>>>
> > >>>>>>
> > >>>>>>No, we don't have 2 VMs for a gem_context.
> > >>>>>>Say if ctx1 with vm1 and ctx2 with vm2.
> > >>>>>>First vm_bind call was for vm1 with q_idx 1 in ctx1 engine map.
> > >>>>>>Second vm_bind call was for vm2 with q_idx 2 in ctx2 engine map. If
> > >>>>>>those two queue indices point to the same underlying vm_bind
> engine,
> > >>>>>>then the second vm_bind call gets blocked until the first
> > >>>>vm_bind call's
> > >>>>>>'in' fence is triggered and bind completes.
> > >>>>>>
> > >>>>>>With per VM queues, this is not a problem as two VMs will not
> end up
> > >>>>>>sharing the same queue.
> > >>>>>>
> > >>>>>>BTW, I just posted an updated PATCH series.
> > >>>>>>https://www.spinics.net/lists/dri-devel/msg350483.html
> > >>>>>>
> > >>>>>>Niranjana
> > >>>>>>
> > >>>>>>>
> > >>>>>>>>
> > >>>>>>>>So, my preference here is to just add a 'u32 queue' index in
> > >>>>>>>>vm_bind/unbind
> > >>>>>>>>ioctl, and the queues are per VM.
> > >>>>>>>>
> > >>>>>>>>Niranjana
> > >>>>>>>>
> > >>>>>>>>>  Thanks,
> > >>>>>>>>>
> > >>>>>>>>>  -Lionel
> > >>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>>      Niranjana
> > >>>>>>>>>
> > >>>>>>>>>      >Regards,
> > >>>>>>>>>      >
> > >>>>>>>>>      >Tvrtko
> > >>>>>>>>>      >
> > >>>>>>>>>      >>
> > >>>>>>>>>      >>Niranjana
> > >>>>>>>>>      >>
> > >>>>>>>>>      >>>
> > >>>>>>>>>      >>>>   I am trying to see how many queues we need and
> > >>>>>>>>>don't want it to
> > >>>>>>>>>      be
> > >>>>>>>>>      >>>>   arbitrarily
> > >>>>>>>>>      >>>>   large and unduly blow up memory usage and
> > >>>>>>>>>complexity in i915
> > >>>>>>>>>      driver.
> > >>>>>>>>>      >>>>
> > >>>>>>>>>      >>>> I expect a Vulkan driver to use at most 2 in the
> > >>>>>>>>>vast majority
> > >>>>>>>>>      >>>>of cases. I
> > >>>>>>>>>      >>>> could imagine a client wanting to create more
> > >>>>than 1 sparse
> > >>>>>>>>>      >>>>queue in which
> > >>>>>>>>>      >>>> case, it'll be N+1 but that's unlikely. As far as
> > >>>>>>>>>complexity
> > >>>>>>>>>      >>>>goes, once
> > >>>>>>>>>      >>>> you allow two, I don't think the complexity is
> > >>>>going up by
> > >>>>>>>>>      >>>>allowing N.  As
> > >>>>>>>>>      >>>> for memory usage, creating more queues means more
> > >>>>>>>>>memory.  That's
> > >>>>>>>>>      a
> > >>>>>>>>>      >>>> trade-off that userspace can make. Again, the
> > >>>>>>>>>expected number
> > >>>>>>>>>      >>>>here is 1
> > >>>>>>>>>      >>>> or 2 in the vast majority of cases so I don't think
> > >>>>>>>>>you need to
> > >>>>>>>>>      worry.
> > >>>>>>>>>      >>>
> > >>>>>>>>>      >>>Ok, will start with n=3 meaning 8 queues.
> > >>>>>>>>>      >>>That would require us to create 8 workqueues.
> > >>>>>>>>>      >>>We can change 'n' later if required.
> > >>>>>>>>>      >>>
> > >>>>>>>>>      >>>Niranjana
> > >>>>>>>>>      >>>
> > >>>>>>>>>      >>>>
> > >>>>>>>>>      >>>>   >     Why? Because Vulkan has two basic kind of bind
> > >>>>>>>>>      >>>>operations and we
> > >>>>>>>>>      >>>>   don't
> > >>>>>>>>>      >>>>   >     want any dependencies between them:
> > >>>>>>>>>      >>>>   >      1. Immediate.  These happen right after BO
> > >>>>>>>>>creation or
> > >>>>>>>>>      >>>>maybe as
> > >>>>>>>>>      >>>>   part of
> > >>>>>>>>>      >>>>   > vkBindImageMemory() or
> > VkBindBufferMemory().  These
> > >>>>>>>>>      >>>>don't happen
> > >>>>>>>>>      >>>>   on a
> > >>>>>>>>>      >>>>   >     queue and we don't want them serialized
> > >>>>>>>>>with anything. To
> > >>>>>>>>>      >>>>   synchronize
> > >>>>>>>>>      >>>>   >     with submit, we'll have a syncobj in the
> > >>>>>>>>>VkDevice which
> > >>>>>>>>>      is
> > >>>>>>>>>      >>>>   signaled by
> > >>>>>>>>>      >>>>   >     all immediate bind operations and make
> > >>>>>>>>>submits wait on
> > >>>>>>>>>      it.
> > >>>>>>>>>      >>>>   >      2. Queued (sparse): These happen on a
> > >>>>>>>>>VkQueue which may
> > >>>>>>>>>      be the
> > >>>>>>>>>      >>>>   same as
> > >>>>>>>>>      >>>>   >     a render/compute queue or may be its own
> > >>>>>>>>>queue.  It's up
> > >>>>>>>>>      to us
> > >>>>>>>>>      >>>>   what we
> > >>>>>>>>>      >>>>   >     want to advertise.  From the Vulkan API
> > >>>>>>>>>PoV, this is like
> > >>>>>>>>>      any
> > >>>>>>>>>      >>>>   other
> > >>>>>>>>>      >>>>   >     queue. Operations on it wait on and signal
> > >>>>>>>>>semaphores. If we
> > >>>>>>>>>      >>>>   have a
> > >>>>>>>>>      >>>>   >     VM_BIND engine, we'd provide syncobjs to
> > >>>>wait and
> > >>>>>>>>>      >>>>signal just like
> > >>>>>>>>>      >>>>   we do
> > >>>>>>>>>      >>>>   >     in execbuf().
> > >>>>>>>>>      >>>>   >     The important thing is that we don't want
> > >>>>>>>>>one type of
> > >>>>>>>>>      >>>>operation to
> > >>>>>>>>>      >>>>   block
> > >>>>>>>>>      >>>>   >     on the other.  If immediate binds are
> > >>>>>>>>>blocking on sparse
> > >>>>>>>>>      binds,
> > >>>>>>>>>      >>>>   it's
> > >>>>>>>>>      >>>>   >     going to cause over-synchronization issues.
> > >>>>>>>>>      >>>>   >     In terms of the internal implementation, I
> > >>>>>>>>>know that
> > >>>>>>>>>      >>>>there's going
> > >>>>>>>>>      >>>>   to be
> > >>>>>>>>>      >>>>   >     a lock on the VM and that we can't actually
> > >>>>>>>>>do these
> > >>>>>>>>>      things in
> > >>>>>>>>>      >>>>   > parallel.  That's fine. Once the dma_fences have
> > >>>>>>>>>      signaled and
> > >>>>>>>>>      >>>>   we're
> > >>>>>>>>>      >>>>
> > >>>>>>>>>      >>>>   That's correct. It is like a single VM_BIND
> > >>>>engine with
> > >>>>>>>>>      >>>>multiple queues
> > >>>>>>>>>      >>>>   feeding to it.
> > >>>>>>>>>      >>>>
> > >>>>>>>>>      >>>> Right.  As long as the queues themselves are
> > >>>>>>>>>independent and
> > >>>>>>>>>      >>>>can block on
> > >>>>>>>>>      >>>> dma_fences without holding up other queues, I think
> > >>>>>>>>>we're fine.
> > >>>>>>>>>      >>>>
> > >>>>>>>>>      >>>>   > unblocked to do the bind operation, I don't care if
> > >>>>>>>>>      >>>>there's a bit
> > >>>>>>>>>      >>>>   of
> > >>>>>>>>>      >>>>   > synchronization due to locking.  That's
> > >>>>>>>>>expected.  What
> > >>>>>>>>>      >>>>we can't
> > >>>>>>>>>      >>>>   afford
> > >>>>>>>>>      >>>>   >     to have is an immediate bind operation
> > >>>>>>>>>suddenly blocking
> > >>>>>>>>>      on a
> > >>>>>>>>>      >>>>   sparse
> > >>>>>>>>>      >>>>   > operation which is blocked on a compute job
> > >>>>>>>>>that's going
> > >>>>>>>>>      to run
> > >>>>>>>>>      >>>>   for
> > >>>>>>>>>      >>>>   >     another 5ms.
> > >>>>>>>>>      >>>>
> > >>>>>>>>>      >>>>   As the VM_BIND queue is per VM, VM_BIND on one
> > VM
> > >>>>>>>>>doesn't block
> > >>>>>>>>>      the
> > >>>>>>>>>      >>>>   VM_BIND
> > >>>>>>>>>      >>>>   on other VMs. I am not sure about use cases
> > >>>>here, but just
> > >>>>>>>>>      wanted to
> > >>>>>>>>>      >>>>   clarify.
> > >>>>>>>>>      >>>>
> > >>>>>>>>>      >>>> Yes, that's what I would expect.
> > >>>>>>>>>      >>>> --Jason
> > >>>>>>>>>      >>>>
> > >>>>>>>>>      >>>>   Niranjana
> > >>>>>>>>>      >>>>
> > >>>>>>>>>      >>>>   >     For reference, Windows solves this by allowing
> > >>>>>>>>>      arbitrarily many
> > >>>>>>>>>      >>>>   paging
> > >>>>>>>>>      >>>>   >     queues (what they call a VM_BIND
> > >>>>>>>>>engine/queue).  That
> > >>>>>>>>>      >>>>design works
> > >>>>>>>>>      >>>>   >     pretty well and solves the problems in
> > >>>>>>>>>question.       >>>>Again, we could
> > >>>>>>>>>      >>>>   just
> > >>>>>>>>>      >>>>   >     make everything out-of-order and require
> > >>>>>>>>>using syncobjs
> > >>>>>>>>>      >>>>to order
> > >>>>>>>>>      >>>>   things
> > >>>>>>>>>      >>>>   >     as userspace wants. That'd be fine too.
> > >>>>>>>>>      >>>>   >     One more note while I'm here: danvet said
> > >>>>>>>>>something on
> > >>>>>>>>>      >>>>IRC about
> > >>>>>>>>>      >>>>   VM_BIND
> > >>>>>>>>>      >>>>   >     queues waiting for syncobjs to
> > >>>>>>>>>materialize.  We don't
> > >>>>>>>>>      really
> > >>>>>>>>>      >>>>   want/need
> > >>>>>>>>>      >>>>   >     this. We already have all the machinery in
> > >>>>>>>>>userspace to
> > >>>>>>>>>      handle
> > >>>>>>>>>      >>>>   > wait-before-signal and waiting for syncobj
> > >>>>>>>>>fences to
> > >>>>>>>>>      >>>>materialize
> > >>>>>>>>>      >>>>   and
> > >>>>>>>>>      >>>>   >     that machinery is on by default.  It
> > >>>>would actually
> > >>>>>>>>>      >>>>take MORE work
> > >>>>>>>>>      >>>>   in
> > >>>>>>>>>      >>>>   >     Mesa to turn it off and take advantage of
> > >>>>>>>>>the kernel
> > >>>>>>>>>      >>>>being able to
> > >>>>>>>>>      >>>>   wait
> > >>>>>>>>>      >>>>   >     for syncobjs to materialize. Also, getting
> > >>>>>>>>>that right is
> > >>>>>>>>>      >>>>   ridiculously
> > >>>>>>>>>      >>>>   >     hard and I really don't want to get it
> > >>>>>>>>>wrong in kernel
> > >>>>>>>>>      >>>>space. When we
> > >>>>>>>>>      >>>>   >     do memory fences, wait-before-signal will
> > >>>>>>>>>be a thing.  We
> > >>>>>>>>>      don't
> > >>>>>>>>>      >>>>   need to
> > >>>>>>>>>      >>>>   >     try and make it a thing for syncobj.
> > >>>>>>>>>      >>>>   >     --Jason
> > >>>>>>>>>      >>>>   >
> > >>>>>>>>>      >>>>   >   Thanks Jason,
> > >>>>>>>>>      >>>>   >
> > >>>>>>>>>      >>>>   >   I missed the bit in the Vulkan spec that
> > >>>>>>>>>we're allowed to
> > >>>>>>>>>      have a
> > >>>>>>>>>      >>>>   sparse
> > >>>>>>>>>      >>>>   >   queue that does not implement either graphics
> > >>>>>>>>>or compute
> > >>>>>>>>>      >>>>operations
> > >>>>>>>>>      >>>>   :
> > >>>>>>>>>      >>>>   >
> > >>>>>>>>>      >>>>   >     "While some implementations may include
> > >>>>>>>>>      >>>> VK_QUEUE_SPARSE_BINDING_BIT
> > >>>>>>>>>      >>>>   >     support in queue families that also include
> > >>>>>>>>>      >>>>   >
> > >>>>>>>>>      >>>>   > graphics and compute support, other
> > >>>>>>>>>implementations may
> > >>>>>>>>>      only
> > >>>>>>>>>      >>>>   expose a
> > >>>>>>>>>      >>>>   > VK_QUEUE_SPARSE_BINDING_BIT-only queue
> > >>>>>>>>>      >>>>   >
> > >>>>>>>>>      >>>>   > family."
> > >>>>>>>>>      >>>>   >
> > >>>>>>>>>      >>>>   >   So it can all be all a vm_bind engine that
> > >>>>just does
> > >>>>>>>>>      bind/unbind
> > >>>>>>>>>      >>>>   > operations.
> > >>>>>>>>>      >>>>   >
> > >>>>>>>>>      >>>>   >   But yes we need another engine for the
> > >>>>>>>>>immediate/non-sparse
> > >>>>>>>>>      >>>>   operations.
> > >>>>>>>>>      >>>>   >
> > >>>>>>>>>      >>>>   >   -Lionel
> > >>>>>>>>>      >>>>   >
> > >>>>>>>>>      >>>>   >         >
> > >>>>>>>>>      >>>>   > Daniel, any thoughts?
> > >>>>>>>>>      >>>>   >
> > >>>>>>>>>      >>>>   > Niranjana
> > >>>>>>>>>      >>>>   >
> > >>>>>>>>>      >>>>   > >Matt
> > >>>>>>>>>      >>>>   >       >
> > >>>>>>>>>      >>>>   > >>
> > >>>>>>>>>      >>>>   > >> Sorry I noticed this late.
> > >>>>>>>>>      >>>>   > >>
> > >>>>>>>>>      >>>>   > >>
> > >>>>>>>>>      >>>>   > >> -Lionel
> > >>>>>>>>>      >>>>   > >>
> > >>>>>>>>>      >>>>   > >>
> > >>>>>>>
> > >>>>>>>
> > >>>>>
> > >
> > >

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [Intel-gfx] [RFC v3 1/3] drm/doc/rfc: VM_BIND feature design document
@ 2022-06-14 21:47                                                   ` Zeng, Oak
  0 siblings, 0 replies; 121+ messages in thread
From: Zeng, Oak @ 2022-06-14 21:47 UTC (permalink / raw)
  To: Zeng, Oak, Vishwanathapura, Niranjana, Landwerlin, Lionel G
  Cc: Intel GFX, Maling list - DRI developers, Hellstrom, Thomas,
	Wilson, Chris P, Vetter,  Daniel, Christian König



Thanks,
Oak

> -----Original Message-----
> From: dri-devel <dri-devel-bounces@lists.freedesktop.org> On Behalf Of
> Zeng, Oak
> Sent: June 14, 2022 5:13 PM
> To: Vishwanathapura, Niranjana <niranjana.vishwanathapura@intel.com>;
> Landwerlin, Lionel G <lionel.g.landwerlin@intel.com>
> Cc: Intel GFX <intel-gfx@lists.freedesktop.org>; Wilson, Chris P
> <chris.p.wilson@intel.com>; Hellstrom, Thomas
> <thomas.hellstrom@intel.com>; Maling list - DRI developers <dri-
> devel@lists.freedesktop.org>; Vetter, Daniel <daniel.vetter@intel.com>;
> Christian König <christian.koenig@amd.com>
> Subject: RE: [Intel-gfx] [RFC v3 1/3] drm/doc/rfc: VM_BIND feature design
> document
> 
> 
> 
> Thanks,
> Oak
> 
> > -----Original Message-----
> > From: Vishwanathapura, Niranjana <niranjana.vishwanathapura@intel.com>
> > Sent: June 14, 2022 1:02 PM
> > To: Landwerlin, Lionel G <lionel.g.landwerlin@intel.com>
> > Cc: Zeng, Oak <oak.zeng@intel.com>; Intel GFX <intel-
> > gfx@lists.freedesktop.org>; Maling list - DRI developers <dri-
> > devel@lists.freedesktop.org>; Hellstrom, Thomas
> > <thomas.hellstrom@intel.com>; Wilson, Chris P
> <chris.p.wilson@intel.com>;
> > Vetter, Daniel <daniel.vetter@intel.com>; Christian König
> > <christian.koenig@amd.com>
> > Subject: Re: [Intel-gfx] [RFC v3 1/3] drm/doc/rfc: VM_BIND feature design
> > document
> >
> > On Tue, Jun 14, 2022 at 10:04:00AM +0300, Lionel Landwerlin wrote:
> > >On 13/06/2022 21:02, Niranjana Vishwanathapura wrote:
> > >>On Mon, Jun 13, 2022 at 06:33:07AM -0700, Zeng, Oak wrote:
> > >>>
> > >>>
> > >>>Regards,
> > >>>Oak
> > >>>
> > >>>>-----Original Message-----
> > >>>>From: Intel-gfx <intel-gfx-bounces@lists.freedesktop.org> On
> > >>>>Behalf Of Niranjana
> > >>>>Vishwanathapura
> > >>>>Sent: June 10, 2022 1:43 PM
> > >>>>To: Landwerlin, Lionel G <lionel.g.landwerlin@intel.com>
> > >>>>Cc: Intel GFX <intel-gfx@lists.freedesktop.org>; Maling list -
> > >>>>DRI developers <dri-
> > >>>>devel@lists.freedesktop.org>; Hellstrom, Thomas
> > >>>><thomas.hellstrom@intel.com>;
> > >>>>Wilson, Chris P <chris.p.wilson@intel.com>; Vetter, Daniel
> > >>>><daniel.vetter@intel.com>; Christian König
> > <christian.koenig@amd.com>
> > >>>>Subject: Re: [Intel-gfx] [RFC v3 1/3] drm/doc/rfc: VM_BIND
> > >>>>feature design
> > >>>>document
> > >>>>
> > >>>>On Fri, Jun 10, 2022 at 11:18:14AM +0300, Lionel Landwerlin wrote:
> > >>>>>On 10/06/2022 10:54, Niranjana Vishwanathapura wrote:
> > >>>>>>On Fri, Jun 10, 2022 at 09:53:24AM +0300, Lionel Landwerlin wrote:
> > >>>>>>>On 09/06/2022 22:31, Niranjana Vishwanathapura wrote:
> > >>>>>>>>On Thu, Jun 09, 2022 at 05:49:09PM +0300, Lionel Landwerlin
> wrote:
> > >>>>>>>>>  On 09/06/2022 00:55, Jason Ekstrand wrote:
> > >>>>>>>>>
> > >>>>>>>>>    On Wed, Jun 8, 2022 at 4:44 PM Niranjana Vishwanathapura
> > >>>>>>>>> <niranjana.vishwanathapura@intel.com> wrote:
> > >>>>>>>>>
> > >>>>>>>>>      On Wed, Jun 08, 2022 at 08:33:25AM +0100, Tvrtko
> > >>>>Ursulin wrote:
> > >>>>>>>>>      >
> > >>>>>>>>>      >
> > >>>>>>>>>      >On 07/06/2022 22:32, Niranjana Vishwanathapura wrote:
> > >>>>>>>>>      >>On Tue, Jun 07, 2022 at 11:18:11AM -0700, Niranjana
> > >>>>>>>>>Vishwanathapura
> > >>>>>>>>>      wrote:
> > >>>>>>>>>      >>>On Tue, Jun 07, 2022 at 12:12:03PM -0500, Jason
> > >>>>>>>>>Ekstrand wrote:
> > >>>>>>>>>      >>>> On Fri, Jun 3, 2022 at 6:52 PM Niranjana
> > >>>>Vishwanathapura
> > >>>>>>>>>      >>>> <niranjana.vishwanathapura@intel.com> wrote:
> > >>>>>>>>>      >>>>
> > >>>>>>>>>      >>>>   On Fri, Jun 03, 2022 at 10:20:25AM +0300, Lionel
> > >>>>>>>>>Landwerlin
> > >>>>>>>>>      wrote:
> > >>>>>>>>>      >>>>   >   On 02/06/2022 23:35, Jason Ekstrand wrote:
> > >>>>>>>>>      >>>>   >
> > >>>>>>>>>      >>>>   >     On Thu, Jun 2, 2022 at 3:11 PM Niranjana
> > >>>>>>>>>Vishwanathapura
> > >>>>>>>>>      >>>>   > <niranjana.vishwanathapura@intel.com> wrote:
> > >>>>>>>>>      >>>>   >
> > >>>>>>>>>      >>>>   >       On Wed, Jun 01, 2022 at 01:28:36PM
> > >>>>-0700, Matthew
> > >>>>>>>>>      >>>>Brost wrote:
> > >>>>>>>>>      >>>>   >       >On Wed, Jun 01, 2022 at 05:25:49PM
> > >>>>+0300, Lionel
> > >>>>>>>>>      Landwerlin
> > >>>>>>>>>      >>>>   wrote:
> > >>>>>>>>>      >>>>   > >> On 17/05/2022 21:32, Niranjana
> Vishwanathapura
> > >>>>>>>>>      wrote:
> > >>>>>>>>>      >>>>   > >> > +VM_BIND/UNBIND ioctl will immediately start
> > >>>>>>>>>      >>>>   binding/unbinding
> > >>>>>>>>>      >>>>   >       the mapping in an
> > >>>>>>>>>      >>>>   > >> > +async worker. The binding and
> > >>>>>>>>>unbinding will
> > >>>>>>>>>      >>>>work like a
> > >>>>>>>>>      >>>>   special
> > >>>>>>>>>      >>>>   >       GPU engine.
> > >>>>>>>>>      >>>>   > >> > +The binding and unbinding operations are
> > >>>>>>>>>      serialized and
> > >>>>>>>>>      >>>>   will
> > >>>>>>>>>      >>>>   >       wait on specified
> > >>>>>>>>>      >>>>   > >> > +input fences before the operation
> > >>>>>>>>>and will signal
> > >>>>>>>>>      the
> > >>>>>>>>>      >>>>   output
> > >>>>>>>>>      >>>>   >       fences upon the
> > >>>>>>>>>      >>>>   > >> > +completion of the operation. Due to
> > >>>>>>>>>      serialization,
> > >>>>>>>>>      >>>>   completion of
> > >>>>>>>>>      >>>>   >       an operation
> > >>>>>>>>>      >>>>   > >> > +will also indicate that all
> > >>>>>>>>>previous operations
> > >>>>>>>>>      >>>>are also
> > >>>>>>>>>      >>>>   > complete.
> > >>>>>>>>>      >>>>   > >>
> > >>>>>>>>>      >>>>   > >> I guess we should avoid saying "will
> > >>>>>>>>>immediately
> > >>>>>>>>>      start
> > >>>>>>>>>      >>>>   > binding/unbinding" if
> > >>>>>>>>>      >>>>   > >> there are fences involved.
> > >>>>>>>>>      >>>>   > >>
> > >>>>>>>>>      >>>>   > >> And the fact that it's happening in an async
> > >>>>>>>>>      >>>>worker seem to
> > >>>>>>>>>      >>>>   imply
> > >>>>>>>>>      >>>>   >       it's not
> > >>>>>>>>>      >>>>   > >> immediate.
> > >>>>>>>>>      >>>>   > >>
> > >>>>>>>>>      >>>>   >
> > >>>>>>>>>      >>>>   >       Ok, will fix.
> > >>>>>>>>>      >>>>   >       This was added because in the earlier design
> > >>>>>>>>>binding was
> > >>>>>>>>>      deferred
> > >>>>>>>>>      >>>>   until
> > >>>>>>>>>      >>>>   >       the next execbuff.
> > >>>>>>>>>      >>>>   >       But now it is non-deferred (immediate in
> > >>>>>>>>>that sense).
> > >>>>>>>>>      >>>>But yah,
> > >>>>>>>>>      >>>>   this is
> > >>>>>>>>>      >>>>   > confusing
> > >>>>>>>>>      >>>>   >       and will fix it.
> > >>>>>>>>>      >>>>   >
> > >>>>>>>>>      >>>>   > >>
> > >>>>>>>>>      >>>>   > >> I have a question on the behavior of the bind
> > >>>>>>>>>      >>>>operation when
> > >>>>>>>>>      >>>>   no
> > >>>>>>>>>      >>>>   >       input fence
> > >>>>>>>>>      >>>>   > >> is provided. Let say I do :
> > >>>>>>>>>      >>>>   > >>
> > >>>>>>>>>      >>>>   > >> VM_BIND (out_fence=fence1)
> > >>>>>>>>>      >>>>   > >>
> > >>>>>>>>>      >>>>   > >> VM_BIND (out_fence=fence2)
> > >>>>>>>>>      >>>>   > >>
> > >>>>>>>>>      >>>>   > >> VM_BIND (out_fence=fence3)
> > >>>>>>>>>      >>>>   > >>
> > >>>>>>>>>      >>>>   > >>
> > >>>>>>>>>      >>>>   > >> In what order are the fences going to
> > >>>>>>>>>be signaled?
> > >>>>>>>>>      >>>>   > >>
> > >>>>>>>>>      >>>>   > >> In the order of VM_BIND ioctls? Or out
> > >>>>>>>>>of order?
> > >>>>>>>>>      >>>>   > >>
> > >>>>>>>>>      >>>>   > >> Because you wrote "serialized" I assume
> > >>>>>>>>>it's: in
> > >>>>>>>>>      order
> > >>>>>>>>>      >>>>   > >>
> > >>>>>>>>>      >>>>   >
> > >>>>>>>>>      >>>>   >       Yes, in the order of VM_BIND/UNBIND
> > >>>>>>>>>ioctls. Note that
> > >>>>>>>>>      >>>>bind and
> > >>>>>>>>>      >>>>   unbind
> > >>>>>>>>>      >>>>   >       will use
> > >>>>>>>>>      >>>>   >       the same queue and hence are ordered.
> > >>>>>>>>>      >>>>   >
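
In other words, for the example above (vm_bind() is an illustrative wrapper
around the proposed ioctl, not real uapi):

   /* All three binds go to the same queue, so the out-fences signal in
    * submission order; waiting on fence3 implies fence1 and fence2. */
   vm_bind(fd, vm_id, bo1, va1, /*in_fence*/ -1, /*out_fence*/ fence1);
   vm_bind(fd, vm_id, bo2, va2, -1, fence2);
   vm_bind(fd, vm_id, bo3, va3, -1, fence3);
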
> > >>>>>>>>>      >>>>   > >>
> > >>>>>>>>>      >>>>   > >> One thing I didn't realize is that
> > >>>>>>>>>because we only
> > >>>>>>>>>      get one
> > >>>>>>>>>      >>>>   > "VM_BIND" engine,
> > >>>>>>>>>      >>>>   > >> there is a disconnect from the Vulkan
> > >>>>>>>>>specification.
> > >>>>>>>>>      >>>>   > >>
> > >>>>>>>>>      >>>>   > >> In Vulkan VM_BIND operations are
> > >>>>>>>>>serialized but
> > >>>>>>>>>      >>>>per engine.
> > >>>>>>>>>      >>>>   > >>
> > >>>>>>>>>      >>>>   > >> So you could have something like this :
> > >>>>>>>>>      >>>>   > >>
> > >>>>>>>>>      >>>>   > >> VM_BIND (engine=rcs0, in_fence=fence1,
> > >>>>>>>>>      out_fence=fence2)
> > >>>>>>>>>      >>>>   > >>
> > >>>>>>>>>      >>>>   > >> VM_BIND (engine=ccs0, in_fence=fence3,
> > >>>>>>>>>      out_fence=fence4)
> > >>>>>>>>>      >>>>   > >>
> > >>>>>>>>>      >>>>   > >>
> > >>>>>>>>>      >>>>   > >> fence1 is not signaled
> > >>>>>>>>>      >>>>   > >>
> > >>>>>>>>>      >>>>   > >> fence3 is signaled
> > >>>>>>>>>      >>>>   > >>
> > >>>>>>>>>      >>>>   > >> So the second VM_BIND will proceed before the
> > >>>>>>>>>      >>>>first VM_BIND.
> > >>>>>>>>>      >>>>   > >>
> > >>>>>>>>>      >>>>   > >>
> > >>>>>>>>>      >>>>   > >> I guess we can deal with that scenario in
> > >>>>>>>>>      >>>>userspace by doing
> > >>>>>>>>>      >>>>   the
> > >>>>>>>>>      >>>>   >       wait
> > >>>>>>>>>      >>>>   > >> ourselves in one thread per engine.
> > >>>>>>>>>      >>>>   > >>
> > >>>>>>>>>      >>>>   > >> But then it makes the VM_BIND input
> > >>>>>>>>>fences useless.
> > >>>>>>>>>      >>>>   > >>
> > >>>>>>>>>      >>>>   > >>
> > >>>>>>>>>      >>>>   > >> Daniel : what do you think? Should we
> > >>>>>>>>>rework this or
> > >>>>>>>>>      just
> > >>>>>>>>>      >>>>   deal with
> > >>>>>>>>>      >>>>   >       wait
> > >>>>>>>>>      >>>>   > >> fences in userspace?
> > >>>>>>>>>      >>>>   > >>
> > >>>>>>>>>      >>>>   >       >
> > >>>>>>>>>      >>>>   >       >My opinion is rework this but make the
> > >>>>>>>>>ordering via
> > >>>>>>>>>      >>>>an engine
> > >>>>>>>>>      >>>>   param
> > >>>>>>>>>      >>>>   > optional.
> > >>>>>>>>>      >>>>   >       >
> > >>>>>>>>>      >>>>   > >e.g. A VM can be configured so all binds
> > >>>>>>>>>are ordered
> > >>>>>>>>>      >>>>within the
> > >>>>>>>>>      >>>>   VM
> > >>>>>>>>>      >>>>   >       >
> > >>>>>>>>>      >>>>   > >e.g. A VM can be configured so all binds
> > >>>>>>>>>accept an
> > >>>>>>>>>      engine
> > >>>>>>>>>      >>>>   argument
> > >>>>>>>>>      >>>>   >       (in
> > >>>>>>>>>      >>>>   > >the case of the i915 likely this is a
> > >>>>>>>>>gem context
> > >>>>>>>>>      >>>>handle) and
> > >>>>>>>>>      >>>>   binds
> > >>>>>>>>>      >>>>   > >ordered with respect to that engine.
> > >>>>>>>>>      >>>>   >       >
> > >>>>>>>>>      >>>>   > >This gives UMDs options as the latter
> > >>>>>>>>>likely consumes
> > >>>>>>>>>      >>>>more KMD
> > >>>>>>>>>      >>>>   > resources
> > >>>>>>>>>      >>>>   >       >so if a different UMD can live with
> > >>>>binds being
> > >>>>>>>>>      >>>>ordered within
> > >>>>>>>>>      >>>>   the VM
> > >>>>>>>>>      >>>>   > >they can use a mode consuming less resources.
> > >>>>>>>>>      >>>>   >       >
> > >>>>>>>>>      >>>>   >
> > >>>>>>>>>      >>>>   >       I think we need to be careful here if we
> > >>>>>>>>>are looking
> > >>>>>>>>>      for some
> > >>>>>>>>>      >>>>   out of
> > >>>>>>>>>      >>>>   > (submission) order completion of vm_bind/unbind.
> > >>>>>>>>>      >>>>   > In-order completion means, in a batch of
> > >>>>>>>>>binds and
> > >>>>>>>>>      >>>>unbinds to be
> > >>>>>>>>>      >>>>   > completed in-order, the user only needs to specify
> > >>>>>>>>>      >>>>in-fence for the
> > >>>>>>>>>      >>>>   >       first bind/unbind call and the out-fence
> > >>>>>>>>>for the last
> > >>>>>>>>>      >>>>   bind/unbind
> > >>>>>>>>>      >>>>   >       call. Also, the VA released by an unbind
> > >>>>>>>>>call can be
> > >>>>>>>>>      >>>>re-used by
> > >>>>>>>>>      >>>>   >       any subsequent bind call in that
> > >>>>in-order batch.
> > >>>>>>>>>      >>>>   >
> > >>>>>>>>>      >>>>   >       These things will break if
> > >>>>>>>>>binding/unbinding were to
> > >>>>>>>>>      >>>>be allowed
> > >>>>>>>>>      >>>>   to
> > >>>>>>>>>      >>>>   >       go out of order (of submission) and the user
> > >>>>>>>>>needs to be
> > >>>>>>>>>      extra
> > >>>>>>>>>      >>>>   careful
> > >>>>>>>>>      >>>>   >       not to run into premature triggering of
> > >>>>>>>>>out-fence and
> > >>>>>>>>>      bind
> > >>>>>>>>>      >>>>   failing
> > >>>>>>>>>      >>>>   >       as VA is still in use etc.
> > >>>>>>>>>      >>>>   >
> > >>>>>>>>>      >>>>   >       Also, VM_BIND binds the provided
> > >>>>mapping on the
> > >>>>>>>>>      specified
> > >>>>>>>>>      >>>>   address
> > >>>>>>>>>      >>>>   >       space
> > >>>>>>>>>      >>>>   >       (VM). So, the uapi is not engine/context
> > >>>>>>>>>specific.
> > >>>>>>>>>      >>>>   >
> > >>>>>>>>>      >>>>   >       We can however add a 'queue' to the uapi
> > >>>>>>>>>which can be
> > >>>>>>>>>      >>>>one from
> > >>>>>>>>>      >>>>   the
> > >>>>>>>>>      >>>>   > pre-defined queues,
> > >>>>>>>>>      >>>>   > I915_VM_BIND_QUEUE_0
> > >>>>>>>>>      >>>>   > I915_VM_BIND_QUEUE_1
> > >>>>>>>>>      >>>>   >       ...
> > >>>>>>>>>      >>>>   > I915_VM_BIND_QUEUE_(N-1)
> > >>>>>>>>>      >>>>   >
> > >>>>>>>>>      >>>>   >       KMD will spawn an async work queue for
> > >>>>>>>>>each queue which
> > >>>>>>>>>      will
> > >>>>>>>>>      >>>>   only
> > >>>>>>>>>      >>>>   >       bind the mappings on that queue in the
> > >>>>order of
> > >>>>>>>>>      submission.
> > >>>>>>>>>      >>>>   >       User can assign a queue per engine
> > >>>>>>>>>or anything
> > >>>>>>>>>      >>>>like that.
> > >>>>>>>>>      >>>>   >
> > >>>>>>>>>      >>>>   >       But again here, the user needs to be
> > >>>>careful not to
> > >>>>>>>>>      >>>>deadlock these
> > >>>>>>>>>      >>>>   >       queues with circular dependency of fences.
> > >>>>>>>>>      >>>>   >
> > >>>>>>>>>      >>>>   >       I prefer adding this later as an
> > >>>>>>>>>extension based on
> > >>>>>>>>>      >>>>whether it
> > >>>>>>>>>      >>>>   >       is really helping with the implementation.
> > >>>>>>>>>      >>>>   >
> > >>>>>>>>>      >>>>   >     I can tell you right now that having
> > >>>>>>>>>everything on a
> > >>>>>>>>>      single
> > >>>>>>>>>      >>>>   in-order
> > >>>>>>>>>      >>>>   >     queue will not get us the perf we want.
> > >>>>>>>>>What vulkan
> > >>>>>>>>>      >>>>really wants
> > >>>>>>>>>      >>>>   is one
> > >>>>>>>>>      >>>>   >     of two things:
> > >>>>>>>>>      >>>>   >      1. No implicit ordering of VM_BIND
> > >>>>ops.  They just
> > >>>>>>>>>      happen in
> > >>>>>>>>>      >>>>   whatever
> > >>>>>>>>>      >>>>   >     their dependencies are resolved and we
> > >>>>>>>>>ensure ordering
> > >>>>>>>>>      >>>>ourselves
> > >>>>>>>>>      >>>>   by
> > >>>>>>>>>      >>>>   >     having a syncobj in the VkQueue.
> > >>>>>>>>>      >>>>   >      2. The ability to create multiple VM_BIND
> > >>>>>>>>>queues.  We
> > >>>>>>>>>      need at
> > >>>>>>>>>      >>>>   least 2
> > >>>>>>>>>      >>>>   >     but I don't see why there needs to be a
> > >>>>>>>>>limit besides
> > >>>>>>>>>      >>>>the limits
> > >>>>>>>>>      >>>>   the
> > >>>>>>>>>      >>>>   >     i915 API already has on the number of
> > >>>>>>>>>engines.  Vulkan
> > >>>>>>>>>      could
> > >>>>>>>>>      >>>>   expose
> > >>>>>>>>>      >>>>   >     multiple sparse binding queues to the
> > >>>>>>>>>client if it's not
> > >>>>>>>>>      >>>>   arbitrarily
> > >>>>>>>>>      >>>>   >     limited.
> > >>>>>>>>>      >>>>
> > >>>>>>>>>      >>>>   Thanks Jason, Lionel.
> > >>>>>>>>>      >>>>
> > >>>>>>>>>      >>>>   Jason, what are you referring to when you say
> > >>>>>>>>>"limits the i915
> > >>>>>>>>>      API
> > >>>>>>>>>      >>>>   already
> > >>>>>>>>>      >>>>   has on the number of engines"? I am not sure if
> > >>>>>>>>>there is such
> > >>>>>>>>>      an uapi
> > >>>>>>>>>      >>>>   today.
> > >>>>>>>>>      >>>>
> > >>>>>>>>>      >>>> There's a limit of something like 64 total engines
> > >>>>>>>>>today based on
> > >>>>>>>>>      the
> > >>>>>>>>>      >>>> number of bits we can cram into the exec flags in
> > >>>>>>>>>execbuffer2.  I
> > >>>>>>>>>      think
> > >>>>>>>>>      >>>> someone had an extended version that allowed more
> > >>>>>>>>>but I ripped it
> > >>>>>>>>>      out
> > >>>>>>>>>      >>>> because no one was using it.  Of course,
> > >>>>>>>>>execbuffer3 might not
> > >>>>>>>>>      >>>>have that
> > >>>>>>>>>      >>>> problem at all.
> > >>>>>>>>>      >>>>
> > >>>>>>>>>      >>>
> > >>>>>>>>>      >>>Thanks Jason.
> > >>>>>>>>>      >>>Ok, I am not sure which exec flag is that, but yah,
> > >>>>>>>>>execbuffer3
> > >>>>>>>>>      probably
> > >>>>>>>>>      >>>will not have this limitation. So, we need to define a
> > >>>>>>>>>      VM_BIND_MAX_QUEUE
> > >>>>>>>>>      >>>and somehow export it to user (I am thinking of
> > >>>>>>>>>embedding it in
> > >>>>>>>>>      >>>I915_PARAM_HAS_VM_BIND. bits[0]->HAS_VM_BIND,
> > >>>>bits[1-3]->'n'
> > >>>>>>>>>      meaning 2^n
> > >>>>>>>>>      >>>queues.)
> > >>>>>>>>>      >>
> > >>>>>>>>>      >>Ah, I think you are talking about I915_EXEC_RING_MASK
> > >>>>>>>>>(0x3f) which
> > >>>>>>>>>      execbuf3
> > >>>>>>>>>
> > >>>>>>>>>    Yup!  That's exactly the limit I was talking about.
> > >>>>>>>>>
> > >>>>>>>>>      >>will also have. So, we can simply define in
> vm_bind/unbind
> > >>>>>>>>>      structures,
> > >>>>>>>>>      >>
> > >>>>>>>>>      >>#define I915_VM_BIND_MAX_QUEUE   64
> > >>>>>>>>>      >>        __u32 queue;
> > >>>>>>>>>      >>
> > >>>>>>>>>      >>I think that will keep things simple.
> > >>>>>>>>>      >
> > >>>>>>>>>      >Hmmm? What does the execbuf2 limit have to do with how
> > >>>>many engines
> > >>>>>>>>>      >the hardware can have? I suggest not to do that.
> > >>>>>>>>>      >
> > >>>>>>>>>      >The change which added this:
> > >>>>>>>>>      >
> > >>>>>>>>>      >       if (set.num_engines > I915_EXEC_RING_MASK + 1)
> > >>>>>>>>>      >               return -EINVAL;
> > >>>>>>>>>      >
> > >>>>>>>>>      >to context creation needs to be undone, so let users
> > >>>>>>>>>create engine
> > >>>>>>>>>      >maps with all hardware engines, and let execbuf3 access
> > >>>>>>>>>them all.
> > >>>>>>>>>      >
> > >>>>>>>>>
> > >>>>>>>>>      The earlier plan was to carry I915_EXEC_RING_MASK (0x3f) to
> > >>>>>>>>>execbuff3 also.
> > >>>>>>>>>      Hence, I was using the same limit for VM_BIND queues
> > >>>>>>>>>(64, or 65 if we
> > >>>>>>>>>      make it N+1).
> > >>>>>>>>>      But, as discussed in other thread of this RFC series, we
> > >>>>>>>>>are planning
> > >>>>>>>>>      to drop this I915_EXEC_RING_MASK in execbuff3. So,
> > >>>>there won't be
> > >>>>>>>>>      any uapi that limits the number of engines (and hence
> > >>>>>>>>>the vm_bind
> > >>>>>>>>>      queues
> > >>>>>>>>>      need to be supported).
> > >>>>>>>>>
> > >>>>>>>>>      If we leave the number of vm_bind queues to be
> > >>>>arbitrarily large
> > >>>>>>>>>      (__u32 queue_idx), then we need to have a hashmap for
> > >>>>>>>>>queue (a wq,
> > >>>>>>>>>      work_item and a linked list) lookup from the user
> > >>>>>>>>>specified queue
> > >>>>>>>>>      index.
> > >>>>>>>>>      Other option is to just put some hard limit (say 64 or
> > >>>>>>>>>65) and use
> > >>>>>>>>>      an array of queues in VM (each created upon first use).
> > >>>>>>>>>I prefer this.
> > >>>>>>>>>
> > >>>>>>>>>    I don't get why a VM_BIND queue is any different from any
> > >>>>>>>>>other queue or
> > >>>>>>>>>    userspace-visible kernel object.  But I'll leave those
> > >>>>>>>>>details up to
> > >>>>>>>>>    danvet or whoever else might be reviewing the
> > implementation.
> > >>>>>>>>>    --Jason
> > >>>>>>>>>
> > >>>>>>>>>  I kind of agree here. Wouldn't it be simpler to have the bind
> > >>>>>>>>>queue created
> > >>>>>>>>>  like the others when we build the engine map?
> > >>>>>>>>>
> > >>>>>>>>>  For userspace it's then just a matter of selecting the right
> > >>>>>>>>>queue ID when
> > >>>>>>>>>  submitting.
> > >>>>>>>>>
> > >>>>>>>>>  If there is ever a possibility to have this work on the GPU,
> > >>>>>>>>>it would be
> > >>>>>>>>>  all ready.
> > >>>>>>>>>
> > >>>>>>>>
> > >>>>>>>>I did sync offline with Matt Brost on this.
> > >>>>>>>>We can add a VM_BIND engine class and let user create
> VM_BIND
> > >>>>>>>>engines (queues).
> > >>>>>>>>The problem is, in i915 the engine creation interface is bound to
> > >>>>>>>>gem_context.
> > >>>>>>>>So, in vm_bind ioctl, we would need both context_id and
> > >>>>>>>>queue_idx for proper
> > >>>>>>>>lookup of the user created engine. This is a bit awkward as
> > >>>>vm_bind is an
> > >>>>>>>>interface to VM (address space) and has nothing to do with
> > >>>>gem_context.
> > >>>>>>>
> > >>>>>>>
> > >>>>>>>A gem_context has a single vm object right?
> > >>>>>>>
> > >>>>>>>Set through I915_CONTEXT_PARAM_VM at creation or given a
> > default
> > >>>>>>>one if not.
> > >>>>>>>
> > >>>>>>>So it's just like picking up the vm like it's done at execbuffer
> > >>>>>>>time right now : eb->context->vm
> > >>>>>>>
> > >>>>>>
> > >>>>>>Are you suggesting replacing 'vm_id' with 'context_id' in the
> > >>>>>>VM_BIND/UNBIND
> > >>>>>>ioctl and probably call it CONTEXT_BIND/UNBIND, because VM can
> > be
> > >>>>>>obtained
> > >>>>>>from the context?
> > >>>>>
> > >>>>>
> > >>>>>Yes, because if we go for engines, they're associated with a context
> > >>>>>and so also associated with the VM bound to the context.
> > >>>>>
> > >>>>
> > >>>>Hmm...context doesn't sound like the right interface. It should be
> > >>>>VM and engine (independent of context). Engine can be virtual or soft
> > >>>>engine (kernel thread), each with its own queue. We can add an
> > >>>>interface
> > >>>>to create such engines (independent of context). But we are anyway
> > >>>>implicitly creating it when user uses a new queue_idx. If in future
> > >>>>we have hardware engines for VM_BIND operation, we can have that
> > >>>>explicit interface to create engine instances and the queue_index
> > >>>>in vm_bind/unbind will point to those engines.
> > >>>>Anyone has any thoughts? Daniel?
> > >>>
> > >>>Exposing gem_context or intel_context to user space is a strange
> > >>>concept to me. A context represents some hw resources that are used
> > >>>to complete a certain task. User space should only care about allocating
> > >>>some resources (memory, queues) and submitting tasks to queues. But user
> > >>>space doesn't care how a certain task is mapped to a HW context -
> > >>>driver/guc should take care of this.
> > >>>
> > >>>So a cleaner interface to me is: user space creates a vm, creates a
> > >>>gem object, vm_binds it to the vm; allocates queues (these internally
> > >>>represent compute or blitter HW; a queue can be virtual to the user) for
> > >>>this vm; submits tasks to queues. User can create multiple queues
> > >>>under one vm. One queue is only for one vm.
> > >>>
> > >>>The i915 driver/guc manages the hw compute or blitter resources, which
> > >>>is transparent to user space. When i915 or guc decides to schedule
> > >>>a queue (run tasks on that queue), a HW engine will be picked up and
> > >>>set up properly for the vm of that queue (i.e., switch to the page
> > >>>tables of that vm) - this is a context switch.
> > >>>
> > >>>From the vm_bind perspective, it simply binds a gem_object to a vm.
> > >>>Engine/queue is not a parameter to vm_bind, as any engine can be
> > >>>picked up by i915/guc to execute a task using the vm-bound va.
> > >>>
> > >>>I didn't completely follow the discussion here. Just share some
> > >>>thoughts.
> > >>>
> > >>
> > >>Yah, I agree.
> > >>
> > >>Lionel,
> > >>How about we define the queue as
> > >>union {
> > >>       __u32 queue_idx;
> > >>       __u64 rsvd;
> > >>}
> > >>
> > >>If required, we can extend by expanding the 'rsvd' field to <ctx_id,
> > >>queue_idx> later
> > >>with a flag.
> > >>
> > >>Niranjana
> > >
> > >
> > >I did not really understand Oak's comment nor what you're suggesting
> > >here to be honest.
> > >
> > >
> > >First the GEM context is already exposed to userspace. It's explicitly
> > >created by userspace with DRM_IOCTL_I915_GEM_CONTEXT_CREATE.
> > >
> > >We give the GEM context id in every execbuffer we do with
> > >drm_i915_gem_execbuffer2::rsvd1.
> > >
> > >It's still in the new execbuffer3 proposal being discussed.
> > >
> > >
> > >Second, the GEM context is also where we set the VM with
> > >I915_CONTEXT_PARAM_VM.
> > >
> > >
> > >Third, the GEM context also has the list of engines with
> > >I915_CONTEXT_PARAM_ENGINES.
> > >
> >
> > Yes, the execbuf and engine map creation are tied to gem_context.
> > (which probably is not the best interface.)
> >
> > >
> > >So it makes sense to me to dispatch the vm_bind operation to a GEM
> > >context, to a given vm_bind queue, because it's got all the
> > >information required :
> > >
> > >    - the list of new vm_bind queues
> > >
> > >    - the vm that is going to be modified
> > >
> >
> > But the operation is performed here on the address space (VM) which
> > can have multiple gem_contexts referring to it. So, VM is the right
> > interface here. We need not 'gem_context'ify it.
> >
> > All we need is multiple queue support for the address space (VM).
> > Going to gem_context for that just because we have engine creation
> > support there seems unnecessary and not correct to me.
> >
> > >
> > >Otherwise where do the vm_bind queues live?
> > >
> > >In the i915/drm fd object?
> > >
> > >That would mean that all the GEM contexts are sharing the same vm_bind
> > >queues.
> > >
> >
> > Not all, only the gem contexts that are using the same address space (VM).
> > But to me the right way to describe it would be that "VM will be using those
> > queues".
> 
> 
> I hope by "queue" here you mean a HW resource that will later be used to
> execute the job, for example a ccs compute engine. Of course queue can be
> virtual so user can create more queues than what hw physically has.
> 
> To express the concept of "VM will be using those queues", I think it makes
> sense to have a create_queue(vm) function taking a vm parameter. This
> means this queue is created for the purpose of submitting jobs under this VM.
> Later on, we can submit job (referring to objects vm_bound to the same vm)
> to the queue. The vm_bind ioctl doesn’t need to have a queue parameter, just
> vm_bind (object, va, vm).
> 
> I hope the "queue" here is not the engine used to perform the vm_bind
> operation itself. But if you meant a queue/engine to perform vm_bind itself
> (vs a queue/engine for later job submission), then we can discuss more. I
> know the xe driver has a similar concept and I think aligning the design
> early can benefit the migration to the xe driver.

Oops, I read more on this thread and it turned out the vm_bind queue here is actually used to perform vm bind/unbind operations. The XE driver has a similar concept (except it is called engine_id there). So having a queue_idx parameter is closer to the xe design.

That said, I still feel having a queue_idx parameter to vm_bind is a bit awkward. Vm_bind can be performed without any GPU engines, i.e., the CPU itself can complete a vm_bind as long as the CPU has access to the gpu's local memory. So the queue here has to be a virtual concept - it doesn't have a hard mapping to a GPU blitter engine.

Can someone summarize what the benefit of the queue_idx parameter is? Is it for the purpose of ordering vm_bind operations and later gpu jobs?
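
If the benefit is ordering, a minimal sketch of how I understand it (hypothetical helper signatures, for illustration only): binds on the same queue execute in submission order, while different queues stay independent of each other.

    /* queue 0: immediate binds, no in-fence, never blocked */
    vm_bind(vm, /* queue_idx */ 0, bo_a, va_a, /* in_fence */ -1, &out_a);

    /* queue 1: sparse binds gated on a compute job's fence */
    vm_bind(vm, /* queue_idx */ 1, bo_b, va_b, in_sparse, &out_b);

    /* out_a can signal even while in_sparse is still pending */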

> 
> Regards,
> Oak
> 
> >
> > Niranjana
> >
> > >
> > >intel_context or GuC are internal details we're not concerned about.
> > >
> > >I don't really see the connection with the GEM context.
> > >
> > >
> > >Maybe Oak has a different use case than Vulkan.
> > >
> > >
> > >-Lionel
> > >
> > >
> > >>
> > >>>Regards,
> > >>>Oak
> > >>>
> > >>>>
> > >>>>Niranjana
> > >>>>
> > >>>>>
> > >>>>>>I think the interface is clean as an interface to VM. It is
> > >>>>>>only that we
> > >>>>>>don't have a clean way to create a raw VM_BIND engine (not
> > >>>>>>associated with
> > >>>>>>any context) with i915 uapi.
> > >>>>>>Maybe we can add such an interface, but I don't think that is
> > >>>>>>worth it
> > >>>>>>(we might as well just use a queue_idx in VM_BIND/UNBIND ioctl as I
> > >>>>>>mentioned above).
> > >>>>>>Anyone has any thoughts?
> > >>>>>>
> > >>>>>>>
> > >>>>>>>>Another problem is, if two VMs are binding with the same defined
> > >>>>>>>>engine,
> > >>>>>>>>binding on VM1 can get unnecessary blocked by binding on VM2
> > >>>>>>>>(which may be
> > >>>>>>>>waiting on its in_fence).
> > >>>>>>>
> > >>>>>>>
> > >>>>>>>Maybe I'm missing something, but how can you have 2 vm objects
> > >>>>>>>with a single gem_context right now?
> > >>>>>>>
> > >>>>>>
> > >>>>>>No, we don't have 2 VMs for a gem_context.
> > >>>>>>Say if ctx1 with vm1 and ctx2 with vm2.
> > >>>>>>First vm_bind call was for vm1 with q_idx 1 in ctx1 engine map.
> > >>>>>>Second vm_bind call was for vm2 with q_idx 2 in ctx2 engine map. If
> > >>>>>>those two queue indices point to the same underlying vm_bind engine,
> > >>>>>>then the second vm_bind call gets blocked until the first vm_bind
> > >>>>>>call's 'in' fence is triggered and the bind completes.
> > >>>>>>
> > >>>>>>With per VM queues, this is not a problem as two VMs will not end up
> > >>>>>>sharing the same queue.
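> > >>>>>>
> > >>>>>>Illustrative only (hypothetical call shape):
> > >>>>>>
> > >>>>>>  vm_bind(vm1, queue 0, ..., in_fence A); /* waits on fence A */
> > >>>>>>  vm_bind(vm2, queue 0, ...); /* vm2 has its own queue 0, so it
> > >>>>>>                                 is not blocked behind fence A */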
> > >>>>>>
> > >>>>>>BTW, I just posted an updated PATCH series.
> > >>>>>>https://www.spinics.net/lists/dri-devel/msg350483.html
> > >>>>>>
> > >>>>>>Niranjana
> > >>>>>>
> > >>>>>>>
> > >>>>>>>>
> > >>>>>>>>So, my preference here is to just add a 'u32 queue' index in
> > >>>>>>>>vm_bind/unbind
> > >>>>>>>>ioctl, and the queues are per VM.
> > >>>>>>>>
> > >>>>>>>>Niranjana
> > >>>>>>>>
> > >>>>>>>>>  Thanks,
> > >>>>>>>>>
> > >>>>>>>>>  -Lionel
> > >>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>>      Niranjana
> > >>>>>>>>>
> > >>>>>>>>>      >Regards,
> > >>>>>>>>>      >
> > >>>>>>>>>      >Tvrtko
> > >>>>>>>>>      >
> > >>>>>>>>>      >>
> > >>>>>>>>>      >>Niranjana
> > >>>>>>>>>      >>
> > >>>>>>>>>      >>>
> > >>>>>>>>>      >>>>   I am trying to see how many queues we need and
> > >>>>>>>>>don't want it to
> > >>>>>>>>>      be
> > >>>>>>>>>      >>>>   arbitrarily
> > >>>>>>>>>      >>>>   large and unduly blow up memory usage and
> > >>>>>>>>>complexity in i915
> > >>>>>>>>>      driver.
> > >>>>>>>>>      >>>>
> > >>>>>>>>>      >>>> I expect a Vulkan driver to use at most 2 in the
> > >>>>>>>>>vast majority
> > >>>>>>>>>      >>>>of cases. I
> > >>>>>>>>>      >>>> could imagine a client wanting to create more
> > >>>>than 1 sparse
> > >>>>>>>>>      >>>>queue in which
> > >>>>>>>>>      >>>> case, it'll be N+1 but that's unlikely. As far as
> > >>>>>>>>>complexity
> > >>>>>>>>>      >>>>goes, once
> > >>>>>>>>>      >>>> you allow two, I don't think the complexity is
> > >>>>going up by
> > >>>>>>>>>      >>>>allowing N.  As
> > >>>>>>>>>      >>>> for memory usage, creating more queues means more
> > >>>>>>>>>memory.  That's
> > >>>>>>>>>      a
> > >>>>>>>>>      >>>> trade-off that userspace can make. Again, the
> > >>>>>>>>>expected number
> > >>>>>>>>>      >>>>here is 1
> > >>>>>>>>>      >>>> or 2 in the vast majority of cases so I don't think
> > >>>>>>>>>you need to
> > >>>>>>>>>      worry.
> > >>>>>>>>>      >>>
> > >>>>>>>>>      >>>Ok, will start with n=3 meaning 8 queues.
> > >>>>>>>>>      >>>That would require us to create 8 workqueues.
> > >>>>>>>>>      >>>We can change 'n' later if required.
> > >>>>>>>>>      >>>
> > >>>>>>>>>      >>>Niranjana
> > >>>>>>>>>      >>>
> > >>>>>>>>>      >>>>
> > >>>>>>>>>      >>>>   >     Why? Because Vulkan has two basic kind of bind
> > >>>>>>>>>      >>>>operations and we
> > >>>>>>>>>      >>>>   don't
> > >>>>>>>>>      >>>>   >     want any dependencies between them:
> > >>>>>>>>>      >>>>   >      1. Immediate.  These happen right after BO
> > >>>>>>>>>creation or
> > >>>>>>>>>      >>>>maybe as
> > >>>>>>>>>      >>>>   part of
> > >>>>>>>>>      >>>>   > vkBindImageMemory() or VkBindBufferMemory().  These
> > >>>>>>>>>      >>>>don't happen
> > >>>>>>>>>      >>>>   on a
> > >>>>>>>>>      >>>>   >     queue and we don't want them serialized
> > >>>>>>>>>with anything.       To
> > >>>>>>>>>      >>>>   synchronize
> > >>>>>>>>>      >>>>   >     with submit, we'll have a syncobj in the
> > >>>>>>>>>VkDevice which
> > >>>>>>>>>      is
> > >>>>>>>>>      >>>>   signaled by
> > >>>>>>>>>      >>>>   >     all immediate bind operations and make
> > >>>>>>>>>submits wait on
> > >>>>>>>>>      it.
> > >>>>>>>>>      >>>>   >      2. Queued (sparse): These happen on a
> > >>>>>>>>>VkQueue which may
> > >>>>>>>>>      be the
> > >>>>>>>>>      >>>>   same as
> > >>>>>>>>>      >>>>   >     a render/compute queue or may be its own
> > >>>>>>>>>queue.  It's up
> > >>>>>>>>>      to us
> > >>>>>>>>>      >>>>   what we
> > >>>>>>>>>      >>>>   >     want to advertise.  From the Vulkan API
> > >>>>>>>>>PoV, this is like
> > >>>>>>>>>      any
> > >>>>>>>>>      >>>>   other
> > >>>>>>>>>      >>>>   >     queue. Operations on it wait on and signal
> > >>>>>>>>>semaphores.       If we
> > >>>>>>>>>      >>>>   have a
> > >>>>>>>>>      >>>>   >     VM_BIND engine, we'd provide syncobjs to
> > >>>>wait and
> > >>>>>>>>>      >>>>signal just like
> > >>>>>>>>>      >>>>   we do
> > >>>>>>>>>      >>>>   >     in execbuf().
> > >>>>>>>>>      >>>>   >     The important thing is that we don't want
> > >>>>>>>>>one type of
> > >>>>>>>>>      >>>>operation to
> > >>>>>>>>>      >>>>   block
> > >>>>>>>>>      >>>>   >     on the other.  If immediate binds are
> > >>>>>>>>>blocking on sparse
> > >>>>>>>>>      binds,
> > >>>>>>>>>      >>>>   it's
> > >>>>>>>>>      >>>>   >     going to cause over-synchronization issues.
> > >>>>>>>>>      >>>>   >     In terms of the internal implementation, I
> > >>>>>>>>>know that
> > >>>>>>>>>      >>>>there's going
> > >>>>>>>>>      >>>>   to be
> > >>>>>>>>>      >>>>   >     a lock on the VM and that we can't actually
> > >>>>>>>>>do these
> > >>>>>>>>>      things in
> > >>>>>>>>>      >>>>   > parallel.  That's fine. Once the dma_fences have
> > >>>>>>>>>      signaled and
> > >>>>>>>>>      >>>>   we're
> > >>>>>>>>>      >>>>
> > >>>>>>>>>      >>>>   That's correct. It is like a single VM_BIND
> > >>>>engine with
> > >>>>>>>>>      >>>>multiple queues
> > >>>>>>>>>      >>>>   feeding to it.
> > >>>>>>>>>      >>>>
> > >>>>>>>>>      >>>> Right.  As long as the queues themselves are
> > >>>>>>>>>independent and
> > >>>>>>>>>      >>>>can block on
> > >>>>>>>>>      >>>> dma_fences without holding up other queues, I think
> > >>>>>>>>>we're fine.
> > >>>>>>>>>      >>>>
> > >>>>>>>>>      >>>>   > unblocked to do the bind operation, I don't care if
> > >>>>>>>>>      >>>>there's a bit
> > >>>>>>>>>      >>>>   of
> > >>>>>>>>>      >>>>   > synchronization due to locking.  That's
> > >>>>>>>>>expected.  What
> > >>>>>>>>>      >>>>we can't
> > >>>>>>>>>      >>>>   afford
> > >>>>>>>>>      >>>>   >     to have is an immediate bind operation
> > >>>>>>>>>suddenly blocking
> > >>>>>>>>>      on a
> > >>>>>>>>>      >>>>   sparse
> > >>>>>>>>>      >>>>   > operation which is blocked on a compute job
> > >>>>>>>>>that's going
> > >>>>>>>>>      to run
> > >>>>>>>>>      >>>>   for
> > >>>>>>>>>      >>>>   >     another 5ms.
> > >>>>>>>>>      >>>>
> > >>>>>>>>>      >>>>   As the VM_BIND queue is per VM, VM_BIND on one VM
> > >>>>>>>>>doesn't block
> > >>>>>>>>>      the
> > >>>>>>>>>      >>>>   VM_BIND
> > >>>>>>>>>      >>>>   on other VMs. I am not sure about usecases
> > >>>>here, but just
> > >>>>>>>>>      wanted to
> > >>>>>>>>>      >>>>   clarify.
> > >>>>>>>>>      >>>>
> > >>>>>>>>>      >>>> Yes, that's what I would expect.
> > >>>>>>>>>      >>>> --Jason
> > >>>>>>>>>      >>>>
> > >>>>>>>>>      >>>>   Niranjana
> > >>>>>>>>>      >>>>
> > >>>>>>>>>      >>>>   >     For reference, Windows solves this by allowing
> > >>>>>>>>>      arbitrarily many
> > >>>>>>>>>      >>>>   paging
> > >>>>>>>>>      >>>>   >     queues (what they call a VM_BIND
> > >>>>>>>>>engine/queue).  That
> > >>>>>>>>>      >>>>design works
> > >>>>>>>>>      >>>>   >     pretty well and solves the problems in
> > >>>>>>>>>      >>>>   >     question.  Again, we could
> > >>>>>>>>>      >>>>   just
> > >>>>>>>>>      >>>>   >     make everything out-of-order and require
> > >>>>>>>>>using syncobjs
> > >>>>>>>>>      >>>>to order
> > >>>>>>>>>      >>>>   things
> > >>>>>>>>>      >>>>   >     as userspace wants. That'd be fine too.
> > >>>>>>>>>      >>>>   >     One more note while I'm here: danvet said
> > >>>>>>>>>something on
> > >>>>>>>>>      >>>>IRC about
> > >>>>>>>>>      >>>>   VM_BIND
> > >>>>>>>>>      >>>>   >     queues waiting for syncobjs to
> > >>>>>>>>>materialize.  We don't
> > >>>>>>>>>      really
> > >>>>>>>>>      >>>>   want/need
> > >>>>>>>>>      >>>>   >     this. We already have all the machinery in
> > >>>>>>>>>userspace to
> > >>>>>>>>>      handle
> > >>>>>>>>>      >>>>   > wait-before-signal and waiting for syncobj
> > >>>>>>>>>fences to
> > >>>>>>>>>      >>>>materialize
> > >>>>>>>>>      >>>>   and
> > >>>>>>>>>      >>>>   >     that machinery is on by default.  It
> > >>>>would actually
> > >>>>>>>>>      >>>>take MORE work
> > >>>>>>>>>      >>>>   in
> > >>>>>>>>>      >>>>   >     Mesa to turn it off and take advantage of
> > >>>>>>>>>the kernel
> > >>>>>>>>>      >>>>being able to
> > >>>>>>>>>      >>>>   wait
> > >>>>>>>>>      >>>>   >     for syncobjs to materialize. Also, getting
> > >>>>>>>>>that right is
> > >>>>>>>>>      >>>>   ridiculously
> > >>>>>>>>>      >>>>   >     hard and I really don't want to get it
> > >>>>>>>>>wrong in kernel
> > >>>>>>>>>      >>>>space.  When we
> > >>>>>>>>>      >>>>   >     do memory fences, wait-before-signal will
> > >>>>>>>>>be a thing.  We
> > >>>>>>>>>      don't
> > >>>>>>>>>      >>>>   need to
> > >>>>>>>>>      >>>>   >     try and make it a thing for syncobj.
> > >>>>>>>>>      >>>>   >     --Jason
> > >>>>>>>>>      >>>>   >
> > >>>>>>>>>      >>>>   >   Thanks Jason,
> > >>>>>>>>>      >>>>   >
> > >>>>>>>>>      >>>>   >   I missed the bit in the Vulkan spec that
> > >>>>>>>>>we're allowed to
> > >>>>>>>>>      have a
> > >>>>>>>>>      >>>>   sparse
> > >>>>>>>>>      >>>>   >   queue that does not implement either graphics
> > >>>>>>>>>or compute
> > >>>>>>>>>      >>>>operations
> > >>>>>>>>>      >>>>   :
> > >>>>>>>>>      >>>>   >
> > >>>>>>>>>      >>>>   >     "While some implementations may include
> > >>>>>>>>>      >>>> VK_QUEUE_SPARSE_BINDING_BIT
> > >>>>>>>>>      >>>>   >     support in queue families that also include
> > >>>>>>>>>      >>>>   >
> > >>>>>>>>>      >>>>   > graphics and compute support, other
> > >>>>>>>>>implementations may
> > >>>>>>>>>      only
> > >>>>>>>>>      >>>>   expose a
> > >>>>>>>>>      >>>>   > VK_QUEUE_SPARSE_BINDING_BIT-only queue
> > >>>>>>>>>      >>>>   >
> > >>>>>>>>>      >>>>   > family."
> > >>>>>>>>>      >>>>   >
> > >>>>>>>>>      >>>>   >   So it can all be all a vm_bind engine that
> > >>>>just does
> > >>>>>>>>>      bind/unbind
> > >>>>>>>>>      >>>>   > operations.
> > >>>>>>>>>      >>>>   >
> > >>>>>>>>>      >>>>   >   But yes we need another engine for the
> > >>>>>>>>>immediate/non-sparse
> > >>>>>>>>>      >>>>   operations.
> > >>>>>>>>>      >>>>   >
> > >>>>>>>>>      >>>>   >   -Lionel
> > >>>>>>>>>      >>>>   >
> > >>>>>>>>>      >>>>   >         >
> > >>>>>>>>>      >>>>   > Daniel, any thoughts?
> > >>>>>>>>>      >>>>   >
> > >>>>>>>>>      >>>>   > Niranjana
> > >>>>>>>>>      >>>>   >
> > >>>>>>>>>      >>>>   > >Matt
> > >>>>>>>>>      >>>>   >       >
> > >>>>>>>>>      >>>>   > >>
> > >>>>>>>>>      >>>>   > >> Sorry I noticed this late.
> > >>>>>>>>>      >>>>   > >>
> > >>>>>>>>>      >>>>   > >>
> > >>>>>>>>>      >>>>   > >> -Lionel
> > >>>>>>>>>      >>>>   > >>
> > >>>>>>>>>      >>>>   > >>
> > >>>>>>>
> > >>>>>>>
> > >>>>>
> > >
> > >

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [Intel-gfx] [RFC v3 3/3] drm/doc/rfc: VM_BIND uapi definition
  2022-06-08 20:45                         ` Niranjana Vishwanathapura
@ 2022-06-15  9:49                           ` Tvrtko Ursulin
  -1 siblings, 0 replies; 121+ messages in thread
From: Tvrtko Ursulin @ 2022-06-15  9:49 UTC (permalink / raw)
  To: Niranjana Vishwanathapura
  Cc: Wilson, Chris P, Zanoni, Paulo R, intel-gfx, dri-devel,
	Hellstrom, Thomas, Lionel Landwerlin, Vetter, Daniel,
	christian.koenig


On 08/06/2022 21:45, Niranjana Vishwanathapura wrote:
> On Wed, Jun 08, 2022 at 09:54:24AM +0100, Tvrtko Ursulin wrote:
>>
>> On 08/06/2022 09:45, Lionel Landwerlin wrote:
>>> On 08/06/2022 11:36, Tvrtko Ursulin wrote:
>>>>
>>>> On 08/06/2022 07:40, Lionel Landwerlin wrote:
>>>>> On 03/06/2022 09:53, Niranjana Vishwanathapura wrote:
>>>>>> On Wed, Jun 01, 2022 at 10:08:35PM -0700, Niranjana 
>>>>>> Vishwanathapura wrote:
>>>>>>> On Wed, Jun 01, 2022 at 11:27:17AM +0200, Daniel Vetter wrote:
>>>>>>>> On Wed, 1 Jun 2022 at 11:03, Dave Airlie <airlied@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>> On Tue, 24 May 2022 at 05:20, Niranjana Vishwanathapura
>>>>>>>>> <niranjana.vishwanathapura@intel.com> wrote:
>>>>>>>>>>
>>>>>>>>>> On Thu, May 19, 2022 at 04:07:30PM -0700, Zanoni, Paulo R wrote:
>>>>>>>>>>> On Tue, 2022-05-17 at 11:32 -0700, Niranjana 
>>>>>>>>>> Vishwanathapura wrote:
>>>>>>>>>>>> VM_BIND and related uapi definitions
>>>>>>>>>>>>
>>>>>>>>>>>> v2: Ensure proper kernel-doc formatting with cross references.
>>>>>>>>>>>>      Also add new uapi and documentation as per review comments
>>>>>>>>>>>>      from Daniel.
>>>>>>>>>>>>
>>>>>>>>>>>> Signed-off-by: Niranjana Vishwanathapura 
>>>>>>>>>> <niranjana.vishwanathapura@intel.com>
>>>>>>>>>>>> ---
>>>>>>>>>>>>   Documentation/gpu/rfc/i915_vm_bind.h | 399 
>>>>>>>>>> +++++++++++++++++++++++++++
>>>>>>>>>>>>   1 file changed, 399 insertions(+)
>>>>>>>>>>>>   create mode 100644 Documentation/gpu/rfc/i915_vm_bind.h
>>>>>>>>>>>>
>>>>>>>>>>>> diff --git a/Documentation/gpu/rfc/i915_vm_bind.h 
>>>>>>>>>> b/Documentation/gpu/rfc/i915_vm_bind.h
>>>>>>>>>>>> new file mode 100644
>>>>>>>>>>>> index 000000000000..589c0a009107
>>>>>>>>>>>> --- /dev/null
>>>>>>>>>>>> +++ b/Documentation/gpu/rfc/i915_vm_bind.h
>>>>>>>>>>>> @@ -0,0 +1,399 @@
>>>>>>>>>>>> +/* SPDX-License-Identifier: MIT */
>>>>>>>>>>>> +/*
>>>>>>>>>>>> + * Copyright © 2022 Intel Corporation
>>>>>>>>>>>> + */
>>>>>>>>>>>> +
>>>>>>>>>>>> +/**
>>>>>>>>>>>> + * DOC: I915_PARAM_HAS_VM_BIND
>>>>>>>>>>>> + *
>>>>>>>>>>>> + * VM_BIND feature availability.
>>>>>>>>>>>> + * See typedef drm_i915_getparam_t param.
>>>>>>>>>>>> + */
>>>>>>>>>>>> +#define I915_PARAM_HAS_VM_BIND 57
>>>>>>>>>>>> +
>>>>>>>>>>>> +/**
>>>>>>>>>>>> + * DOC: I915_VM_CREATE_FLAGS_USE_VM_BIND
>>>>>>>>>>>> + *
>>>>>>>>>>>> + * Flag to opt-in for VM_BIND mode of binding 
>>>>>>>>>> during VM creation.
>>>>>>>>>>>> + * See struct drm_i915_gem_vm_control flags.
>>>>>>>>>>>> + *
>>>>>>>>>>>> + * A VM in VM_BIND mode will not support the older 
>>>>>>>>>> execbuff mode of binding.
>>>>>>>>>>>> + * In VM_BIND mode, execbuff ioctl will not accept 
>>>>>>>>>> any execlist (ie., the
>>>>>>>>>>>> + * &drm_i915_gem_execbuffer2.buffer_count must be 0).
>>>>>>>>>>>> + * Also, &drm_i915_gem_execbuffer2.batch_start_offset and
>>>>>>>>>>>> + * &drm_i915_gem_execbuffer2.batch_len must be 0.
>>>>>>>>>>>> + * DRM_I915_GEM_EXECBUFFER_EXT_BATCH_ADDRESSES 
>>>>>>>>>> extension must be provided
>>>>>>>>>>>> + * to pass in the batch buffer addresses.
>>>>>>>>>>>> + *
>>>>>>>>>>>> + * Additionally, I915_EXEC_NO_RELOC, I915_EXEC_HANDLE_LUT and
>>>>>>>>>>>> + * I915_EXEC_BATCH_FIRST of 
>>>>>>>>>> &drm_i915_gem_execbuffer2.flags must be 0
>>>>>>>>>>>> + * (not used) in VM_BIND mode. 
>>>>>>>>>> I915_EXEC_USE_EXTENSIONS flag must always be
>>>>>>>>>>>> + * set (See struct 
>>>>>>>>>> drm_i915_gem_execbuffer_ext_batch_addresses).
>>>>>>>>>>>> + * The buffers_ptr, buffer_count, 
>>>>>>>>>> batch_start_offset and batch_len fields
>>>>>>>>>>>> + * of struct drm_i915_gem_execbuffer2 are also not 
>>>>>>>>>> used and must be 0.
>>>>>>>>>>>> + */
>>>>>>>>>>>
>>>>>>>>>>> From that description, it seems we have:
>>>>>>>>>>>
>>>>>>>>>>> struct drm_i915_gem_execbuffer2 {
>>>>>>>>>>>         __u64 buffers_ptr;              -> must be 0 (new)
>>>>>>>>>>>         __u32 buffer_count;             -> must be 0 (new)
>>>>>>>>>>>         __u32 batch_start_offset;       -> must be 0 (new)
>>>>>>>>>>>         __u32 batch_len;                -> must be 0 (new)
>>>>>>>>>>>         __u32 DR1;                      -> must be 0 (old)
>>>>>>>>>>>         __u32 DR4;                      -> must be 0 (old)
>>>>>>>>>>>         __u32 num_cliprects; (fences)   -> must be 0 
>>>>>>>>>> since using extensions
>>>>>>>>>>>         __u64 cliprects_ptr; (fences, extensions) -> 
>>>>>>>>>> contains an actual pointer!
>>>>>>>>>>>         __u64 flags;                    -> some flags 
>>>>>>>>>> must be 0 (new)
>>>>>>>>>>>         __u64 rsvd1; (context info)     -> repurposed field 
>>>>>>>>>>> (old)
>>>>>>>>>>>         __u64 rsvd2;                    -> unused
>>>>>>>>>>> };
>>>>>>>>>>>
>>>>>>>>>>> Based on that, why can't we just get 
>>>>>>>>>> drm_i915_gem_execbuffer3 instead
>>>>>>>>>>> of adding even more complexity to an already abused 
>>>>>>>>>> interface? While
>>>>>>>>>>> the Vulkan-like extension thing is really nice, I don't think 
>>>>>>>>>>> what
>>>>>>>>>>> we're doing here is extending the ioctl usage, we're completely
>>>>>>>>>>> changing how the base struct should be interpreted 
>>>>>>>>>> based on how the VM
>>>>>>>>>>> was created (which is an entirely different ioctl).
>>>>>>>>>>>
>>>>>>>>>>> From Rusty Russell's API Design grading, 
>>>>>>>>>> drm_i915_gem_execbuffer2 is
>>>>>>>>>>> already at -6 without these changes. I think after 
>>>>>>>>>> vm_bind we'll need
>>>>>>>>>>> to create a -11 entry just to deal with this ioctl.
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> The only change here is removing the execlist support for VM_BIND
>>>>>>>>>> mode (other than natural extensions).
>>>>>>>>>> Adding a new execbuffer3 was considered, but I think we need 
>>>>>>>>>> to be careful
>>>>>>>>>> with that as that goes beyond the VM_BIND support, including 
>>>>>>>>>> any future
>>>>>>>>>> requirements (as we don't want an execbuffer4 after VM_BIND).
>>>>>>>>>
>>>>>>>>> Why not? it's not like adding extensions here is really that 
>>>>>>>>> different
>>>>>>>>> than adding new ioctls.
>>>>>>>>>
>>>>>>>>> I definitely think this deserves an execbuffer3 without even
>>>>>>>>> considering future requirements. Just to burn down the old
>>>>>>>>> requirements and pointless fields.
>>>>>>>>>
>>>>>>>>> Make execbuffer3 be vm bind only, no relocs, no legacy bits, 
>>>>>>>>> leave the
>>>>>>>>> older sw on execbuf2 forever.
>>>>>>>>
>>>>>>>> I guess another point in favour of execbuf3 would be that it's less
>>>>>>>> midlayer. If we share the entry point then there's quite a few 
>>>>>>>> vfuncs
>>>>>>>> needed to cleanly split out the vm_bind paths from the legacy
>>>>>>>> reloc/softpin paths.
>>>>>>>>
>>>>>>>> If we invert this and do execbuf3, then there's the existing ioctl
>>>>>>>> vfunc, and then we share code (where it even makes sense, probably
>>>>>>>> request setup/submit need to be shared, anything else is probably
>>>>>>>> cleaner to just copypaste) with the usual helper approach.
>>>>>>>>
>>>>>>>> Also that would guarantee that really none of the old concepts like
>>>>>>>> i915_active on the vma or vma open counts and all that stuff leaks
>>>>>>>> into the new vm_bind execbuf.
>>>>>>>>
>>>>>>>> Finally I also think that copypasting would make backporting 
>>>>>>>> easier,
>>>>>>>> or at least more flexible, since it should make it easier to 
>>>>>>>> have the
>>>>>>>> upstream vm_bind co-exist with all the other things we have. 
>>>>>>>> Without
>>>>>>>> huge amounts of conflicts (or at least much less) that pushing a 
>>>>>>>> pile
>>>>>>>> of vfuncs into the existing code would cause.
>>>>>>>>
>>>>>>>> So maybe we should do this?
>>>>>>>
>>>>>>> Thanks Dave, Daniel.
>>>>>>> There are a few things that will be common between execbuf2 and
>>>>>>> execbuf3, like request setup/submit (as you said), fence handling 
>>>>>>> (timeline fences, fence array, composite fences), engine selection,
>>>>>>> etc. Also, many of the 'flags' will be there in execbuf3 also (but
>>>>>>> bit position will differ).
>>>>>>> But I guess these should be fine as the suggestion here is to
>>>>>>> copy-paste the execbuff code and have shared code where
>>>>>>> possible.
>>>>>>> Besides, we can stop supporting some older features in execbuf3
>>>>>>> (like the fence array, in favor of newer timeline fences), which will
>>>>>>> further reduce common code.
>>>>>>>
>>>>>>> Ok, I will update this series by adding execbuf3 and send out soon.
>>>>>>>
>>>>>>
>>>>>> Does this sound reasonable?
>>>>>
>>>>>
>>>>> Thanks for proposing this. Some comments below.
>>>>>
>>>>>
>>>>>>
>>>>>> struct drm_i915_gem_execbuffer3 {
>>>>>>        __u32 ctx_id;        /* previously execbuffer2.rsvd1 */
>>>>>>
>>>>>>        __u32 batch_count;
>>>>>>        __u64 batch_addr_ptr;    /* Pointer to an array of batch 
>>>>>> gpu virtual addresses */
>>>>>>
>>>>>>        __u64 flags;
>>>>>> #define I915_EXEC3_RING_MASK              (0x3f)
>>>>>> #define I915_EXEC3_DEFAULT                (0<<0)
>>>>>> #define I915_EXEC3_RENDER                 (1<<0)
>>>>>> #define I915_EXEC3_BSD                    (2<<0)
>>>>>> #define I915_EXEC3_BLT                    (3<<0)
>>>>>> #define I915_EXEC3_VEBOX                  (4<<0)
>>>>>
>>>>>
>>>>> Shouldn't we use the new engine selection uAPI instead?
>>>>>
>>>>> We can already create an engine map with I915_CONTEXT_PARAM_ENGINES 
>>>>> in drm_i915_gem_context_create_ext_setparam.
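>>>>>
>>>>> A minimal sketch with that existing uapi (error handling omitted,
>>>>> assuming libdrm's drmIoctl()):
>>>>>
>>>>> I915_DEFINE_CONTEXT_PARAM_ENGINES(engines, 1) = {
>>>>>         .engines = { { I915_ENGINE_CLASS_RENDER, 0 } },
>>>>> };
>>>>> struct drm_i915_gem_context_create_ext_setparam p = {
>>>>>         .base  = { .name = I915_CONTEXT_CREATE_EXT_SETPARAM },
>>>>>         .param = {
>>>>>                 .param = I915_CONTEXT_PARAM_ENGINES,
>>>>>                 .value = (uintptr_t)&engines,
>>>>>                 .size  = sizeof(engines),
>>>>>         },
>>>>> };
>>>>> struct drm_i915_gem_context_create_ext create = {
>>>>>         .flags      = I915_CONTEXT_CREATE_FLAGS_USE_EXTENSIONS,
>>>>>         .extensions = (uintptr_t)&p,
>>>>> };
>>>>> drmIoctl(fd, DRM_IOCTL_I915_GEM_CONTEXT_CREATE_EXT, &create);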
>>>>>
>>>>> And you can also create virtual engines with the same extension.
>>>>>
>>>>> It feels like this could be a single u32 with the engine index (in 
>>>>> the context engine map).
>>>>
>>>> Yes I said the same yesterday.
>>>>
>>>> Also note that as you can no longer set engines on a default 
>>>> context, the question is whether userspace cares to use execbuf3 with it 
>>>> (the default context).
>>>>
>>>> If it does, it will need an alternative engine selection for that 
>>>> case. I was proposing class:instance rather than legacy cumbersome 
>>>> flags.
>>>>
>>>> If it does not, I mean if the decision is to only allow execbuf3 
>>>> with engine maps, then it leaves the default context a waste of 
>>>> kernel memory in the execbuf3 future. :( Don't know what to do there...
>>>>
>>>> Regards,
>>>>
>>>> Tvrtko
>>>
>>>
>>> Thanks Tvrtko, I only saw your reply after responding.
>>>
>>>
>>> Both Iris & Anv create a context with engines (if kernel supports it) 
>>> : 
>>> https://gitlab.freedesktop.org/mesa/mesa/-/blob/main/src/intel/common/intel_gem.c#L73 
>>>
>>>
>>>
>>> I think we should be fine with just a single engine id and we don't 
>>> care about the default context.
>>
>> I wonder if in this case we could stop creating the default context 
>> starting from a future "gen"? Otherwise, with engine-map-only execbuf3 
>> and execbuf3-only userspace, it would serve no purpose apart from 
>> wasting kernel memory.
>>
> 
> Thanks Tvrtko, Lionel.
> 
> I will be glad to remove these flags, just define a uint32 engine_id and
> mandate a context with a user engine map.
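> 
> Roughly (a sketch only, not the final uapi):
> 
> struct drm_i915_gem_execbuffer3 {
>        __u32 ctx_id;         /* previously execbuffer2.rsvd1 */
>        __u32 engine_id;      /* index into the context engine map */
>        __u32 batch_count;
>        __u32 rsvd;
>        __u64 batch_addr_ptr; /* array of batch gpu virtual addresses */
>        __u64 flags;          /* ring selection bits dropped */
> };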
> 
> Regarding removing the default context, yeah, it depends on from which gen
> onwards we will only be supporting execbuf3, with execbuf2 fully
> deprecated. Till then, we will have to keep it I guess :(.

Forgot about this sub-thread... I think it could be removed before 
execbuf2 is fully deprecated. We can make that decision with any new 
platform which needs UMD stack updates to be supported. But it is work 
for us to adjust IGT so I am not hopeful anyone will tackle it. We will 
just end up wasting memory.

Regards,

Tvrtko

^ permalink raw reply	[flat|nested] 121+ messages in thread


end of thread, other threads:[~2022-06-15  9:49 UTC | newest]

Thread overview: 121+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-05-17 18:32 [Intel-gfx] [RFC v3 0/3] drm/doc/rfc: i915 VM_BIND feature design + uapi Niranjana Vishwanathapura
2022-05-17 18:32 ` Niranjana Vishwanathapura
2022-05-17 18:32 ` [Intel-gfx] [RFC v3 1/3] drm/doc/rfc: VM_BIND feature design document Niranjana Vishwanathapura
2022-05-17 18:32   ` Niranjana Vishwanathapura
2022-05-19 22:52   ` Zanoni, Paulo R
2022-05-19 22:52     ` [Intel-gfx] " Zanoni, Paulo R
2022-05-23 19:05     ` Niranjana Vishwanathapura
2022-05-23 19:05       ` [Intel-gfx] " Niranjana Vishwanathapura
2022-05-23 19:08       ` Niranjana Vishwanathapura
2022-05-23 19:08         ` [Intel-gfx] " Niranjana Vishwanathapura
2022-05-24 10:08     ` Lionel Landwerlin
2022-06-01 14:25   ` Lionel Landwerlin
2022-06-01 20:28     ` Matthew Brost
2022-06-01 20:28       ` Matthew Brost
2022-06-02 20:11       ` Niranjana Vishwanathapura
2022-06-02 20:11         ` Niranjana Vishwanathapura
2022-06-02 20:35         ` Jason Ekstrand
2022-06-02 20:35           ` Jason Ekstrand
2022-06-03  7:20           ` Lionel Landwerlin
2022-06-03 23:51             ` Niranjana Vishwanathapura
2022-06-03 23:51               ` Niranjana Vishwanathapura
2022-06-07 17:12               ` Jason Ekstrand
2022-06-07 17:12                 ` Jason Ekstrand
2022-06-07 18:18                 ` Niranjana Vishwanathapura
2022-06-07 18:18                   ` Niranjana Vishwanathapura
2022-06-07 21:32                   ` Niranjana Vishwanathapura
2022-06-08  7:33                     ` Tvrtko Ursulin
2022-06-08 21:44                       ` Niranjana Vishwanathapura
2022-06-08 21:44                         ` Niranjana Vishwanathapura
2022-06-08 21:55                         ` Jason Ekstrand
2022-06-08 21:55                           ` Jason Ekstrand
2022-06-08 22:48                           ` Niranjana Vishwanathapura
2022-06-08 22:48                             ` Niranjana Vishwanathapura
2022-06-09 14:49                           ` Lionel Landwerlin
2022-06-09 19:31                             ` Niranjana Vishwanathapura
2022-06-09 19:31                               ` Niranjana Vishwanathapura
2022-06-10  6:53                               ` Lionel Landwerlin
2022-06-10  6:53                                 ` Lionel Landwerlin
2022-06-10  7:54                                 ` Niranjana Vishwanathapura
2022-06-10  7:54                                   ` Niranjana Vishwanathapura
2022-06-10  8:18                                   ` Lionel Landwerlin
2022-06-10  8:18                                     ` Lionel Landwerlin
2022-06-10 17:42                                     ` Niranjana Vishwanathapura
2022-06-10 17:42                                       ` Niranjana Vishwanathapura
2022-06-13 13:33                                       ` Zeng, Oak
2022-06-13 13:33                                         ` Zeng, Oak
2022-06-13 18:02                                         ` Niranjana Vishwanathapura
2022-06-13 18:02                                           ` Niranjana Vishwanathapura
2022-06-14  7:04                                           ` Lionel Landwerlin
2022-06-14 17:01                                             ` Niranjana Vishwanathapura
2022-06-14 17:01                                               ` Niranjana Vishwanathapura
2022-06-14 21:12                                               ` Zeng, Oak
2022-06-14 21:12                                                 ` Zeng, Oak
2022-06-14 21:47                                                 ` Zeng, Oak
2022-06-14 21:47                                                   ` Zeng, Oak
2022-06-01 21:18     ` Matthew Brost
2022-06-01 21:18       ` Matthew Brost
2022-06-02  5:42       ` Lionel Landwerlin
2022-06-02  5:42         ` Lionel Landwerlin
2022-06-02 16:22         ` Matthew Brost
2022-06-02 16:22           ` Matthew Brost
2022-06-02 20:24           ` Niranjana Vishwanathapura
2022-06-02 20:24             ` Niranjana Vishwanathapura
2022-06-02 20:16         ` Bas Nieuwenhuizen
2022-06-02 20:16           ` Bas Nieuwenhuizen
2022-06-02  2:13   ` Zeng, Oak
2022-06-02  2:13     ` [Intel-gfx] " Zeng, Oak
2022-06-02 20:48     ` Niranjana Vishwanathapura
2022-06-02 20:48       ` [Intel-gfx] " Niranjana Vishwanathapura
2022-06-06 20:45       ` Zeng, Oak
2022-06-06 20:45         ` [Intel-gfx] " Zeng, Oak
2022-05-17 18:32 ` [Intel-gfx] [RFC v3 2/3] drm/i915: Update i915 uapi documentation Niranjana Vishwanathapura
2022-05-17 18:32   ` Niranjana Vishwanathapura
2022-06-08 11:24   ` Matthew Auld
2022-06-08 11:24     ` [Intel-gfx] " Matthew Auld
2022-06-10  1:43     ` Niranjana Vishwanathapura
2022-06-10  1:43       ` [Intel-gfx] " Niranjana Vishwanathapura
2022-05-17 18:32 ` [Intel-gfx] [RFC v3 3/3] drm/doc/rfc: VM_BIND uapi definition Niranjana Vishwanathapura
2022-05-17 18:32   ` Niranjana Vishwanathapura
2022-05-19 23:07   ` [Intel-gfx] " Zanoni, Paulo R
2022-05-23 19:19     ` Niranjana Vishwanathapura
2022-06-01  9:02       ` Dave Airlie
2022-06-01  9:27         ` Daniel Vetter
2022-06-01  9:27           ` Daniel Vetter
2022-06-02  5:08           ` Niranjana Vishwanathapura
2022-06-02  5:08             ` Niranjana Vishwanathapura
2022-06-03  6:53             ` Niranjana Vishwanathapura
2022-06-07 10:42               ` Tvrtko Ursulin
2022-06-07 21:25                 ` Niranjana Vishwanathapura
2022-06-08  7:34                   ` Tvrtko Ursulin
2022-06-08 19:52                     ` Niranjana Vishwanathapura
2022-06-08  6:40               ` Lionel Landwerlin
2022-06-08  6:43                 ` Lionel Landwerlin
2022-06-08  8:36                 ` Tvrtko Ursulin
2022-06-08  8:45                   ` Lionel Landwerlin
2022-06-08  8:54                     ` Tvrtko Ursulin
2022-06-08 20:45                       ` Niranjana Vishwanathapura
2022-06-08 20:45                         ` Niranjana Vishwanathapura
2022-06-15  9:49                         ` Tvrtko Ursulin
2022-06-15  9:49                           ` Tvrtko Ursulin
2022-06-08  7:12               ` Lionel Landwerlin
2022-06-08 21:24                 ` Matthew Brost
2022-06-08 21:24                   ` Matthew Brost
2022-06-07 10:27   ` Tvrtko Ursulin
2022-06-07 19:37     ` Niranjana Vishwanathapura
2022-06-08  7:17       ` Tvrtko Ursulin
2022-06-08  9:12         ` Matthew Auld
2022-06-08 21:32           ` Niranjana Vishwanathapura
2022-06-08 21:32             ` Niranjana Vishwanathapura
2022-06-09  8:36             ` Matthew Auld
2022-06-09  8:36               ` Matthew Auld
2022-06-09 18:53               ` Niranjana Vishwanathapura
2022-06-09 18:53                 ` Niranjana Vishwanathapura
2022-06-10 10:16                 ` Tvrtko Ursulin
2022-06-10 10:32                   ` Matthew Auld
2022-06-10  8:34   ` Matthew Brost
2022-06-10  8:34     ` [Intel-gfx] " Matthew Brost
2022-05-17 20:49 ` [Intel-gfx] ✗ Fi.CI.CHECKPATCH: warning for drm/doc/rfc: i915 VM_BIND feature design + uapi (rev3) Patchwork
2022-05-17 20:49 ` [Intel-gfx] ✗ Fi.CI.SPARSE: " Patchwork
2022-05-17 21:09 ` [Intel-gfx] ✓ Fi.CI.BAT: success " Patchwork
2022-05-18  2:33 ` [Intel-gfx] ✗ Fi.CI.IGT: failure " Patchwork
