All of lore.kernel.org
 help / color / mirror / Atom feed
* [Intel-gfx] [RFC v3 0/3] drm/doc/rfc: i915 VM_BIND feature design + uapi
@ 2022-05-17 18:32 ` Niranjana Vishwanathapura
  0 siblings, 0 replies; 121+ messages in thread
From: Niranjana Vishwanathapura @ 2022-05-17 18:32 UTC (permalink / raw)
  To: intel-gfx, dri-devel, daniel.vetter
  Cc: thomas.hellstrom, chris.p.wilson, christian.koenig

This is the i915 driver VM_BIND feature design RFC patch series along
with the required uapi definition and description of intended use cases.

v2: Updated design and uapi, more documentation.
v3: Add more documentation and proper kernel-doc formatting with cross
    references (including missing i915_drm uapi kernel-docs which are
    required) as per review comments from Daniel.

Signed-off-by: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>

Niranjana Vishwanathapura (3):
  drm/doc/rfc: VM_BIND feature design document
  drm/i915: Update i915 uapi documentation
  drm/doc/rfc: VM_BIND uapi definition

 Documentation/driver-api/dma-buf.rst   |   2 +
 Documentation/gpu/rfc/i915_vm_bind.h   | 399 +++++++++++++++++++++++++
 Documentation/gpu/rfc/i915_vm_bind.rst | 304 +++++++++++++++++++
 Documentation/gpu/rfc/index.rst        |   4 +
 include/uapi/drm/i915_drm.h            | 153 +++++++---
 5 files changed, 825 insertions(+), 37 deletions(-)
 create mode 100644 Documentation/gpu/rfc/i915_vm_bind.h
 create mode 100644 Documentation/gpu/rfc/i915_vm_bind.rst

-- 
2.21.0.rc0.32.g243a4c7e27


^ permalink raw reply	[flat|nested] 121+ messages in thread

* [RFC v3 0/3] drm/doc/rfc: i915 VM_BIND feature design + uapi
@ 2022-05-17 18:32 ` Niranjana Vishwanathapura
  0 siblings, 0 replies; 121+ messages in thread
From: Niranjana Vishwanathapura @ 2022-05-17 18:32 UTC (permalink / raw)
  To: intel-gfx, dri-devel, daniel.vetter
  Cc: matthew.brost, thomas.hellstrom, jason, chris.p.wilson, christian.koenig

This is the i915 driver VM_BIND feature design RFC patch series along
with the required uapi definition and description of intended use cases.

v2: Updated design and uapi, more documentation.
v3: Add more documentation and proper kernel-doc formatting with cross
    references (including missing i915_drm uapi kernel-docs which are
    required) as per review comments from Daniel.

Signed-off-by: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>

Niranjana Vishwanathapura (3):
  drm/doc/rfc: VM_BIND feature design document
  drm/i915: Update i915 uapi documentation
  drm/doc/rfc: VM_BIND uapi definition

 Documentation/driver-api/dma-buf.rst   |   2 +
 Documentation/gpu/rfc/i915_vm_bind.h   | 399 +++++++++++++++++++++++++
 Documentation/gpu/rfc/i915_vm_bind.rst | 304 +++++++++++++++++++
 Documentation/gpu/rfc/index.rst        |   4 +
 include/uapi/drm/i915_drm.h            | 153 +++++++---
 5 files changed, 825 insertions(+), 37 deletions(-)
 create mode 100644 Documentation/gpu/rfc/i915_vm_bind.h
 create mode 100644 Documentation/gpu/rfc/i915_vm_bind.rst

-- 
2.21.0.rc0.32.g243a4c7e27


^ permalink raw reply	[flat|nested] 121+ messages in thread

* [Intel-gfx] [RFC v3 1/3] drm/doc/rfc: VM_BIND feature design document
  2022-05-17 18:32 ` Niranjana Vishwanathapura
@ 2022-05-17 18:32   ` Niranjana Vishwanathapura
  -1 siblings, 0 replies; 121+ messages in thread
From: Niranjana Vishwanathapura @ 2022-05-17 18:32 UTC (permalink / raw)
  To: intel-gfx, dri-devel, daniel.vetter
  Cc: thomas.hellstrom, chris.p.wilson, christian.koenig

VM_BIND design document with description of intended use cases.

v2: Add more documentation and format as per review comments
    from Daniel.

Signed-off-by: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>
---
 Documentation/driver-api/dma-buf.rst   |   2 +
 Documentation/gpu/rfc/i915_vm_bind.rst | 304 +++++++++++++++++++++++++
 Documentation/gpu/rfc/index.rst        |   4 +
 3 files changed, 310 insertions(+)
 create mode 100644 Documentation/gpu/rfc/i915_vm_bind.rst

diff --git a/Documentation/driver-api/dma-buf.rst b/Documentation/driver-api/dma-buf.rst
index 36a76cbe9095..64cb924ec5bb 100644
--- a/Documentation/driver-api/dma-buf.rst
+++ b/Documentation/driver-api/dma-buf.rst
@@ -200,6 +200,8 @@ DMA Fence uABI/Sync File
 .. kernel-doc:: include/linux/sync_file.h
    :internal:
 
+.. _indefinite_dma_fences:
+
 Indefinite DMA Fences
 ~~~~~~~~~~~~~~~~~~~~~
 
diff --git a/Documentation/gpu/rfc/i915_vm_bind.rst b/Documentation/gpu/rfc/i915_vm_bind.rst
new file mode 100644
index 000000000000..f1be560d313c
--- /dev/null
+++ b/Documentation/gpu/rfc/i915_vm_bind.rst
@@ -0,0 +1,304 @@
+==========================================
+I915 VM_BIND feature design and use cases
+==========================================
+
+VM_BIND feature
+================
+DRM_I915_GEM_VM_BIND/UNBIND ioctls allows UMD to bind/unbind GEM buffer
+objects (BOs) or sections of a BOs at specified GPU virtual addresses on a
+specified address space (VM). These mappings (also referred to as persistent
+mappings) will be persistent across multiple GPU submissions (execbuff calls)
+issued by the UMD, without user having to provide a list of all required
+mappings during each submission (as required by older execbuff mode).
+
+VM_BIND/UNBIND ioctls will support 'in' and 'out' fences to allow userpace
+to specify how the binding/unbinding should sync with other operations
+like the GPU job submission. These fences will be timeline 'drm_syncobj's
+for non-Compute contexts (See struct drm_i915_vm_bind_ext_timeline_fences).
+For Compute contexts, they will be user/memory fences (See struct
+drm_i915_vm_bind_ext_user_fence).
+
+VM_BIND feature is advertised to user via I915_PARAM_HAS_VM_BIND.
+User has to opt-in for VM_BIND mode of binding for an address space (VM)
+during VM creation time via I915_VM_CREATE_FLAGS_USE_VM_BIND extension.
+
+VM_BIND/UNBIND ioctl will immediately start binding/unbinding the mapping in an
+async worker. The binding and unbinding will work like a special GPU engine.
+The binding and unbinding operations are serialized and will wait on specified
+input fences before the operation and will signal the output fences upon the
+completion of the operation. Due to serialization, completion of an operation
+will also indicate that all previous operations are also complete.
+
+VM_BIND features include:
+
+* Multiple Virtual Address (VA) mappings can map to the same physical pages
+  of an object (aliasing).
+* VA mapping can map to a partial section of the BO (partial binding).
+* Support capture of persistent mappings in the dump upon GPU error.
+* TLB is flushed upon unbind completion. Batching of TLB flushes in some
+  use cases will be helpful.
+* Asynchronous vm_bind and vm_unbind support with 'in' and 'out' fences.
+* Support for userptr gem objects (no special uapi is required for this).
+
+Execbuff ioctl in VM_BIND mode
+-------------------------------
+The execbuff ioctl handling in VM_BIND mode differs significantly from the
+older method. A VM in VM_BIND mode will not support older execbuff mode of
+binding. In VM_BIND mode, execbuff ioctl will not accept any execlist. Hence,
+no support for implicit sync. It is expected that the below work will be able
+to support requirements of object dependency setting in all use cases:
+
+"dma-buf: Add an API for exporting sync files"
+(https://lwn.net/Articles/859290/)
+
+This also means, we need an execbuff extension to pass in the batch
+buffer addresses (See struct drm_i915_gem_execbuffer_ext_batch_addresses).
+
+If at all execlist support in execbuff ioctl is deemed necessary for
+implicit sync in certain use cases, then support can be added later.
+
+In VM_BIND mode, VA allocation is completely managed by the user instead of
+the i915 driver. Hence all VA assignment, eviction are not applicable in
+VM_BIND mode. Also, for determining object activeness, VM_BIND mode will not
+be using the i915_vma active reference tracking. It will instead use dma-resv
+object for that (See `VM_BIND dma_resv usage`_).
+
+So, a lot of existing code in the execbuff path like relocations, VA evictions,
+vma lookup table, implicit sync, vma active reference tracking etc., are not
+applicable in VM_BIND mode. Hence, the execbuff path needs to be cleaned up
+by clearly separating out the functionalities where the VM_BIND mode differs
+from older method and they should be moved to separate files.
+
+VM_PRIVATE objects
+-------------------
+By default, BOs can be mapped on multiple VMs and can also be dma-buf
+exported. Hence these BOs are referred to as Shared BOs.
+During each execbuff submission, the request fence must be added to the
+dma-resv fence list of all shared BOs mapped on the VM.
+
+VM_BIND feature introduces an optimization where user can create BO which
+is private to a specified VM via I915_GEM_CREATE_EXT_VM_PRIVATE flag during
+BO creation. Unlike Shared BOs, these VM private BOs can only be mapped on
+the VM they are private to and can't be dma-buf exported.
+All private BOs of a VM share the dma-resv object. Hence during each execbuff
+submission, they need only one dma-resv fence list updated. Thus, the fast
+path (where required mappings are already bound) submission latency is O(1)
+w.r.t the number of VM private BOs.
+
+VM_BIND locking hirarchy
+-------------------------
+The locking design here supports the older (execlist based) execbuff mode, the
+newer VM_BIND mode, the VM_BIND mode with GPU page faults and possible future
+system allocator support (See `Shared Virtual Memory (SVM) support`_).
+The older execbuff mode and the newer VM_BIND mode without page faults manages
+residency of backing storage using dma_fence. The VM_BIND mode with page faults
+and the system allocator support do not use any dma_fence at all.
+
+VM_BIND locking order is as below.
+
+1) Lock-A: A vm_bind mutex will protect vm_bind lists. This lock is taken in
+   vm_bind/vm_unbind ioctl calls, in the execbuff path and while releasing the
+   mapping.
+
+   In future, when GPU page faults are supported, we can potentially use a
+   rwsem instead, so that multiple page fault handlers can take the read side
+   lock to lookup the mapping and hence can run in parallel.
+   The older execbuff mode of binding do not need this lock.
+
+2) Lock-B: The object's dma-resv lock will protect i915_vma state and needs to
+   be held while binding/unbinding a vma in the async worker and while updating
+   dma-resv fence list of an object. Note that private BOs of a VM will all
+   share a dma-resv object.
+
+   The future system allocator support will use the HMM prescribed locking
+   instead.
+
+3) Lock-C: Spinlock/s to protect some of the VM's lists like the list of
+   invalidated vmas (due to eviction and userptr invalidation) etc.
+
+When GPU page faults are supported, the execbuff path do not take any of these
+locks. There we will simply smash the new batch buffer address into the ring and
+then tell the scheduler run that. The lock taking only happens from the page
+fault handler, where we take lock-A in read mode, whichever lock-B we need to
+find the backing storage (dma_resv lock for gem objects, and hmm/core mm for
+system allocator) and some additional locks (lock-D) for taking care of page
+table races. Page fault mode should not need to ever manipulate the vm lists,
+so won't ever need lock-C.
+
+VM_BIND LRU handling
+---------------------
+We need to ensure VM_BIND mapped objects are properly LRU tagged to avoid
+performance degradation. We will also need support for bulk LRU movement of
+VM_BIND objects to avoid additional latencies in execbuff path.
+
+The page table pages are similar to VM_BIND mapped objects (See
+`Evictable page table allocations`_) and are maintained per VM and needs to
+be pinned in memory when VM is made active (ie., upon an execbuff call with
+that VM). So, bulk LRU movement of page table pages is also needed.
+
+The i915 shrinker LRU has stopped being an LRU. So, it should also be moved
+over to the ttm LRU in some fashion to make sure we once again have a reasonable
+and consistent memory aging and reclaim architecture.
+
+VM_BIND dma_resv usage
+-----------------------
+Fences needs to be added to all VM_BIND mapped objects. During each execbuff
+submission, they are added with DMA_RESV_USAGE_BOOKKEEP usage to prevent
+over sync (See enum dma_resv_usage). One can override it with either
+DMA_RESV_USAGE_READ or DMA_RESV_USAGE_WRITE usage during object dependency
+setting (either through explicit or implicit mechanism).
+
+When vm_bind is called for a non-private object while the VM is already
+active, the fences need to be copied from VM's shared dma-resv object
+(common to all private objects of the VM) to this non-private object.
+If this results in performance degradation, then some optimization will
+be needed here. This is not a problem for VM's private objects as they use
+shared dma-resv object which is always updated on each execbuff submission.
+
+Also, in VM_BIND mode, use dma-resv apis for determining object activeness
+(See dma_resv_test_signaled() and dma_resv_wait_timeout()) and do not use the
+older i915_vma active reference tracking which is deprecated. This should be
+easier to get it working with the current TTM backend. We can remove the
+i915_vma active reference tracking fully while supporting TTM backend for igfx.
+
+Evictable page table allocations
+---------------------------------
+Make pagetable allocations evictable and manage them similar to VM_BIND
+mapped objects. Page table pages are similar to persistent mappings of a
+VM (difference here are that the page table pages will not have an i915_vma
+structure and after swapping pages back in, parent page link needs to be
+updated).
+
+Mesa use case
+--------------
+VM_BIND can potentially reduce the CPU overhead in Mesa (both Vulkan and Iris),
+hence improving performance of CPU-bound applications. It also allows us to
+implement Vulkan's Sparse Resources. With increasing GPU hardware performance,
+reducing CPU overhead becomes more impactful.
+
+
+VM_BIND Compute support
+========================
+
+User/Memory Fence
+------------------
+The idea is to take a user specified virtual address and install an interrupt
+handler to wake up the current task when the memory location passes the user
+supplied filter. User/Memory fence is a <address, value> pair. To signal the
+user fence, specified value will be written at the specified virtual address
+and wakeup the waiting process. User can wait on a user fence with the
+gem_wait_user_fence ioctl.
+
+It also allows the user to emit their own MI_FLUSH/PIPE_CONTROL notify
+interrupt within their batches after updating the value to have sub-batch
+precision on the wakeup. Each batch can signal a user fence to indicate
+the completion of next level batch. The completion of very first level batch
+needs to be signaled by the command streamer. The user must provide the
+user/memory fence for this via the DRM_I915_GEM_EXECBUFFER_EXT_USER_FENCE
+extension of execbuff ioctl, so that KMD can setup the command streamer to
+signal it.
+
+User/Memory fence can also be supplied to the kernel driver to signal/wake up
+the user process after completion of an asynchronous operation.
+
+When VM_BIND ioctl was provided with a user/memory fence via the
+I915_VM_BIND_EXT_USER_FENCE extension, it will be signaled upon the completion
+of binding of that mapping. All async binds/unbinds are serialized, hence
+signaling of user/memory fence also indicate the completion of all previous
+binds/unbinds.
+
+This feature will be derived from the below original work:
+https://patchwork.freedesktop.org/patch/349417/
+
+Long running Compute contexts
+------------------------------
+Usage of dma-fence expects that they complete in reasonable amount of time.
+Compute on the other hand can be long running. Hence it is appropriate for
+compute to use user/memory fence and dma-fence usage will be limited to
+in-kernel consumption only. This requires an execbuff uapi extension to pass
+in user fence (See struct drm_i915_vm_bind_ext_user_fence). Compute must opt-in
+for this mechanism with I915_CONTEXT_CREATE_FLAGS_LONG_RUNNING flag during
+context creation. The dma-fence based user interfaces like gem_wait ioctl and
+execbuff out fence are not allowed on long running contexts. Implicit sync is
+not valid as well and is anyway not supported in VM_BIND mode.
+
+Where GPU page faults are not available, kernel driver upon buffer invalidation
+will initiate a suspend (preemption) of long running context with a dma-fence
+attached to it. And upon completion of that suspend fence, finish the
+invalidation, revalidate the BO and then resume the compute context. This is
+done by having a per-context preempt fence (also called suspend fence) proxying
+as i915_request fence. This suspend fence is enabled when someone tries to wait
+on it, which then triggers the context preemption.
+
+As this support for context suspension using a preempt fence and the resume work
+for the compute mode contexts can get tricky to get it right, it is better to
+add this support in drm scheduler so that multiple drivers can make use of it.
+That means, it will have a dependency on i915 drm scheduler conversion with GuC
+scheduler backend. This should be fine, as the plan is to support compute mode
+contexts only with GuC scheduler backend (at least initially). This is much
+easier to support with VM_BIND mode compared to the current heavier execbuff
+path resource attachment.
+
+Low Latency Submission
+-----------------------
+Allows compute UMD to directly submit GPU jobs instead of through execbuff
+ioctl. This is made possible by VM_BIND is not being synchronized against
+execbuff. VM_BIND allows bind/unbind of mappings required for the directly
+submitted jobs.
+
+Other VM_BIND use cases
+========================
+
+Debugger
+---------
+With debug event interface user space process (debugger) is able to keep track
+of and act upon resources created by another process (debugged) and attached
+to GPU via vm_bind interface.
+
+GPU page faults
+----------------
+GPU page faults when supported (in future), will only be supported in the
+VM_BIND mode. While both the older execbuff mode and the newer VM_BIND mode of
+binding will require using dma-fence to ensure residency, the GPU page faults
+mode when supported, will not use any dma-fence as residency is purely managed
+by installing and removing/invalidating page table entries.
+
+Page level hints settings
+--------------------------
+VM_BIND allows any hints setting per mapping instead of per BO.
+Possible hints include read-only mapping, placement and atomicity.
+Sub-BO level placement hint will be even more relevant with
+upcoming GPU on-demand page fault support.
+
+Page level Cache/CLOS settings
+-------------------------------
+VM_BIND allows cache/CLOS settings per mapping instead of per BO.
+
+Shared Virtual Memory (SVM) support
+------------------------------------
+VM_BIND interface can be used to map system memory directly (without gem BO
+abstraction) using the HMM interface. SVM is only supported with GPU page
+faults enabled.
+
+
+Broder i915 cleanups
+=====================
+Supporting this whole new vm_bind mode of binding which comes with its own
+use cases to support and the locking requirements requires proper integration
+with the existing i915 driver. This calls for some broader i915 driver
+cleanups/simplifications for maintainability of the driver going forward.
+Here are few things identified and are being looked into.
+
+- Remove vma lookup cache (eb->gem_context->handles_vma). VM_BIND feature
+  do not use it and complexity it brings in is probably more than the
+  performance advantage we get in legacy execbuff case.
+- Remove vma->open_count counting
+- Remove i915_vma active reference tracking. VM_BIND feature will not be using
+  it. Instead use underlying BO's dma-resv fence list to determine if a i915_vma
+  is active or not.
+
+
+VM_BIND UAPI
+=============
+
+.. kernel-doc:: Documentation/gpu/rfc/i915_vm_bind.h
diff --git a/Documentation/gpu/rfc/index.rst b/Documentation/gpu/rfc/index.rst
index 91e93a705230..7d10c36b268d 100644
--- a/Documentation/gpu/rfc/index.rst
+++ b/Documentation/gpu/rfc/index.rst
@@ -23,3 +23,7 @@ host such documentation:
 .. toctree::
 
     i915_scheduler.rst
+
+.. toctree::
+
+    i915_vm_bind.rst
-- 
2.21.0.rc0.32.g243a4c7e27


^ permalink raw reply related	[flat|nested] 121+ messages in thread

* [RFC v3 1/3] drm/doc/rfc: VM_BIND feature design document
@ 2022-05-17 18:32   ` Niranjana Vishwanathapura
  0 siblings, 0 replies; 121+ messages in thread
From: Niranjana Vishwanathapura @ 2022-05-17 18:32 UTC (permalink / raw)
  To: intel-gfx, dri-devel, daniel.vetter
  Cc: matthew.brost, thomas.hellstrom, jason, chris.p.wilson, christian.koenig

VM_BIND design document with description of intended use cases.

v2: Add more documentation and format as per review comments
    from Daniel.

Signed-off-by: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>
---
 Documentation/driver-api/dma-buf.rst   |   2 +
 Documentation/gpu/rfc/i915_vm_bind.rst | 304 +++++++++++++++++++++++++
 Documentation/gpu/rfc/index.rst        |   4 +
 3 files changed, 310 insertions(+)
 create mode 100644 Documentation/gpu/rfc/i915_vm_bind.rst

diff --git a/Documentation/driver-api/dma-buf.rst b/Documentation/driver-api/dma-buf.rst
index 36a76cbe9095..64cb924ec5bb 100644
--- a/Documentation/driver-api/dma-buf.rst
+++ b/Documentation/driver-api/dma-buf.rst
@@ -200,6 +200,8 @@ DMA Fence uABI/Sync File
 .. kernel-doc:: include/linux/sync_file.h
    :internal:
 
+.. _indefinite_dma_fences:
+
 Indefinite DMA Fences
 ~~~~~~~~~~~~~~~~~~~~~
 
diff --git a/Documentation/gpu/rfc/i915_vm_bind.rst b/Documentation/gpu/rfc/i915_vm_bind.rst
new file mode 100644
index 000000000000..f1be560d313c
--- /dev/null
+++ b/Documentation/gpu/rfc/i915_vm_bind.rst
@@ -0,0 +1,304 @@
+==========================================
+I915 VM_BIND feature design and use cases
+==========================================
+
+VM_BIND feature
+================
+DRM_I915_GEM_VM_BIND/UNBIND ioctls allows UMD to bind/unbind GEM buffer
+objects (BOs) or sections of a BOs at specified GPU virtual addresses on a
+specified address space (VM). These mappings (also referred to as persistent
+mappings) will be persistent across multiple GPU submissions (execbuff calls)
+issued by the UMD, without user having to provide a list of all required
+mappings during each submission (as required by older execbuff mode).
+
+VM_BIND/UNBIND ioctls will support 'in' and 'out' fences to allow userpace
+to specify how the binding/unbinding should sync with other operations
+like the GPU job submission. These fences will be timeline 'drm_syncobj's
+for non-Compute contexts (See struct drm_i915_vm_bind_ext_timeline_fences).
+For Compute contexts, they will be user/memory fences (See struct
+drm_i915_vm_bind_ext_user_fence).
+
+VM_BIND feature is advertised to user via I915_PARAM_HAS_VM_BIND.
+User has to opt-in for VM_BIND mode of binding for an address space (VM)
+during VM creation time via I915_VM_CREATE_FLAGS_USE_VM_BIND extension.
+
+VM_BIND/UNBIND ioctl will immediately start binding/unbinding the mapping in an
+async worker. The binding and unbinding will work like a special GPU engine.
+The binding and unbinding operations are serialized and will wait on specified
+input fences before the operation and will signal the output fences upon the
+completion of the operation. Due to serialization, completion of an operation
+will also indicate that all previous operations are also complete.
+
+VM_BIND features include:
+
+* Multiple Virtual Address (VA) mappings can map to the same physical pages
+  of an object (aliasing).
+* VA mapping can map to a partial section of the BO (partial binding).
+* Support capture of persistent mappings in the dump upon GPU error.
+* TLB is flushed upon unbind completion. Batching of TLB flushes in some
+  use cases will be helpful.
+* Asynchronous vm_bind and vm_unbind support with 'in' and 'out' fences.
+* Support for userptr gem objects (no special uapi is required for this).
+
+Execbuff ioctl in VM_BIND mode
+-------------------------------
+The execbuff ioctl handling in VM_BIND mode differs significantly from the
+older method. A VM in VM_BIND mode will not support older execbuff mode of
+binding. In VM_BIND mode, execbuff ioctl will not accept any execlist. Hence,
+no support for implicit sync. It is expected that the below work will be able
+to support requirements of object dependency setting in all use cases:
+
+"dma-buf: Add an API for exporting sync files"
+(https://lwn.net/Articles/859290/)
+
+This also means, we need an execbuff extension to pass in the batch
+buffer addresses (See struct drm_i915_gem_execbuffer_ext_batch_addresses).
+
+If at all execlist support in execbuff ioctl is deemed necessary for
+implicit sync in certain use cases, then support can be added later.
+
+In VM_BIND mode, VA allocation is completely managed by the user instead of
+the i915 driver. Hence all VA assignment, eviction are not applicable in
+VM_BIND mode. Also, for determining object activeness, VM_BIND mode will not
+be using the i915_vma active reference tracking. It will instead use dma-resv
+object for that (See `VM_BIND dma_resv usage`_).
+
+So, a lot of existing code in the execbuff path like relocations, VA evictions,
+vma lookup table, implicit sync, vma active reference tracking etc., are not
+applicable in VM_BIND mode. Hence, the execbuff path needs to be cleaned up
+by clearly separating out the functionalities where the VM_BIND mode differs
+from older method and they should be moved to separate files.
+
+VM_PRIVATE objects
+-------------------
+By default, BOs can be mapped on multiple VMs and can also be dma-buf
+exported. Hence these BOs are referred to as Shared BOs.
+During each execbuff submission, the request fence must be added to the
+dma-resv fence list of all shared BOs mapped on the VM.
+
+VM_BIND feature introduces an optimization where user can create BO which
+is private to a specified VM via I915_GEM_CREATE_EXT_VM_PRIVATE flag during
+BO creation. Unlike Shared BOs, these VM private BOs can only be mapped on
+the VM they are private to and can't be dma-buf exported.
+All private BOs of a VM share the dma-resv object. Hence during each execbuff
+submission, they need only one dma-resv fence list updated. Thus, the fast
+path (where required mappings are already bound) submission latency is O(1)
+w.r.t the number of VM private BOs.
+
+VM_BIND locking hirarchy
+-------------------------
+The locking design here supports the older (execlist based) execbuff mode, the
+newer VM_BIND mode, the VM_BIND mode with GPU page faults and possible future
+system allocator support (See `Shared Virtual Memory (SVM) support`_).
+The older execbuff mode and the newer VM_BIND mode without page faults manages
+residency of backing storage using dma_fence. The VM_BIND mode with page faults
+and the system allocator support do not use any dma_fence at all.
+
+VM_BIND locking order is as below.
+
+1) Lock-A: A vm_bind mutex will protect vm_bind lists. This lock is taken in
+   vm_bind/vm_unbind ioctl calls, in the execbuff path and while releasing the
+   mapping.
+
+   In future, when GPU page faults are supported, we can potentially use a
+   rwsem instead, so that multiple page fault handlers can take the read side
+   lock to lookup the mapping and hence can run in parallel.
+   The older execbuff mode of binding do not need this lock.
+
+2) Lock-B: The object's dma-resv lock will protect i915_vma state and needs to
+   be held while binding/unbinding a vma in the async worker and while updating
+   dma-resv fence list of an object. Note that private BOs of a VM will all
+   share a dma-resv object.
+
+   The future system allocator support will use the HMM prescribed locking
+   instead.
+
+3) Lock-C: Spinlock/s to protect some of the VM's lists like the list of
+   invalidated vmas (due to eviction and userptr invalidation) etc.
+
+When GPU page faults are supported, the execbuff path do not take any of these
+locks. There we will simply smash the new batch buffer address into the ring and
+then tell the scheduler run that. The lock taking only happens from the page
+fault handler, where we take lock-A in read mode, whichever lock-B we need to
+find the backing storage (dma_resv lock for gem objects, and hmm/core mm for
+system allocator) and some additional locks (lock-D) for taking care of page
+table races. Page fault mode should not need to ever manipulate the vm lists,
+so won't ever need lock-C.
+
+VM_BIND LRU handling
+---------------------
+We need to ensure VM_BIND mapped objects are properly LRU tagged to avoid
+performance degradation. We will also need support for bulk LRU movement of
+VM_BIND objects to avoid additional latencies in execbuff path.
+
+The page table pages are similar to VM_BIND mapped objects (See
+`Evictable page table allocations`_) and are maintained per VM and needs to
+be pinned in memory when VM is made active (ie., upon an execbuff call with
+that VM). So, bulk LRU movement of page table pages is also needed.
+
+The i915 shrinker LRU has stopped being an LRU. So, it should also be moved
+over to the ttm LRU in some fashion to make sure we once again have a reasonable
+and consistent memory aging and reclaim architecture.
+
+VM_BIND dma_resv usage
+-----------------------
+Fences needs to be added to all VM_BIND mapped objects. During each execbuff
+submission, they are added with DMA_RESV_USAGE_BOOKKEEP usage to prevent
+over sync (See enum dma_resv_usage). One can override it with either
+DMA_RESV_USAGE_READ or DMA_RESV_USAGE_WRITE usage during object dependency
+setting (either through explicit or implicit mechanism).
+
+When vm_bind is called for a non-private object while the VM is already
+active, the fences need to be copied from VM's shared dma-resv object
+(common to all private objects of the VM) to this non-private object.
+If this results in performance degradation, then some optimization will
+be needed here. This is not a problem for VM's private objects as they use
+shared dma-resv object which is always updated on each execbuff submission.
+
+Also, in VM_BIND mode, use dma-resv apis for determining object activeness
+(See dma_resv_test_signaled() and dma_resv_wait_timeout()) and do not use the
+older i915_vma active reference tracking which is deprecated. This should be
+easier to get it working with the current TTM backend. We can remove the
+i915_vma active reference tracking fully while supporting TTM backend for igfx.
+
+Evictable page table allocations
+---------------------------------
+Make pagetable allocations evictable and manage them similar to VM_BIND
+mapped objects. Page table pages are similar to persistent mappings of a
+VM (difference here are that the page table pages will not have an i915_vma
+structure and after swapping pages back in, parent page link needs to be
+updated).
+
+Mesa use case
+--------------
+VM_BIND can potentially reduce the CPU overhead in Mesa (both Vulkan and Iris),
+hence improving performance of CPU-bound applications. It also allows us to
+implement Vulkan's Sparse Resources. With increasing GPU hardware performance,
+reducing CPU overhead becomes more impactful.
+
+
+VM_BIND Compute support
+========================
+
+User/Memory Fence
+------------------
+The idea is to take a user specified virtual address and install an interrupt
+handler to wake up the current task when the memory location passes the user
+supplied filter. User/Memory fence is a <address, value> pair. To signal the
+user fence, specified value will be written at the specified virtual address
+and wakeup the waiting process. User can wait on a user fence with the
+gem_wait_user_fence ioctl.
+
+It also allows the user to emit their own MI_FLUSH/PIPE_CONTROL notify
+interrupt within their batches after updating the value to have sub-batch
+precision on the wakeup. Each batch can signal a user fence to indicate
+the completion of next level batch. The completion of very first level batch
+needs to be signaled by the command streamer. The user must provide the
+user/memory fence for this via the DRM_I915_GEM_EXECBUFFER_EXT_USER_FENCE
+extension of execbuff ioctl, so that KMD can setup the command streamer to
+signal it.
+
+User/Memory fence can also be supplied to the kernel driver to signal/wake up
+the user process after completion of an asynchronous operation.
+
+When VM_BIND ioctl was provided with a user/memory fence via the
+I915_VM_BIND_EXT_USER_FENCE extension, it will be signaled upon the completion
+of binding of that mapping. All async binds/unbinds are serialized, hence
+signaling of user/memory fence also indicate the completion of all previous
+binds/unbinds.
+
+This feature will be derived from the below original work:
+https://patchwork.freedesktop.org/patch/349417/
+
+Long running Compute contexts
+------------------------------
+Usage of dma-fence expects that they complete in reasonable amount of time.
+Compute on the other hand can be long running. Hence it is appropriate for
+compute to use user/memory fence and dma-fence usage will be limited to
+in-kernel consumption only. This requires an execbuff uapi extension to pass
+in user fence (See struct drm_i915_vm_bind_ext_user_fence). Compute must opt-in
+for this mechanism with I915_CONTEXT_CREATE_FLAGS_LONG_RUNNING flag during
+context creation. The dma-fence based user interfaces like gem_wait ioctl and
+execbuff out fence are not allowed on long running contexts. Implicit sync is
+not valid as well and is anyway not supported in VM_BIND mode.
+
+Where GPU page faults are not available, kernel driver upon buffer invalidation
+will initiate a suspend (preemption) of long running context with a dma-fence
+attached to it. And upon completion of that suspend fence, finish the
+invalidation, revalidate the BO and then resume the compute context. This is
+done by having a per-context preempt fence (also called suspend fence) proxying
+as i915_request fence. This suspend fence is enabled when someone tries to wait
+on it, which then triggers the context preemption.
+
+As this support for context suspension using a preempt fence and the resume work
+for the compute mode contexts can get tricky to get it right, it is better to
+add this support in drm scheduler so that multiple drivers can make use of it.
+That means, it will have a dependency on i915 drm scheduler conversion with GuC
+scheduler backend. This should be fine, as the plan is to support compute mode
+contexts only with GuC scheduler backend (at least initially). This is much
+easier to support with VM_BIND mode compared to the current heavier execbuff
+path resource attachment.
+
+Low Latency Submission
+-----------------------
+Allows compute UMD to directly submit GPU jobs instead of through execbuff
+ioctl. This is made possible by VM_BIND is not being synchronized against
+execbuff. VM_BIND allows bind/unbind of mappings required for the directly
+submitted jobs.
+
+Other VM_BIND use cases
+========================
+
+Debugger
+---------
+With debug event interface user space process (debugger) is able to keep track
+of and act upon resources created by another process (debugged) and attached
+to GPU via vm_bind interface.
+
+GPU page faults
+----------------
+GPU page faults when supported (in future), will only be supported in the
+VM_BIND mode. While both the older execbuff mode and the newer VM_BIND mode of
+binding will require using dma-fence to ensure residency, the GPU page faults
+mode when supported, will not use any dma-fence as residency is purely managed
+by installing and removing/invalidating page table entries.
+
+Page level hints settings
+--------------------------
+VM_BIND allows any hints setting per mapping instead of per BO.
+Possible hints include read-only mapping, placement and atomicity.
+Sub-BO level placement hint will be even more relevant with
+upcoming GPU on-demand page fault support.
+
+Page level Cache/CLOS settings
+-------------------------------
+VM_BIND allows cache/CLOS settings per mapping instead of per BO.
+
+Shared Virtual Memory (SVM) support
+------------------------------------
+VM_BIND interface can be used to map system memory directly (without gem BO
+abstraction) using the HMM interface. SVM is only supported with GPU page
+faults enabled.
+
+
+Broder i915 cleanups
+=====================
+Supporting this whole new vm_bind mode of binding which comes with its own
+use cases to support and the locking requirements requires proper integration
+with the existing i915 driver. This calls for some broader i915 driver
+cleanups/simplifications for maintainability of the driver going forward.
+Here are few things identified and are being looked into.
+
+- Remove vma lookup cache (eb->gem_context->handles_vma). VM_BIND feature
+  do not use it and complexity it brings in is probably more than the
+  performance advantage we get in legacy execbuff case.
+- Remove vma->open_count counting
+- Remove i915_vma active reference tracking. VM_BIND feature will not be using
+  it. Instead use underlying BO's dma-resv fence list to determine if a i915_vma
+  is active or not.
+
+
+VM_BIND UAPI
+=============
+
+.. kernel-doc:: Documentation/gpu/rfc/i915_vm_bind.h
diff --git a/Documentation/gpu/rfc/index.rst b/Documentation/gpu/rfc/index.rst
index 91e93a705230..7d10c36b268d 100644
--- a/Documentation/gpu/rfc/index.rst
+++ b/Documentation/gpu/rfc/index.rst
@@ -23,3 +23,7 @@ host such documentation:
 .. toctree::
 
     i915_scheduler.rst
+
+.. toctree::
+
+    i915_vm_bind.rst
-- 
2.21.0.rc0.32.g243a4c7e27


^ permalink raw reply related	[flat|nested] 121+ messages in thread

* [Intel-gfx] [RFC v3 2/3] drm/i915: Update i915 uapi documentation
  2022-05-17 18:32 ` Niranjana Vishwanathapura
@ 2022-05-17 18:32   ` Niranjana Vishwanathapura
  -1 siblings, 0 replies; 121+ messages in thread
From: Niranjana Vishwanathapura @ 2022-05-17 18:32 UTC (permalink / raw)
  To: intel-gfx, dri-devel, daniel.vetter
  Cc: thomas.hellstrom, chris.p.wilson, christian.koenig

Add some missing i915 upai documentation which the new
i915 VM_BIND feature documentation will be refer to.

Signed-off-by: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>
---
 include/uapi/drm/i915_drm.h | 153 +++++++++++++++++++++++++++---------
 1 file changed, 116 insertions(+), 37 deletions(-)

diff --git a/include/uapi/drm/i915_drm.h b/include/uapi/drm/i915_drm.h
index a2def7b27009..8c834a31b56f 100644
--- a/include/uapi/drm/i915_drm.h
+++ b/include/uapi/drm/i915_drm.h
@@ -751,9 +751,16 @@ typedef struct drm_i915_irq_wait {
 
 /* Must be kept compact -- no holes and well documented */
 
+/**
+ * typedef drm_i915_getparam_t - Driver parameter query structure.
+ */
 typedef struct drm_i915_getparam {
+	/** @param: Driver parameter to query. */
 	__s32 param;
-	/*
+
+	/**
+	 * @value: Address of memory where queried value should be put.
+	 *
 	 * WARNING: Using pointers instead of fixed-size u64 means we need to write
 	 * compat32 code. Don't repeat this mistake.
 	 */
@@ -1239,76 +1246,114 @@ struct drm_i915_gem_exec_object2 {
 	__u64 rsvd2;
 };
 
+/**
+ * struct drm_i915_gem_exec_fence - An input or output fence for the execbuff
+ * ioctl.
+ *
+ * The request will wait for input fence to signal before submission.
+ *
+ * The returned output fence will be signaled after the completion of the
+ * request.
+ */
 struct drm_i915_gem_exec_fence {
-	/**
-	 * User's handle for a drm_syncobj to wait on or signal.
-	 */
+	/** @handle: User's handle for a drm_syncobj to wait on or signal. */
 	__u32 handle;
 
+	/**
+	 * @flags: Supported flags are,
+	 *
+	 * I915_EXEC_FENCE_WAIT:
+	 * Wait for the input fence before request submission.
+	 *
+	 * I915_EXEC_FENCE_SIGNAL:
+	 * Return request completion fence as output
+	 */
+	__u32 flags;
 #define I915_EXEC_FENCE_WAIT            (1<<0)
 #define I915_EXEC_FENCE_SIGNAL          (1<<1)
 #define __I915_EXEC_FENCE_UNKNOWN_FLAGS (-(I915_EXEC_FENCE_SIGNAL << 1))
-	__u32 flags;
 };
 
-/*
- * See drm_i915_gem_execbuffer_ext_timeline_fences.
- */
-#define DRM_I915_GEM_EXECBUFFER_EXT_TIMELINE_FENCES 0
-
-/*
+/**
+ * struct drm_i915_gem_execbuffer_ext_timeline_fences - Timeline fences
+ * for execbuff.
+ *
  * This structure describes an array of drm_syncobj and associated points for
  * timeline variants of drm_syncobj. It is invalid to append this structure to
  * the execbuf if I915_EXEC_FENCE_ARRAY is set.
  */
 struct drm_i915_gem_execbuffer_ext_timeline_fences {
+#define DRM_I915_GEM_EXECBUFFER_EXT_TIMELINE_FENCES 0
+	/** @base: Extension link. See struct i915_user_extension. */
 	struct i915_user_extension base;
 
 	/**
-	 * Number of element in the handles_ptr & value_ptr arrays.
+	 * @fence_count: Number of element in the @handles_ptr & @value_ptr
+	 * arrays.
 	 */
 	__u64 fence_count;
 
 	/**
-	 * Pointer to an array of struct drm_i915_gem_exec_fence of length
-	 * fence_count.
+	 * @handles_ptr: Pointer to an array of struct drm_i915_gem_exec_fence
+	 * of length @fence_count.
 	 */
 	__u64 handles_ptr;
 
 	/**
-	 * Pointer to an array of u64 values of length fence_count. Values
-	 * must be 0 for a binary drm_syncobj. A Value of 0 for a timeline
-	 * drm_syncobj is invalid as it turns a drm_syncobj into a binary one.
+	 * @values_ptr: Pointer to an array of u64 values of length
+	 * @fence_count.
+	 * Values must be 0 for a binary drm_syncobj. A Value of 0 for a
+	 * timeline drm_syncobj is invalid as it turns a drm_syncobj into a
+	 * binary one.
 	 */
 	__u64 values_ptr;
 };
 
+/**
+ * struct drm_i915_gem_execbuffer2 - Structure for execbuff submission
+ */
 struct drm_i915_gem_execbuffer2 {
-	/**
-	 * List of gem_exec_object2 structs
-	 */
+	/** @buffers_ptr: Pointer to a list of gem_exec_object2 structs */
 	__u64 buffers_ptr;
+
+	/** @buffer_count: Number of elements in @buffers_ptr array */
 	__u32 buffer_count;
 
-	/** Offset in the batchbuffer to start execution from. */
+	/**
+	 * @batch_start_offset: Offset in the batchbuffer to start execution
+	 * from.
+	 */
 	__u32 batch_start_offset;
-	/** Bytes used in batchbuffer from batch_start_offset */
+
+	/** @batch_len: Bytes used in batchbuffer from batch_start_offset */
 	__u32 batch_len;
+
+	/** @DR1: deprecated */
 	__u32 DR1;
+
+	/** @DR4: deprecated */
 	__u32 DR4;
+
+	/** @num_cliprects: See @cliprects_ptr */
 	__u32 num_cliprects;
+
 	/**
-	 * This is a struct drm_clip_rect *cliprects if I915_EXEC_FENCE_ARRAY
-	 * & I915_EXEC_USE_EXTENSIONS are not set.
+	 * @cliprects_ptr: Kernel clipping was a DRI1 misfeature.
+	 *
+	 * It is invalid to use this field if I915_EXEC_FENCE_ARRAY or
+	 * I915_EXEC_USE_EXTENSIONS flags are not set.
 	 *
 	 * If I915_EXEC_FENCE_ARRAY is set, then this is a pointer to an array
-	 * of struct drm_i915_gem_exec_fence and num_cliprects is the length
-	 * of the array.
+	 * of &drm_i915_gem_exec_fence and @num_cliprects is the length of the
+	 * array.
 	 *
 	 * If I915_EXEC_USE_EXTENSIONS is set, then this is a pointer to a
-	 * single struct i915_user_extension and num_cliprects is 0.
+	 * single &i915_user_extension and num_cliprects is 0.
 	 */
 	__u64 cliprects_ptr;
+
+	/** @flags: Execbuff flags */
+	__u64 flags;
 #define I915_EXEC_RING_MASK              (0x3f)
 #define I915_EXEC_DEFAULT                (0<<0)
 #define I915_EXEC_RENDER                 (1<<0)
@@ -1326,10 +1371,6 @@ struct drm_i915_gem_execbuffer2 {
 #define I915_EXEC_CONSTANTS_REL_GENERAL (0<<6) /* default */
 #define I915_EXEC_CONSTANTS_ABSOLUTE 	(1<<6)
 #define I915_EXEC_CONSTANTS_REL_SURFACE (2<<6) /* gen4/5 only */
-	__u64 flags;
-	__u64 rsvd1; /* now used for context info */
-	__u64 rsvd2;
-};
 
 /** Resets the SO write offset registers for transform feedback on gen7. */
 #define I915_EXEC_GEN7_SOL_RESET	(1<<8)
@@ -1432,9 +1473,23 @@ struct drm_i915_gem_execbuffer2 {
  * drm_i915_gem_execbuffer_ext enum.
  */
 #define I915_EXEC_USE_EXTENSIONS	(1 << 21)
-
 #define __I915_EXEC_UNKNOWN_FLAGS (-(I915_EXEC_USE_EXTENSIONS << 1))
 
+	/** @rsvd1: Context id */
+	__u64 rsvd1;
+
+	/**
+	 * @rsvd2: in and out sync_file file descriptors.
+	 *
+	 * When I915_EXEC_FENCE_IN or I915_EXEC_FENCE_SUBMIT flag is set, the
+	 * lower 32 bits of this field will have the in sync_file fd (input).
+	 *
+	 * When I915_EXEC_FENCE_OUT flag is set, the upper 32 bits of this
+	 * field will have the out sync_file fd (output).
+	 */
+	__u64 rsvd2;
+};
+
 #define I915_EXEC_CONTEXT_ID_MASK	(0xffffffff)
 #define i915_execbuffer2_set_context_id(eb2, context) \
 	(eb2).rsvd1 = context & I915_EXEC_CONTEXT_ID_MASK
@@ -1814,13 +1869,32 @@ struct drm_i915_gem_context_create {
 	__u32 pad;
 };
 
+/**
+ * struct drm_i915_gem_context_create_ext - Structure for creating contexts.
+ */
 struct drm_i915_gem_context_create_ext {
-	__u32 ctx_id; /* output: id of new context*/
+	/** @ctx_id: Id of the created context (output) */
+	__u32 ctx_id;
+
+	/**
+	 * @flags: Supported flags are,
+	 *
+	 * I915_CONTEXT_CREATE_FLAGS_USE_EXTENSIONS:
+	 *
+	 * Extensions may be appended to this structure and driver must check
+	 * for those.
+	 *
+	 * I915_CONTEXT_CREATE_FLAGS_SINGLE_TIMELINE
+	 *
+	 * Created context will have single timeline.
+	 */
 	__u32 flags;
 #define I915_CONTEXT_CREATE_FLAGS_USE_EXTENSIONS	(1u << 0)
 #define I915_CONTEXT_CREATE_FLAGS_SINGLE_TIMELINE	(1u << 1)
 #define I915_CONTEXT_CREATE_FLAGS_UNKNOWN \
 	(-(I915_CONTEXT_CREATE_FLAGS_SINGLE_TIMELINE << 1))
+
+	/** @extensions: Zero-terminated chain of extensions. */
 	__u64 extensions;
 };
 
@@ -2387,7 +2461,9 @@ struct drm_i915_gem_context_destroy {
 	__u32 pad;
 };
 
-/*
+/**
+ * struct drm_i915_gem_vm_control - Structure to create or destroy VM.
+ *
  * DRM_I915_GEM_VM_CREATE -
  *
  * Create a new virtual memory address space (ppGTT) for use within a context
@@ -2397,20 +2473,23 @@ struct drm_i915_gem_context_destroy {
  * The id of new VM (bound to the fd) for use with I915_CONTEXT_PARAM_VM is
  * returned in the outparam @id.
  *
- * No flags are defined, with all bits reserved and must be zero.
- *
  * An extension chain maybe provided, starting with @extensions, and terminated
  * by the @next_extension being 0. Currently, no extensions are defined.
  *
  * DRM_I915_GEM_VM_DESTROY -
  *
- * Destroys a previously created VM id, specified in @id.
+ * Destroys a previously created VM id, specified in @vm_id.
  *
  * No extensions or flags are allowed currently, and so must be zero.
  */
 struct drm_i915_gem_vm_control {
+	/** @extensions: Zero-terminated chain of extensions. */
 	__u64 extensions;
+
+	/** @flags: reserved for future usage, currently MBZ */
 	__u32 flags;
+
+	/** @vm_id: Id of the VM created or to be destroyed */
 	__u32 vm_id;
 };
 
-- 
2.21.0.rc0.32.g243a4c7e27


^ permalink raw reply related	[flat|nested] 121+ messages in thread

* [RFC v3 2/3] drm/i915: Update i915 uapi documentation
@ 2022-05-17 18:32   ` Niranjana Vishwanathapura
  0 siblings, 0 replies; 121+ messages in thread
From: Niranjana Vishwanathapura @ 2022-05-17 18:32 UTC (permalink / raw)
  To: intel-gfx, dri-devel, daniel.vetter
  Cc: matthew.brost, thomas.hellstrom, jason, chris.p.wilson, christian.koenig

Add some missing i915 upai documentation which the new
i915 VM_BIND feature documentation will be refer to.

Signed-off-by: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>
---
 include/uapi/drm/i915_drm.h | 153 +++++++++++++++++++++++++++---------
 1 file changed, 116 insertions(+), 37 deletions(-)

diff --git a/include/uapi/drm/i915_drm.h b/include/uapi/drm/i915_drm.h
index a2def7b27009..8c834a31b56f 100644
--- a/include/uapi/drm/i915_drm.h
+++ b/include/uapi/drm/i915_drm.h
@@ -751,9 +751,16 @@ typedef struct drm_i915_irq_wait {
 
 /* Must be kept compact -- no holes and well documented */
 
+/**
+ * typedef drm_i915_getparam_t - Driver parameter query structure.
+ */
 typedef struct drm_i915_getparam {
+	/** @param: Driver parameter to query. */
 	__s32 param;
-	/*
+
+	/**
+	 * @value: Address of memory where queried value should be put.
+	 *
 	 * WARNING: Using pointers instead of fixed-size u64 means we need to write
 	 * compat32 code. Don't repeat this mistake.
 	 */
@@ -1239,76 +1246,114 @@ struct drm_i915_gem_exec_object2 {
 	__u64 rsvd2;
 };
 
+/**
+ * struct drm_i915_gem_exec_fence - An input or output fence for the execbuff
+ * ioctl.
+ *
+ * The request will wait for input fence to signal before submission.
+ *
+ * The returned output fence will be signaled after the completion of the
+ * request.
+ */
 struct drm_i915_gem_exec_fence {
-	/**
-	 * User's handle for a drm_syncobj to wait on or signal.
-	 */
+	/** @handle: User's handle for a drm_syncobj to wait on or signal. */
 	__u32 handle;
 
+	/**
+	 * @flags: Supported flags are,
+	 *
+	 * I915_EXEC_FENCE_WAIT:
+	 * Wait for the input fence before request submission.
+	 *
+	 * I915_EXEC_FENCE_SIGNAL:
+	 * Return request completion fence as output
+	 */
+	__u32 flags;
 #define I915_EXEC_FENCE_WAIT            (1<<0)
 #define I915_EXEC_FENCE_SIGNAL          (1<<1)
 #define __I915_EXEC_FENCE_UNKNOWN_FLAGS (-(I915_EXEC_FENCE_SIGNAL << 1))
-	__u32 flags;
 };
 
-/*
- * See drm_i915_gem_execbuffer_ext_timeline_fences.
- */
-#define DRM_I915_GEM_EXECBUFFER_EXT_TIMELINE_FENCES 0
-
-/*
+/**
+ * struct drm_i915_gem_execbuffer_ext_timeline_fences - Timeline fences
+ * for execbuff.
+ *
  * This structure describes an array of drm_syncobj and associated points for
  * timeline variants of drm_syncobj. It is invalid to append this structure to
  * the execbuf if I915_EXEC_FENCE_ARRAY is set.
  */
 struct drm_i915_gem_execbuffer_ext_timeline_fences {
+#define DRM_I915_GEM_EXECBUFFER_EXT_TIMELINE_FENCES 0
+	/** @base: Extension link. See struct i915_user_extension. */
 	struct i915_user_extension base;
 
 	/**
-	 * Number of element in the handles_ptr & value_ptr arrays.
+	 * @fence_count: Number of element in the @handles_ptr & @value_ptr
+	 * arrays.
 	 */
 	__u64 fence_count;
 
 	/**
-	 * Pointer to an array of struct drm_i915_gem_exec_fence of length
-	 * fence_count.
+	 * @handles_ptr: Pointer to an array of struct drm_i915_gem_exec_fence
+	 * of length @fence_count.
 	 */
 	__u64 handles_ptr;
 
 	/**
-	 * Pointer to an array of u64 values of length fence_count. Values
-	 * must be 0 for a binary drm_syncobj. A Value of 0 for a timeline
-	 * drm_syncobj is invalid as it turns a drm_syncobj into a binary one.
+	 * @values_ptr: Pointer to an array of u64 values of length
+	 * @fence_count.
+	 * Values must be 0 for a binary drm_syncobj. A Value of 0 for a
+	 * timeline drm_syncobj is invalid as it turns a drm_syncobj into a
+	 * binary one.
 	 */
 	__u64 values_ptr;
 };
 
+/**
+ * struct drm_i915_gem_execbuffer2 - Structure for execbuff submission
+ */
 struct drm_i915_gem_execbuffer2 {
-	/**
-	 * List of gem_exec_object2 structs
-	 */
+	/** @buffers_ptr: Pointer to a list of gem_exec_object2 structs */
 	__u64 buffers_ptr;
+
+	/** @buffer_count: Number of elements in @buffers_ptr array */
 	__u32 buffer_count;
 
-	/** Offset in the batchbuffer to start execution from. */
+	/**
+	 * @batch_start_offset: Offset in the batchbuffer to start execution
+	 * from.
+	 */
 	__u32 batch_start_offset;
-	/** Bytes used in batchbuffer from batch_start_offset */
+
+	/** @batch_len: Bytes used in batchbuffer from batch_start_offset */
 	__u32 batch_len;
+
+	/** @DR1: deprecated */
 	__u32 DR1;
+
+	/** @DR4: deprecated */
 	__u32 DR4;
+
+	/** @num_cliprects: See @cliprects_ptr */
 	__u32 num_cliprects;
+
 	/**
-	 * This is a struct drm_clip_rect *cliprects if I915_EXEC_FENCE_ARRAY
-	 * & I915_EXEC_USE_EXTENSIONS are not set.
+	 * @cliprects_ptr: Kernel clipping was a DRI1 misfeature.
+	 *
+	 * It is invalid to use this field if I915_EXEC_FENCE_ARRAY or
+	 * I915_EXEC_USE_EXTENSIONS flags are not set.
 	 *
 	 * If I915_EXEC_FENCE_ARRAY is set, then this is a pointer to an array
-	 * of struct drm_i915_gem_exec_fence and num_cliprects is the length
-	 * of the array.
+	 * of &drm_i915_gem_exec_fence and @num_cliprects is the length of the
+	 * array.
 	 *
 	 * If I915_EXEC_USE_EXTENSIONS is set, then this is a pointer to a
-	 * single struct i915_user_extension and num_cliprects is 0.
+	 * single &i915_user_extension and num_cliprects is 0.
 	 */
 	__u64 cliprects_ptr;
+
+	/** @flags: Execbuff flags */
+	__u64 flags;
 #define I915_EXEC_RING_MASK              (0x3f)
 #define I915_EXEC_DEFAULT                (0<<0)
 #define I915_EXEC_RENDER                 (1<<0)
@@ -1326,10 +1371,6 @@ struct drm_i915_gem_execbuffer2 {
 #define I915_EXEC_CONSTANTS_REL_GENERAL (0<<6) /* default */
 #define I915_EXEC_CONSTANTS_ABSOLUTE 	(1<<6)
 #define I915_EXEC_CONSTANTS_REL_SURFACE (2<<6) /* gen4/5 only */
-	__u64 flags;
-	__u64 rsvd1; /* now used for context info */
-	__u64 rsvd2;
-};
 
 /** Resets the SO write offset registers for transform feedback on gen7. */
 #define I915_EXEC_GEN7_SOL_RESET	(1<<8)
@@ -1432,9 +1473,23 @@ struct drm_i915_gem_execbuffer2 {
  * drm_i915_gem_execbuffer_ext enum.
  */
 #define I915_EXEC_USE_EXTENSIONS	(1 << 21)
-
 #define __I915_EXEC_UNKNOWN_FLAGS (-(I915_EXEC_USE_EXTENSIONS << 1))
 
+	/** @rsvd1: Context id */
+	__u64 rsvd1;
+
+	/**
+	 * @rsvd2: in and out sync_file file descriptors.
+	 *
+	 * When I915_EXEC_FENCE_IN or I915_EXEC_FENCE_SUBMIT flag is set, the
+	 * lower 32 bits of this field will have the in sync_file fd (input).
+	 *
+	 * When I915_EXEC_FENCE_OUT flag is set, the upper 32 bits of this
+	 * field will have the out sync_file fd (output).
+	 */
+	__u64 rsvd2;
+};
+
 #define I915_EXEC_CONTEXT_ID_MASK	(0xffffffff)
 #define i915_execbuffer2_set_context_id(eb2, context) \
 	(eb2).rsvd1 = context & I915_EXEC_CONTEXT_ID_MASK
@@ -1814,13 +1869,32 @@ struct drm_i915_gem_context_create {
 	__u32 pad;
 };
 
+/**
+ * struct drm_i915_gem_context_create_ext - Structure for creating contexts.
+ */
 struct drm_i915_gem_context_create_ext {
-	__u32 ctx_id; /* output: id of new context*/
+	/** @ctx_id: Id of the created context (output) */
+	__u32 ctx_id;
+
+	/**
+	 * @flags: Supported flags are,
+	 *
+	 * I915_CONTEXT_CREATE_FLAGS_USE_EXTENSIONS:
+	 *
+	 * Extensions may be appended to this structure and driver must check
+	 * for those.
+	 *
+	 * I915_CONTEXT_CREATE_FLAGS_SINGLE_TIMELINE
+	 *
+	 * Created context will have single timeline.
+	 */
 	__u32 flags;
 #define I915_CONTEXT_CREATE_FLAGS_USE_EXTENSIONS	(1u << 0)
 #define I915_CONTEXT_CREATE_FLAGS_SINGLE_TIMELINE	(1u << 1)
 #define I915_CONTEXT_CREATE_FLAGS_UNKNOWN \
 	(-(I915_CONTEXT_CREATE_FLAGS_SINGLE_TIMELINE << 1))
+
+	/** @extensions: Zero-terminated chain of extensions. */
 	__u64 extensions;
 };
 
@@ -2387,7 +2461,9 @@ struct drm_i915_gem_context_destroy {
 	__u32 pad;
 };
 
-/*
+/**
+ * struct drm_i915_gem_vm_control - Structure to create or destroy VM.
+ *
  * DRM_I915_GEM_VM_CREATE -
  *
  * Create a new virtual memory address space (ppGTT) for use within a context
@@ -2397,20 +2473,23 @@ struct drm_i915_gem_context_destroy {
  * The id of new VM (bound to the fd) for use with I915_CONTEXT_PARAM_VM is
  * returned in the outparam @id.
  *
- * No flags are defined, with all bits reserved and must be zero.
- *
  * An extension chain maybe provided, starting with @extensions, and terminated
  * by the @next_extension being 0. Currently, no extensions are defined.
  *
  * DRM_I915_GEM_VM_DESTROY -
  *
- * Destroys a previously created VM id, specified in @id.
+ * Destroys a previously created VM id, specified in @vm_id.
  *
  * No extensions or flags are allowed currently, and so must be zero.
  */
 struct drm_i915_gem_vm_control {
+	/** @extensions: Zero-terminated chain of extensions. */
 	__u64 extensions;
+
+	/** @flags: reserved for future usage, currently MBZ */
 	__u32 flags;
+
+	/** @vm_id: Id of the VM created or to be destroyed */
 	__u32 vm_id;
 };
 
-- 
2.21.0.rc0.32.g243a4c7e27


^ permalink raw reply related	[flat|nested] 121+ messages in thread

* [Intel-gfx] [RFC v3 3/3] drm/doc/rfc: VM_BIND uapi definition
  2022-05-17 18:32 ` Niranjana Vishwanathapura
@ 2022-05-17 18:32   ` Niranjana Vishwanathapura
  -1 siblings, 0 replies; 121+ messages in thread
From: Niranjana Vishwanathapura @ 2022-05-17 18:32 UTC (permalink / raw)
  To: intel-gfx, dri-devel, daniel.vetter
  Cc: thomas.hellstrom, chris.p.wilson, christian.koenig

VM_BIND and related uapi definitions

v2: Ensure proper kernel-doc formatting with cross references.
    Also add new uapi and documentation as per review comments
    from Daniel.

Signed-off-by: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>
---
 Documentation/gpu/rfc/i915_vm_bind.h | 399 +++++++++++++++++++++++++++
 1 file changed, 399 insertions(+)
 create mode 100644 Documentation/gpu/rfc/i915_vm_bind.h

diff --git a/Documentation/gpu/rfc/i915_vm_bind.h b/Documentation/gpu/rfc/i915_vm_bind.h
new file mode 100644
index 000000000000..589c0a009107
--- /dev/null
+++ b/Documentation/gpu/rfc/i915_vm_bind.h
@@ -0,0 +1,399 @@
+/* SPDX-License-Identifier: MIT */
+/*
+ * Copyright © 2022 Intel Corporation
+ */
+
+/**
+ * DOC: I915_PARAM_HAS_VM_BIND
+ *
+ * VM_BIND feature availability.
+ * See typedef drm_i915_getparam_t param.
+ */
+#define I915_PARAM_HAS_VM_BIND		57
+
+/**
+ * DOC: I915_VM_CREATE_FLAGS_USE_VM_BIND
+ *
+ * Flag to opt-in for VM_BIND mode of binding during VM creation.
+ * See struct drm_i915_gem_vm_control flags.
+ *
+ * A VM in VM_BIND mode will not support the older execbuff mode of binding.
+ * In VM_BIND mode, execbuff ioctl will not accept any execlist (ie., the
+ * &drm_i915_gem_execbuffer2.buffer_count must be 0).
+ * Also, &drm_i915_gem_execbuffer2.batch_start_offset and
+ * &drm_i915_gem_execbuffer2.batch_len must be 0.
+ * DRM_I915_GEM_EXECBUFFER_EXT_BATCH_ADDRESSES extension must be provided
+ * to pass in the batch buffer addresses.
+ *
+ * Additionally, I915_EXEC_NO_RELOC, I915_EXEC_HANDLE_LUT and
+ * I915_EXEC_BATCH_FIRST of &drm_i915_gem_execbuffer2.flags must be 0
+ * (not used) in VM_BIND mode. I915_EXEC_USE_EXTENSIONS flag must always be
+ * set (See struct drm_i915_gem_execbuffer_ext_batch_addresses).
+ * The buffers_ptr, buffer_count, batch_start_offset and batch_len fields
+ * of struct drm_i915_gem_execbuffer2 are also not used and must be 0.
+ */
+#define I915_VM_CREATE_FLAGS_USE_VM_BIND	(1 << 0)
+
+/**
+ * DOC: I915_CONTEXT_CREATE_FLAGS_LONG_RUNNING
+ *
+ * Flag to declare context as long running.
+ * See struct drm_i915_gem_context_create_ext flags.
+ *
+ * Usage of dma-fence expects that they complete in reasonable amount of time.
+ * Compute on the other hand can be long running. Hence it is not appropriate
+ * for compute contexts to export request completion dma-fence to user.
+ * The dma-fence usage will be limited to in-kernel consumption only.
+ * Compute contexts need to use user/memory fence.
+ *
+ * So, long running contexts do not support output fences. Hence,
+ * I915_EXEC_FENCE_OUT (See &drm_i915_gem_execbuffer2.flags and
+ * I915_EXEC_FENCE_SIGNAL (See &drm_i915_gem_exec_fence.flags) are expected
+ * to be not used.
+ *
+ * DRM_I915_GEM_WAIT ioctl call is also not supported for objects mapped
+ * to long running contexts.
+ */
+#define I915_CONTEXT_CREATE_FLAGS_LONG_RUNNING   (1u << 2)
+
+/* VM_BIND related ioctls */
+#define DRM_I915_GEM_VM_BIND		0x3d
+#define DRM_I915_GEM_VM_UNBIND		0x3e
+#define DRM_I915_GEM_WAIT_USER_FENCE	0x3f
+
+#define DRM_IOCTL_I915_GEM_VM_BIND		DRM_IOWR(DRM_COMMAND_BASE + DRM_I915_GEM_VM_BIND, struct drm_i915_gem_vm_bind)
+#define DRM_IOCTL_I915_GEM_VM_UNBIND		DRM_IOWR(DRM_COMMAND_BASE + DRM_I915_GEM_VM_UNBIND, struct drm_i915_gem_vm_bind)
+#define DRM_IOCTL_I915_GEM_WAIT_USER_FENCE	DRM_IOWR(DRM_COMMAND_BASE + DRM_I915_GEM_WAIT_USER_FENCE, struct drm_i915_gem_wait_user_fence)
+
+/**
+ * struct drm_i915_gem_vm_bind - VA to object mapping to bind.
+ *
+ * This structure is passed to VM_BIND ioctl and specifies the mapping of GPU
+ * virtual address (VA) range to the section of an object that should be bound
+ * in the device page table of the specified address space (VM).
+ * The VA range specified must be unique (ie., not currently bound) and can
+ * be mapped to whole object or a section of the object (partial binding).
+ * Multiple VA mappings can be created to the same section of the object
+ * (aliasing).
+ */
+struct drm_i915_gem_vm_bind {
+	/** @vm_id: VM (address space) id to bind */
+	__u32 vm_id;
+
+	/** @handle: Object handle */
+	__u32 handle;
+
+	/** @start: Virtual Address start to bind */
+	__u64 start;
+
+	/** @offset: Offset in object to bind */
+	__u64 offset;
+
+	/** @length: Length of mapping to bind */
+	__u64 length;
+
+	/**
+	 * @flags: Supported flags are,
+	 *
+	 * I915_GEM_VM_BIND_READONLY:
+	 * Mapping is read-only.
+	 *
+	 * I915_GEM_VM_BIND_CAPTURE:
+	 * Capture this mapping in the dump upon GPU error.
+	 */
+	__u64 flags;
+#define I915_GEM_VM_BIND_READONLY    (1 << 0)
+#define I915_GEM_VM_BIND_CAPTURE     (1 << 1)
+
+	/** @extensions: 0-terminated chain of extensions for this mapping. */
+	__u64 extensions;
+};
+
+/**
+ * struct drm_i915_gem_vm_unbind - VA to object mapping to unbind.
+ *
+ * This structure is passed to VM_UNBIND ioctl and specifies the GPU virtual
+ * address (VA) range that should be unbound from the device page table of the
+ * specified address space (VM). The specified VA range must match one of the
+ * mappings created with the VM_BIND ioctl. TLB is flushed upon unbind
+ * completion.
+ */
+struct drm_i915_gem_vm_unbind {
+	/** @vm_id: VM (address space) id to bind */
+	__u32 vm_id;
+
+	/** @rsvd: Reserved for future use; must be zero. */
+	__u32 rsvd;
+
+	/** @start: Virtual Address start to unbind */
+	__u64 start;
+
+	/** @length: Length of mapping to unbind */
+	__u64 length;
+
+	/** @flags: reserved for future usage, currently MBZ */
+	__u64 flags;
+
+	/** @extensions: 0-terminated chain of extensions for this mapping. */
+	__u64 extensions;
+};
+
+/**
+ * struct drm_i915_vm_bind_fence - An input or output fence for the vm_bind
+ * or the vm_unbind work.
+ *
+ * The vm_bind or vm_unbind aync worker will wait for input fence to signal
+ * before starting the binding or unbinding.
+ *
+ * The vm_bind or vm_unbind async worker will signal the returned output fence
+ * after the completion of binding or unbinding.
+ */
+struct drm_i915_vm_bind_fence {
+	/** @handle: User's handle for a drm_syncobj to wait on or signal. */
+	__u32 handle;
+
+	/**
+	 * @flags: Supported flags are,
+	 *
+	 * I915_VM_BIND_FENCE_WAIT:
+	 * Wait for the input fence before binding/unbinding
+	 *
+	 * I915_VM_BIND_FENCE_SIGNAL:
+	 * Return bind/unbind completion fence as output
+	 */
+	__u32 flags;
+#define I915_VM_BIND_FENCE_WAIT            (1<<0)
+#define I915_VM_BIND_FENCE_SIGNAL          (1<<1)
+#define __I915_VM_BIND_FENCE_UNKNOWN_FLAGS (-(I915_VM_BIND_FENCE_SIGNAL << 1))
+};
+
+/**
+ * struct drm_i915_vm_bind_ext_timeline_fences - Timeline fences for vm_bind
+ * and vm_unbind.
+ *
+ * This structure describes an array of timeline drm_syncobj and associated
+ * points for timeline variants of drm_syncobj. These timeline 'drm_syncobj's
+ * can be input or output fences (See struct drm_i915_vm_bind_fence).
+ */
+struct drm_i915_vm_bind_ext_timeline_fences {
+#define I915_VM_BIND_EXT_timeline_FENCES	0
+	/** @base: Extension link. See struct i915_user_extension. */
+	struct i915_user_extension base;
+
+	/**
+	 * @fence_count: Number of elements in the @handles_ptr & @value_ptr
+	 * arrays.
+	 */
+	__u64 fence_count;
+
+	/**
+	 * @handles_ptr: Pointer to an array of struct drm_i915_vm_bind_fence
+	 * of length @fence_count.
+	 */
+	__u64 handles_ptr;
+
+	/**
+	 * @values_ptr: Pointer to an array of u64 values of length
+	 * @fence_count.
+	 * Values must be 0 for a binary drm_syncobj. A Value of 0 for a
+	 * timeline drm_syncobj is invalid as it turns a drm_syncobj into a
+	 * binary one.
+	 */
+	__u64 values_ptr;
+};
+
+/**
+ * struct drm_i915_vm_bind_user_fence - An input or output user fence for the
+ * vm_bind or the vm_unbind work.
+ *
+ * The vm_bind or vm_unbind aync worker will wait for the input fence (value at
+ * @addr to become equal to @val) before starting the binding or unbinding.
+ *
+ * The vm_bind or vm_unbind async worker will signal the output fence after
+ * the completion of binding or unbinding by writing @val to memory location at
+ * @addr
+ */
+struct drm_i915_vm_bind_user_fence {
+	/** @addr: User/Memory fence qword aligned process virtual address */
+	__u64 addr;
+
+	/** @val: User/Memory fence value to be written after bind completion */
+	__u64 val;
+
+	/**
+	 * @flags: Supported flags are,
+	 *
+	 * I915_VM_BIND_USER_FENCE_WAIT:
+	 * Wait for the input fence before binding/unbinding
+	 *
+	 * I915_VM_BIND_USER_FENCE_SIGNAL:
+	 * Return bind/unbind completion fence as output
+	 */
+	__u32 flags;
+#define I915_VM_BIND_USER_FENCE_WAIT            (1<<0)
+#define I915_VM_BIND_USER_FENCE_SIGNAL          (1<<1)
+#define __I915_VM_BIND_USER_FENCE_UNKNOWN_FLAGS \
+	(-(I915_VM_BIND_USER_FENCE_SIGNAL << 1))
+};
+
+/**
+ * struct drm_i915_vm_bind_ext_user_fence - User/memory fences for vm_bind
+ * and vm_unbind.
+ *
+ * These user fences can be input or output fences
+ * (See struct drm_i915_vm_bind_user_fence).
+ */
+struct drm_i915_vm_bind_ext_user_fence {
+#define I915_VM_BIND_EXT_USER_FENCES	1
+	/** @base: Extension link. See struct i915_user_extension. */
+	struct i915_user_extension base;
+
+	/** @fence_count: Number of elements in the @user_fence_ptr array. */
+	__u64 fence_count;
+
+	/**
+	 * @user_fence_ptr: Pointer to an array of
+	 * struct drm_i915_vm_bind_user_fence of length @fence_count.
+	 */
+	__u64 user_fence_ptr;
+};
+
+/**
+ * struct drm_i915_gem_execbuffer_ext_batch_addresses - Array of batch buffer
+ * gpu virtual addresses.
+ *
+ * In the execbuff ioctl (See struct drm_i915_gem_execbuffer2), this extension
+ * must always be appended in the VM_BIND mode and it will be an error to
+ * append this extension in older non-VM_BIND mode.
+ */
+struct drm_i915_gem_execbuffer_ext_batch_addresses {
+#define DRM_I915_GEM_EXECBUFFER_EXT_BATCH_ADDRESSES	1
+	/** @base: Extension link. See struct i915_user_extension. */
+	struct i915_user_extension base;
+
+	/** @count: Number of addresses in the addr array. */
+	__u32 count;
+
+	/** @addr: An array of batch gpu virtual addresses. */
+	__u64 addr[0];
+};
+
+/**
+ * struct drm_i915_gem_execbuffer_ext_user_fence - First level batch completion
+ * signaling extension.
+ *
+ * This extension allows user to attach a user fence (@addr, @value pair) to an
+ * execbuf to be signaled by the command streamer after the completion of first
+ * level batch, by writing the @value at specified @addr and triggering an
+ * interrupt.
+ * User can either poll for this user fence to signal or can also wait on it
+ * with i915_gem_wait_user_fence ioctl.
+ * This is very much usefaul for long running contexts where waiting on dma-fence
+ * by user (like i915_gem_wait ioctl) is not supported.
+ */
+struct drm_i915_gem_execbuffer_ext_user_fence {
+#define DRM_I915_GEM_EXECBUFFER_EXT_USER_FENCE		2
+	/** @base: Extension link. See struct i915_user_extension. */
+	struct i915_user_extension base;
+
+	/**
+	 * @addr: User/Memory fence qword aligned GPU virtual address.
+	 *
+	 * Address has to be a valid GPU virtual address at the time of
+	 * first level batch completion.
+	 */
+	__u64 addr;
+
+	/**
+	 * @value: User/Memory fence Value to be written to above address
+	 * after first level batch completes.
+	 */
+	__u64 value;
+
+	/** @rsvd: Reserved for future extensions, MBZ */
+	__u64 rsvd;
+};
+
+/**
+ * struct drm_i915_gem_create_ext_vm_private - Extension to make the object
+ * private to the specified VM.
+ *
+ * See struct drm_i915_gem_create_ext.
+ */
+struct drm_i915_gem_create_ext_vm_private {
+#define I915_GEM_CREATE_EXT_VM_PRIVATE		2
+	/** @base: Extension link. See struct i915_user_extension. */
+	struct i915_user_extension base;
+
+	/** @vm_id: Id of the VM to which the object is private */
+	__u32 vm_id;
+};
+
+/**
+ * struct drm_i915_gem_wait_user_fence - Wait on user/memory fence.
+ *
+ * User/Memory fence can be woken up either by:
+ *
+ * 1. GPU context indicated by @ctx_id, or,
+ * 2. Kerrnel driver async worker upon I915_UFENCE_WAIT_SOFT.
+ *    @ctx_id is ignored when this flag is set.
+ *
+ * Wakeup condition is,
+ * ``((*addr & mask) op (value & mask))``
+ *
+ * See :ref:`Documentation/driver-api/dma-buf.rst <indefinite_dma_fences>`
+ */
+struct drm_i915_gem_wait_user_fence {
+	/** @extensions: Zero-terminated chain of extensions. */
+	__u64 extensions;
+
+	/** @addr: User/Memory fence address */
+	__u64 addr;
+
+	/** @ctx_id: Id of the Context which will signal the fence. */
+	__u32 ctx_id;
+
+	/** @op: Wakeup condition operator */
+	__u16 op;
+#define I915_UFENCE_WAIT_EQ      0
+#define I915_UFENCE_WAIT_NEQ     1
+#define I915_UFENCE_WAIT_GT      2
+#define I915_UFENCE_WAIT_GTE     3
+#define I915_UFENCE_WAIT_LT      4
+#define I915_UFENCE_WAIT_LTE     5
+#define I915_UFENCE_WAIT_BEFORE  6
+#define I915_UFENCE_WAIT_AFTER   7
+
+	/**
+	 * @flags: Supported flags are,
+	 *
+	 * I915_UFENCE_WAIT_SOFT:
+	 *
+	 * To be woken up by i915 driver async worker (not by GPU).
+	 *
+	 * I915_UFENCE_WAIT_ABSTIME:
+	 *
+	 * Wait timeout specified as absolute time.
+	 */
+	__u16 flags;
+#define I915_UFENCE_WAIT_SOFT    0x1
+#define I915_UFENCE_WAIT_ABSTIME 0x2
+
+	/** @value: Wakeup value */
+	__u64 value;
+
+	/** @mask: Wakeup mask */
+	__u64 mask;
+#define I915_UFENCE_WAIT_U8     0xffu
+#define I915_UFENCE_WAIT_U16    0xffffu
+#define I915_UFENCE_WAIT_U32    0xfffffffful
+#define I915_UFENCE_WAIT_U64    0xffffffffffffffffull
+
+	/**
+	 * @timeout: Wait timeout in nanoseconds.
+	 *
+	 * If I915_UFENCE_WAIT_ABSTIME flag is set, then time timeout is the
+	 * absolute time in nsec.
+	 */
+	__s64 timeout;
+};
-- 
2.21.0.rc0.32.g243a4c7e27


^ permalink raw reply related	[flat|nested] 121+ messages in thread

* [RFC v3 3/3] drm/doc/rfc: VM_BIND uapi definition
@ 2022-05-17 18:32   ` Niranjana Vishwanathapura
  0 siblings, 0 replies; 121+ messages in thread
From: Niranjana Vishwanathapura @ 2022-05-17 18:32 UTC (permalink / raw)
  To: intel-gfx, dri-devel, daniel.vetter
  Cc: matthew.brost, thomas.hellstrom, jason, chris.p.wilson, christian.koenig

VM_BIND and related uapi definitions

v2: Ensure proper kernel-doc formatting with cross references.
    Also add new uapi and documentation as per review comments
    from Daniel.

Signed-off-by: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>
---
 Documentation/gpu/rfc/i915_vm_bind.h | 399 +++++++++++++++++++++++++++
 1 file changed, 399 insertions(+)
 create mode 100644 Documentation/gpu/rfc/i915_vm_bind.h

diff --git a/Documentation/gpu/rfc/i915_vm_bind.h b/Documentation/gpu/rfc/i915_vm_bind.h
new file mode 100644
index 000000000000..589c0a009107
--- /dev/null
+++ b/Documentation/gpu/rfc/i915_vm_bind.h
@@ -0,0 +1,399 @@
+/* SPDX-License-Identifier: MIT */
+/*
+ * Copyright © 2022 Intel Corporation
+ */
+
+/**
+ * DOC: I915_PARAM_HAS_VM_BIND
+ *
+ * VM_BIND feature availability.
+ * See typedef drm_i915_getparam_t param.
+ */
+#define I915_PARAM_HAS_VM_BIND		57
+
+/**
+ * DOC: I915_VM_CREATE_FLAGS_USE_VM_BIND
+ *
+ * Flag to opt-in for VM_BIND mode of binding during VM creation.
+ * See struct drm_i915_gem_vm_control flags.
+ *
+ * A VM in VM_BIND mode will not support the older execbuff mode of binding.
+ * In VM_BIND mode, execbuff ioctl will not accept any execlist (ie., the
+ * &drm_i915_gem_execbuffer2.buffer_count must be 0).
+ * Also, &drm_i915_gem_execbuffer2.batch_start_offset and
+ * &drm_i915_gem_execbuffer2.batch_len must be 0.
+ * DRM_I915_GEM_EXECBUFFER_EXT_BATCH_ADDRESSES extension must be provided
+ * to pass in the batch buffer addresses.
+ *
+ * Additionally, I915_EXEC_NO_RELOC, I915_EXEC_HANDLE_LUT and
+ * I915_EXEC_BATCH_FIRST of &drm_i915_gem_execbuffer2.flags must be 0
+ * (not used) in VM_BIND mode. I915_EXEC_USE_EXTENSIONS flag must always be
+ * set (See struct drm_i915_gem_execbuffer_ext_batch_addresses).
+ * The buffers_ptr, buffer_count, batch_start_offset and batch_len fields
+ * of struct drm_i915_gem_execbuffer2 are also not used and must be 0.
+ */
+#define I915_VM_CREATE_FLAGS_USE_VM_BIND	(1 << 0)
+
+/**
+ * DOC: I915_CONTEXT_CREATE_FLAGS_LONG_RUNNING
+ *
+ * Flag to declare context as long running.
+ * See struct drm_i915_gem_context_create_ext flags.
+ *
+ * Usage of dma-fence expects that they complete in reasonable amount of time.
+ * Compute on the other hand can be long running. Hence it is not appropriate
+ * for compute contexts to export request completion dma-fence to user.
+ * The dma-fence usage will be limited to in-kernel consumption only.
+ * Compute contexts need to use user/memory fence.
+ *
+ * So, long running contexts do not support output fences. Hence,
+ * I915_EXEC_FENCE_OUT (See &drm_i915_gem_execbuffer2.flags and
+ * I915_EXEC_FENCE_SIGNAL (See &drm_i915_gem_exec_fence.flags) are expected
+ * to be not used.
+ *
+ * DRM_I915_GEM_WAIT ioctl call is also not supported for objects mapped
+ * to long running contexts.
+ */
+#define I915_CONTEXT_CREATE_FLAGS_LONG_RUNNING   (1u << 2)
+
+/* VM_BIND related ioctls */
+#define DRM_I915_GEM_VM_BIND		0x3d
+#define DRM_I915_GEM_VM_UNBIND		0x3e
+#define DRM_I915_GEM_WAIT_USER_FENCE	0x3f
+
+#define DRM_IOCTL_I915_GEM_VM_BIND		DRM_IOWR(DRM_COMMAND_BASE + DRM_I915_GEM_VM_BIND, struct drm_i915_gem_vm_bind)
+#define DRM_IOCTL_I915_GEM_VM_UNBIND		DRM_IOWR(DRM_COMMAND_BASE + DRM_I915_GEM_VM_UNBIND, struct drm_i915_gem_vm_bind)
+#define DRM_IOCTL_I915_GEM_WAIT_USER_FENCE	DRM_IOWR(DRM_COMMAND_BASE + DRM_I915_GEM_WAIT_USER_FENCE, struct drm_i915_gem_wait_user_fence)
+
+/**
+ * struct drm_i915_gem_vm_bind - VA to object mapping to bind.
+ *
+ * This structure is passed to VM_BIND ioctl and specifies the mapping of GPU
+ * virtual address (VA) range to the section of an object that should be bound
+ * in the device page table of the specified address space (VM).
+ * The VA range specified must be unique (ie., not currently bound) and can
+ * be mapped to whole object or a section of the object (partial binding).
+ * Multiple VA mappings can be created to the same section of the object
+ * (aliasing).
+ */
+struct drm_i915_gem_vm_bind {
+	/** @vm_id: VM (address space) id to bind */
+	__u32 vm_id;
+
+	/** @handle: Object handle */
+	__u32 handle;
+
+	/** @start: Virtual Address start to bind */
+	__u64 start;
+
+	/** @offset: Offset in object to bind */
+	__u64 offset;
+
+	/** @length: Length of mapping to bind */
+	__u64 length;
+
+	/**
+	 * @flags: Supported flags are,
+	 *
+	 * I915_GEM_VM_BIND_READONLY:
+	 * Mapping is read-only.
+	 *
+	 * I915_GEM_VM_BIND_CAPTURE:
+	 * Capture this mapping in the dump upon GPU error.
+	 */
+	__u64 flags;
+#define I915_GEM_VM_BIND_READONLY    (1 << 0)
+#define I915_GEM_VM_BIND_CAPTURE     (1 << 1)
+
+	/** @extensions: 0-terminated chain of extensions for this mapping. */
+	__u64 extensions;
+};
+
+/**
+ * struct drm_i915_gem_vm_unbind - VA to object mapping to unbind.
+ *
+ * This structure is passed to VM_UNBIND ioctl and specifies the GPU virtual
+ * address (VA) range that should be unbound from the device page table of the
+ * specified address space (VM). The specified VA range must match one of the
+ * mappings created with the VM_BIND ioctl. TLB is flushed upon unbind
+ * completion.
+ */
+struct drm_i915_gem_vm_unbind {
+	/** @vm_id: VM (address space) id to bind */
+	__u32 vm_id;
+
+	/** @rsvd: Reserved for future use; must be zero. */
+	__u32 rsvd;
+
+	/** @start: Virtual Address start to unbind */
+	__u64 start;
+
+	/** @length: Length of mapping to unbind */
+	__u64 length;
+
+	/** @flags: reserved for future usage, currently MBZ */
+	__u64 flags;
+
+	/** @extensions: 0-terminated chain of extensions for this mapping. */
+	__u64 extensions;
+};
+
+/**
+ * struct drm_i915_vm_bind_fence - An input or output fence for the vm_bind
+ * or the vm_unbind work.
+ *
+ * The vm_bind or vm_unbind aync worker will wait for input fence to signal
+ * before starting the binding or unbinding.
+ *
+ * The vm_bind or vm_unbind async worker will signal the returned output fence
+ * after the completion of binding or unbinding.
+ */
+struct drm_i915_vm_bind_fence {
+	/** @handle: User's handle for a drm_syncobj to wait on or signal. */
+	__u32 handle;
+
+	/**
+	 * @flags: Supported flags are,
+	 *
+	 * I915_VM_BIND_FENCE_WAIT:
+	 * Wait for the input fence before binding/unbinding
+	 *
+	 * I915_VM_BIND_FENCE_SIGNAL:
+	 * Return bind/unbind completion fence as output
+	 */
+	__u32 flags;
+#define I915_VM_BIND_FENCE_WAIT            (1<<0)
+#define I915_VM_BIND_FENCE_SIGNAL          (1<<1)
+#define __I915_VM_BIND_FENCE_UNKNOWN_FLAGS (-(I915_VM_BIND_FENCE_SIGNAL << 1))
+};
+
+/**
+ * struct drm_i915_vm_bind_ext_timeline_fences - Timeline fences for vm_bind
+ * and vm_unbind.
+ *
+ * This structure describes an array of timeline drm_syncobj and associated
+ * points for timeline variants of drm_syncobj. These timeline 'drm_syncobj's
+ * can be input or output fences (See struct drm_i915_vm_bind_fence).
+ */
+struct drm_i915_vm_bind_ext_timeline_fences {
+#define I915_VM_BIND_EXT_timeline_FENCES	0
+	/** @base: Extension link. See struct i915_user_extension. */
+	struct i915_user_extension base;
+
+	/**
+	 * @fence_count: Number of elements in the @handles_ptr & @value_ptr
+	 * arrays.
+	 */
+	__u64 fence_count;
+
+	/**
+	 * @handles_ptr: Pointer to an array of struct drm_i915_vm_bind_fence
+	 * of length @fence_count.
+	 */
+	__u64 handles_ptr;
+
+	/**
+	 * @values_ptr: Pointer to an array of u64 values of length
+	 * @fence_count.
+	 * Values must be 0 for a binary drm_syncobj. A Value of 0 for a
+	 * timeline drm_syncobj is invalid as it turns a drm_syncobj into a
+	 * binary one.
+	 */
+	__u64 values_ptr;
+};
+
+/**
+ * struct drm_i915_vm_bind_user_fence - An input or output user fence for the
+ * vm_bind or the vm_unbind work.
+ *
+ * The vm_bind or vm_unbind aync worker will wait for the input fence (value at
+ * @addr to become equal to @val) before starting the binding or unbinding.
+ *
+ * The vm_bind or vm_unbind async worker will signal the output fence after
+ * the completion of binding or unbinding by writing @val to memory location at
+ * @addr
+ */
+struct drm_i915_vm_bind_user_fence {
+	/** @addr: User/Memory fence qword aligned process virtual address */
+	__u64 addr;
+
+	/** @val: User/Memory fence value to be written after bind completion */
+	__u64 val;
+
+	/**
+	 * @flags: Supported flags are,
+	 *
+	 * I915_VM_BIND_USER_FENCE_WAIT:
+	 * Wait for the input fence before binding/unbinding
+	 *
+	 * I915_VM_BIND_USER_FENCE_SIGNAL:
+	 * Return bind/unbind completion fence as output
+	 */
+	__u32 flags;
+#define I915_VM_BIND_USER_FENCE_WAIT            (1<<0)
+#define I915_VM_BIND_USER_FENCE_SIGNAL          (1<<1)
+#define __I915_VM_BIND_USER_FENCE_UNKNOWN_FLAGS \
+	(-(I915_VM_BIND_USER_FENCE_SIGNAL << 1))
+};
+
+/**
+ * struct drm_i915_vm_bind_ext_user_fence - User/memory fences for vm_bind
+ * and vm_unbind.
+ *
+ * These user fences can be input or output fences
+ * (See struct drm_i915_vm_bind_user_fence).
+ */
+struct drm_i915_vm_bind_ext_user_fence {
+#define I915_VM_BIND_EXT_USER_FENCES	1
+	/** @base: Extension link. See struct i915_user_extension. */
+	struct i915_user_extension base;
+
+	/** @fence_count: Number of elements in the @user_fence_ptr array. */
+	__u64 fence_count;
+
+	/**
+	 * @user_fence_ptr: Pointer to an array of
+	 * struct drm_i915_vm_bind_user_fence of length @fence_count.
+	 */
+	__u64 user_fence_ptr;
+};
+
+/**
+ * struct drm_i915_gem_execbuffer_ext_batch_addresses - Array of batch buffer
+ * gpu virtual addresses.
+ *
+ * In the execbuff ioctl (See struct drm_i915_gem_execbuffer2), this extension
+ * must always be appended in the VM_BIND mode and it will be an error to
+ * append this extension in older non-VM_BIND mode.
+ */
+struct drm_i915_gem_execbuffer_ext_batch_addresses {
+#define DRM_I915_GEM_EXECBUFFER_EXT_BATCH_ADDRESSES	1
+	/** @base: Extension link. See struct i915_user_extension. */
+	struct i915_user_extension base;
+
+	/** @count: Number of addresses in the addr array. */
+	__u32 count;
+
+	/** @addr: An array of batch gpu virtual addresses. */
+	__u64 addr[0];
+};
+
+/**
+ * struct drm_i915_gem_execbuffer_ext_user_fence - First level batch completion
+ * signaling extension.
+ *
+ * This extension allows user to attach a user fence (@addr, @value pair) to an
+ * execbuf to be signaled by the command streamer after the completion of first
+ * level batch, by writing the @value at specified @addr and triggering an
+ * interrupt.
+ * User can either poll for this user fence to signal or can also wait on it
+ * with i915_gem_wait_user_fence ioctl.
+ * This is very much usefaul for long running contexts where waiting on dma-fence
+ * by user (like i915_gem_wait ioctl) is not supported.
+ */
+struct drm_i915_gem_execbuffer_ext_user_fence {
+#define DRM_I915_GEM_EXECBUFFER_EXT_USER_FENCE		2
+	/** @base: Extension link. See struct i915_user_extension. */
+	struct i915_user_extension base;
+
+	/**
+	 * @addr: User/Memory fence qword aligned GPU virtual address.
+	 *
+	 * Address has to be a valid GPU virtual address at the time of
+	 * first level batch completion.
+	 */
+	__u64 addr;
+
+	/**
+	 * @value: User/Memory fence Value to be written to above address
+	 * after first level batch completes.
+	 */
+	__u64 value;
+
+	/** @rsvd: Reserved for future extensions, MBZ */
+	__u64 rsvd;
+};
+
+/**
+ * struct drm_i915_gem_create_ext_vm_private - Extension to make the object
+ * private to the specified VM.
+ *
+ * See struct drm_i915_gem_create_ext.
+ */
+struct drm_i915_gem_create_ext_vm_private {
+#define I915_GEM_CREATE_EXT_VM_PRIVATE		2
+	/** @base: Extension link. See struct i915_user_extension. */
+	struct i915_user_extension base;
+
+	/** @vm_id: Id of the VM to which the object is private */
+	__u32 vm_id;
+};
+
+/**
+ * struct drm_i915_gem_wait_user_fence - Wait on user/memory fence.
+ *
+ * User/Memory fence can be woken up either by:
+ *
+ * 1. GPU context indicated by @ctx_id, or,
+ * 2. Kerrnel driver async worker upon I915_UFENCE_WAIT_SOFT.
+ *    @ctx_id is ignored when this flag is set.
+ *
+ * Wakeup condition is,
+ * ``((*addr & mask) op (value & mask))``
+ *
+ * See :ref:`Documentation/driver-api/dma-buf.rst <indefinite_dma_fences>`
+ */
+struct drm_i915_gem_wait_user_fence {
+	/** @extensions: Zero-terminated chain of extensions. */
+	__u64 extensions;
+
+	/** @addr: User/Memory fence address */
+	__u64 addr;
+
+	/** @ctx_id: Id of the Context which will signal the fence. */
+	__u32 ctx_id;
+
+	/** @op: Wakeup condition operator */
+	__u16 op;
+#define I915_UFENCE_WAIT_EQ      0
+#define I915_UFENCE_WAIT_NEQ     1
+#define I915_UFENCE_WAIT_GT      2
+#define I915_UFENCE_WAIT_GTE     3
+#define I915_UFENCE_WAIT_LT      4
+#define I915_UFENCE_WAIT_LTE     5
+#define I915_UFENCE_WAIT_BEFORE  6
+#define I915_UFENCE_WAIT_AFTER   7
+
+	/**
+	 * @flags: Supported flags are,
+	 *
+	 * I915_UFENCE_WAIT_SOFT:
+	 *
+	 * To be woken up by i915 driver async worker (not by GPU).
+	 *
+	 * I915_UFENCE_WAIT_ABSTIME:
+	 *
+	 * Wait timeout specified as absolute time.
+	 */
+	__u16 flags;
+#define I915_UFENCE_WAIT_SOFT    0x1
+#define I915_UFENCE_WAIT_ABSTIME 0x2
+
+	/** @value: Wakeup value */
+	__u64 value;
+
+	/** @mask: Wakeup mask */
+	__u64 mask;
+#define I915_UFENCE_WAIT_U8     0xffu
+#define I915_UFENCE_WAIT_U16    0xffffu
+#define I915_UFENCE_WAIT_U32    0xfffffffful
+#define I915_UFENCE_WAIT_U64    0xffffffffffffffffull
+
+	/**
+	 * @timeout: Wait timeout in nanoseconds.
+	 *
+	 * If I915_UFENCE_WAIT_ABSTIME flag is set, then time timeout is the
+	 * absolute time in nsec.
+	 */
+	__s64 timeout;
+};
-- 
2.21.0.rc0.32.g243a4c7e27


^ permalink raw reply related	[flat|nested] 121+ messages in thread

* [Intel-gfx] ✗ Fi.CI.CHECKPATCH: warning for drm/doc/rfc: i915 VM_BIND feature design + uapi (rev3)
  2022-05-17 18:32 ` Niranjana Vishwanathapura
                   ` (3 preceding siblings ...)
  (?)
@ 2022-05-17 20:49 ` Patchwork
  -1 siblings, 0 replies; 121+ messages in thread
From: Patchwork @ 2022-05-17 20:49 UTC (permalink / raw)
  To: Niranjana Vishwanathapura; +Cc: intel-gfx

== Series Details ==

Series: drm/doc/rfc: i915 VM_BIND feature design + uapi (rev3)
URL   : https://patchwork.freedesktop.org/series/93447/
State : warning

== Summary ==

Error: dim checkpatch failed
b4f01b5605b4 drm/doc/rfc: VM_BIND feature design document
-:27: WARNING:FILE_PATH_CHANGES: added, moved or deleted file(s), does MAINTAINERS need updating?
#27: 
new file mode 100644

-:32: WARNING:SPDX_LICENSE_TAG: Missing or malformed SPDX-License-Identifier tag in line 1
#32: FILE: Documentation/gpu/rfc/i915_vm_bind.rst:1:
+==========================================

total: 0 errors, 2 warnings, 0 checks, 319 lines checked
50b7adcbd762 drm/i915: Update i915 uapi documentation
e72771b5018c drm/doc/rfc: VM_BIND uapi definition
Traceback (most recent call last):
  File "scripts/spdxcheck.py", line 6, in <module>
    from ply import lex, yacc
ModuleNotFoundError: No module named 'ply'
-:15: WARNING:FILE_PATH_CHANGES: added, moved or deleted file(s), does MAINTAINERS need updating?
#15: 
new file mode 100644

-:83: WARNING:LONG_LINE: line length of 126 exceeds 100 columns
#83: FILE: Documentation/gpu/rfc/i915_vm_bind.h:64:
+#define DRM_IOCTL_I915_GEM_VM_BIND		DRM_IOWR(DRM_COMMAND_BASE + DRM_I915_GEM_VM_BIND, struct drm_i915_gem_vm_bind)

-:84: WARNING:LONG_LINE: line length of 128 exceeds 100 columns
#84: FILE: Documentation/gpu/rfc/i915_vm_bind.h:65:
+#define DRM_IOCTL_I915_GEM_VM_UNBIND		DRM_IOWR(DRM_COMMAND_BASE + DRM_I915_GEM_VM_UNBIND, struct drm_i915_gem_vm_bind)

-:85: WARNING:LONG_LINE: line length of 142 exceeds 100 columns
#85: FILE: Documentation/gpu/rfc/i915_vm_bind.h:66:
+#define DRM_IOCTL_I915_GEM_WAIT_USER_FENCE	DRM_IOWR(DRM_COMMAND_BASE + DRM_I915_GEM_WAIT_USER_FENCE, struct drm_i915_gem_wait_user_fence)

-:184: CHECK:SPACING: spaces preferred around that '<<' (ctx:VxV)
#184: FILE: Documentation/gpu/rfc/i915_vm_bind.h:165:
+#define I915_VM_BIND_FENCE_WAIT            (1<<0)
                                              ^

-:185: CHECK:SPACING: spaces preferred around that '<<' (ctx:VxV)
#185: FILE: Documentation/gpu/rfc/i915_vm_bind.h:166:
+#define I915_VM_BIND_FENCE_SIGNAL          (1<<1)
                                              ^

-:252: CHECK:SPACING: spaces preferred around that '<<' (ctx:VxV)
#252: FILE: Documentation/gpu/rfc/i915_vm_bind.h:233:
+#define I915_VM_BIND_USER_FENCE_WAIT            (1<<0)
                                                   ^

-:253: CHECK:SPACING: spaces preferred around that '<<' (ctx:VxV)
#253: FILE: Documentation/gpu/rfc/i915_vm_bind.h:234:
+#define I915_VM_BIND_USER_FENCE_SIGNAL          (1<<1)
                                                   ^

total: 0 errors, 4 warnings, 4 checks, 399 lines checked



^ permalink raw reply	[flat|nested] 121+ messages in thread

* [Intel-gfx] ✗ Fi.CI.SPARSE: warning for drm/doc/rfc: i915 VM_BIND feature design + uapi (rev3)
  2022-05-17 18:32 ` Niranjana Vishwanathapura
                   ` (4 preceding siblings ...)
  (?)
@ 2022-05-17 20:49 ` Patchwork
  -1 siblings, 0 replies; 121+ messages in thread
From: Patchwork @ 2022-05-17 20:49 UTC (permalink / raw)
  To: Niranjana Vishwanathapura; +Cc: intel-gfx

== Series Details ==

Series: drm/doc/rfc: i915 VM_BIND feature design + uapi (rev3)
URL   : https://patchwork.freedesktop.org/series/93447/
State : warning

== Summary ==

Error: dim sparse failed
Sparse version: v0.6.2
Fast mode used, each commit won't be checked separately.



^ permalink raw reply	[flat|nested] 121+ messages in thread

* [Intel-gfx] ✓ Fi.CI.BAT: success for drm/doc/rfc: i915 VM_BIND feature design + uapi (rev3)
  2022-05-17 18:32 ` Niranjana Vishwanathapura
                   ` (5 preceding siblings ...)
  (?)
@ 2022-05-17 21:09 ` Patchwork
  -1 siblings, 0 replies; 121+ messages in thread
From: Patchwork @ 2022-05-17 21:09 UTC (permalink / raw)
  To: Niranjana Vishwanathapura; +Cc: intel-gfx

[-- Attachment #1: Type: text/plain, Size: 7689 bytes --]

== Series Details ==

Series: drm/doc/rfc: i915 VM_BIND feature design + uapi (rev3)
URL   : https://patchwork.freedesktop.org/series/93447/
State : success

== Summary ==

CI Bug Log - changes from CI_DRM_11668 -> Patchwork_93447v3
====================================================

Summary
-------

  **SUCCESS**

  No regressions found.

  External URL: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/index.html

Participating hosts (40 -> 40)
------------------------------

  Additional (2): fi-kbl-soraka bat-dg2-8 
  Missing    (2): fi-rkl-11600 bat-dg2-9 

Known issues
------------

  Here are the changes found in Patchwork_93447v3 that come from known issues:

### IGT changes ###

#### Issues hit ####

  * igt@gem_exec_fence@basic-busy@bcs0:
    - fi-kbl-soraka:      NOTRUN -> [SKIP][1] ([fdo#109271]) +9 similar issues
   [1]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/fi-kbl-soraka/igt@gem_exec_fence@basic-busy@bcs0.html

  * igt@gem_huc_copy@huc-copy:
    - fi-kbl-soraka:      NOTRUN -> [SKIP][2] ([fdo#109271] / [i915#2190])
   [2]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/fi-kbl-soraka/igt@gem_huc_copy@huc-copy.html

  * igt@gem_lmem_swapping@basic:
    - fi-kbl-soraka:      NOTRUN -> [SKIP][3] ([fdo#109271] / [i915#4613]) +3 similar issues
   [3]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/fi-kbl-soraka/igt@gem_lmem_swapping@basic.html

  * igt@i915_selftest@live@gem:
    - fi-blb-e6850:       NOTRUN -> [DMESG-FAIL][4] ([i915#4528])
   [4]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/fi-blb-e6850/igt@i915_selftest@live@gem.html

  * igt@i915_selftest@live@gem_migrate:
    - fi-bdw-5557u:       [PASS][5] -> [INCOMPLETE][6] ([i915#5716])
   [5]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11668/fi-bdw-5557u/igt@i915_selftest@live@gem_migrate.html
   [6]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/fi-bdw-5557u/igt@i915_selftest@live@gem_migrate.html

  * igt@i915_selftest@live@gt_pm:
    - fi-kbl-soraka:      NOTRUN -> [DMESG-FAIL][7] ([i915#1886])
   [7]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/fi-kbl-soraka/igt@i915_selftest@live@gt_pm.html

  * igt@kms_chamelium@common-hpd-after-suspend:
    - fi-hsw-4770:        NOTRUN -> [SKIP][8] ([fdo#109271] / [fdo#111827])
   [8]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/fi-hsw-4770/igt@kms_chamelium@common-hpd-after-suspend.html

  * igt@kms_chamelium@dp-edid-read:
    - fi-kbl-soraka:      NOTRUN -> [SKIP][9] ([fdo#109271] / [fdo#111827]) +7 similar issues
   [9]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/fi-kbl-soraka/igt@kms_chamelium@dp-edid-read.html

  * igt@kms_pipe_crc_basic@compare-crc-sanitycheck-pipe-d:
    - fi-kbl-soraka:      NOTRUN -> [SKIP][10] ([fdo#109271] / [i915#533])
   [10]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/fi-kbl-soraka/igt@kms_pipe_crc_basic@compare-crc-sanitycheck-pipe-d.html

  
#### Possible fixes ####

  * igt@i915_selftest@live@hangcheck:
    - bat-dg1-5:          [DMESG-FAIL][11] ([i915#4494] / [i915#4957]) -> [PASS][12]
   [11]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11668/bat-dg1-5/igt@i915_selftest@live@hangcheck.html
   [12]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/bat-dg1-5/igt@i915_selftest@live@hangcheck.html
    - fi-hsw-4770:        [INCOMPLETE][13] ([i915#4785]) -> [PASS][14]
   [13]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11668/fi-hsw-4770/igt@i915_selftest@live@hangcheck.html
   [14]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/fi-hsw-4770/igt@i915_selftest@live@hangcheck.html

  * igt@i915_selftest@live@requests:
    - fi-blb-e6850:       [DMESG-FAIL][15] ([i915#4528]) -> [PASS][16]
   [15]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11668/fi-blb-e6850/igt@i915_selftest@live@requests.html
   [16]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/fi-blb-e6850/igt@i915_selftest@live@requests.html

  * igt@kms_flip@basic-flip-vs-modeset@a-edp1:
    - {bat-adlp-6}:       [DMESG-WARN][17] ([i915#3576]) -> [PASS][18] +1 similar issue
   [17]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11668/bat-adlp-6/igt@kms_flip@basic-flip-vs-modeset@a-edp1.html
   [18]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/bat-adlp-6/igt@kms_flip@basic-flip-vs-modeset@a-edp1.html

  
  {name}: This element is suppressed. This means it is ignored when computing
          the status of the difference (SUCCESS, WARNING, or FAILURE).

  [fdo#109271]: https://bugs.freedesktop.org/show_bug.cgi?id=109271
  [fdo#109285]: https://bugs.freedesktop.org/show_bug.cgi?id=109285
  [fdo#111827]: https://bugs.freedesktop.org/show_bug.cgi?id=111827
  [i915#1072]: https://gitlab.freedesktop.org/drm/intel/issues/1072
  [i915#1155]: https://gitlab.freedesktop.org/drm/intel/issues/1155
  [i915#1886]: https://gitlab.freedesktop.org/drm/intel/issues/1886
  [i915#2190]: https://gitlab.freedesktop.org/drm/intel/issues/2190
  [i915#2582]: https://gitlab.freedesktop.org/drm/intel/issues/2582
  [i915#3291]: https://gitlab.freedesktop.org/drm/intel/issues/3291
  [i915#3547]: https://gitlab.freedesktop.org/drm/intel/issues/3547
  [i915#3555]: https://gitlab.freedesktop.org/drm/intel/issues/3555
  [i915#3576]: https://gitlab.freedesktop.org/drm/intel/issues/3576
  [i915#3595]: https://gitlab.freedesktop.org/drm/intel/issues/3595
  [i915#3708]: https://gitlab.freedesktop.org/drm/intel/issues/3708
  [i915#4077]: https://gitlab.freedesktop.org/drm/intel/issues/4077
  [i915#4079]: https://gitlab.freedesktop.org/drm/intel/issues/4079
  [i915#4083]: https://gitlab.freedesktop.org/drm/intel/issues/4083
  [i915#4103]: https://gitlab.freedesktop.org/drm/intel/issues/4103
  [i915#4212]: https://gitlab.freedesktop.org/drm/intel/issues/4212
  [i915#4213]: https://gitlab.freedesktop.org/drm/intel/issues/4213
  [i915#4215]: https://gitlab.freedesktop.org/drm/intel/issues/4215
  [i915#4494]: https://gitlab.freedesktop.org/drm/intel/issues/4494
  [i915#4528]: https://gitlab.freedesktop.org/drm/intel/issues/4528
  [i915#4579]: https://gitlab.freedesktop.org/drm/intel/issues/4579
  [i915#4613]: https://gitlab.freedesktop.org/drm/intel/issues/4613
  [i915#4785]: https://gitlab.freedesktop.org/drm/intel/issues/4785
  [i915#4873]: https://gitlab.freedesktop.org/drm/intel/issues/4873
  [i915#4957]: https://gitlab.freedesktop.org/drm/intel/issues/4957
  [i915#5190]: https://gitlab.freedesktop.org/drm/intel/issues/5190
  [i915#5274]: https://gitlab.freedesktop.org/drm/intel/issues/5274
  [i915#533]: https://gitlab.freedesktop.org/drm/intel/issues/533
  [i915#5354]: https://gitlab.freedesktop.org/drm/intel/issues/5354
  [i915#5716]: https://gitlab.freedesktop.org/drm/intel/issues/5716
  [i915#5763]: https://gitlab.freedesktop.org/drm/intel/issues/5763
  [i915#5879]: https://gitlab.freedesktop.org/drm/intel/issues/5879
  [i915#5903]: https://gitlab.freedesktop.org/drm/intel/issues/5903


Build changes
-------------

  * Linux: CI_DRM_11668 -> Patchwork_93447v3

  CI-20190529: 20190529
  CI_DRM_11668: 0aeb4ff42e2e9fd1dee49e6bb79cc81c8eafd3fc @ git://anongit.freedesktop.org/gfx-ci/linux
  IGT_6477: 70cfef35851891aeaa829f5e8dcb7fd43b454bde @ https://gitlab.freedesktop.org/drm/igt-gpu-tools.git
  Patchwork_93447v3: 0aeb4ff42e2e9fd1dee49e6bb79cc81c8eafd3fc @ git://anongit.freedesktop.org/gfx-ci/linux


### Linux commits

45b73cbaf9c4 drm/doc/rfc: VM_BIND uapi definition
8b2e89a9b707 drm/i915: Update i915 uapi documentation
f1f39186a68b drm/doc/rfc: VM_BIND feature design document

== Logs ==

For more details see: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/index.html

[-- Attachment #2: Type: text/html, Size: 7460 bytes --]

^ permalink raw reply	[flat|nested] 121+ messages in thread

* [Intel-gfx] ✗ Fi.CI.IGT: failure for drm/doc/rfc: i915 VM_BIND feature design + uapi (rev3)
  2022-05-17 18:32 ` Niranjana Vishwanathapura
                   ` (6 preceding siblings ...)
  (?)
@ 2022-05-18  2:33 ` Patchwork
  -1 siblings, 0 replies; 121+ messages in thread
From: Patchwork @ 2022-05-18  2:33 UTC (permalink / raw)
  To: Niranjana Vishwanathapura; +Cc: intel-gfx

[-- Attachment #1: Type: text/plain, Size: 55368 bytes --]

== Series Details ==

Series: drm/doc/rfc: i915 VM_BIND feature design + uapi (rev3)
URL   : https://patchwork.freedesktop.org/series/93447/
State : failure

== Summary ==

CI Bug Log - changes from CI_DRM_11668_full -> Patchwork_93447v3_full
====================================================

Summary
-------

  **FAILURE**

  Serious unknown changes coming with Patchwork_93447v3_full absolutely need to be
  verified manually.
  
  If you think the reported changes have nothing to do with the changes
  introduced in Patchwork_93447v3_full, please notify your bug team to allow them
  to document this new failure mode, which will reduce false positives in CI.

  

Participating hosts (11 -> 11)
------------------------------

  No changes in participating hosts

Possible new issues
-------------------

  Here are the unknown changes that may have been introduced in Patchwork_93447v3_full:

### IGT changes ###

#### Possible regressions ####

  * igt@kms_dither@fb-8bpc-vs-panel-6bpc@edp-1-pipe-a:
    - shard-iclb:         NOTRUN -> [SKIP][1]
   [1]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-iclb8/igt@kms_dither@fb-8bpc-vs-panel-6bpc@edp-1-pipe-a.html

  
Known issues
------------

  Here are the changes found in Patchwork_93447v3_full that come from known issues:

### CI changes ###

#### Issues hit ####

  * boot:
    - shard-apl:          ([PASS][2], [PASS][3], [PASS][4], [PASS][5], [PASS][6], [PASS][7], [PASS][8], [PASS][9], [PASS][10], [PASS][11], [PASS][12], [PASS][13], [PASS][14], [PASS][15], [PASS][16], [PASS][17], [PASS][18], [PASS][19], [PASS][20], [PASS][21], [PASS][22], [PASS][23], [PASS][24], [PASS][25], [PASS][26]) -> ([PASS][27], [PASS][28], [PASS][29], [PASS][30], [PASS][31], [PASS][32], [PASS][33], [PASS][34], [PASS][35], [PASS][36], [PASS][37], [PASS][38], [PASS][39], [PASS][40], [PASS][41], [PASS][42], [PASS][43], [PASS][44], [PASS][45], [FAIL][46], [PASS][47], [PASS][48], [PASS][49], [PASS][50], [PASS][51]) ([i915#4386])
   [2]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11668/shard-apl8/boot.html
   [3]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11668/shard-apl8/boot.html
   [4]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11668/shard-apl8/boot.html
   [5]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11668/shard-apl7/boot.html
   [6]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11668/shard-apl7/boot.html
   [7]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11668/shard-apl7/boot.html
   [8]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11668/shard-apl7/boot.html
   [9]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11668/shard-apl6/boot.html
   [10]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11668/shard-apl6/boot.html
   [11]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11668/shard-apl6/boot.html
   [12]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11668/shard-apl6/boot.html
   [13]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11668/shard-apl4/boot.html
   [14]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11668/shard-apl4/boot.html
   [15]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11668/shard-apl4/boot.html
   [16]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11668/shard-apl4/boot.html
   [17]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11668/shard-apl3/boot.html
   [18]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11668/shard-apl3/boot.html
   [19]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11668/shard-apl3/boot.html
   [20]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11668/shard-apl2/boot.html
   [21]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11668/shard-apl2/boot.html
   [22]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11668/shard-apl2/boot.html
   [23]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11668/shard-apl1/boot.html
   [24]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11668/shard-apl1/boot.html
   [25]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11668/shard-apl1/boot.html
   [26]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11668/shard-apl1/boot.html
   [27]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-apl8/boot.html
   [28]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-apl8/boot.html
   [29]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-apl8/boot.html
   [30]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-apl7/boot.html
   [31]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-apl7/boot.html
   [32]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-apl7/boot.html
   [33]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-apl6/boot.html
   [34]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-apl6/boot.html
   [35]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-apl6/boot.html
   [36]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-apl4/boot.html
   [37]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-apl4/boot.html
   [38]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-apl4/boot.html
   [39]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-apl4/boot.html
   [40]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-apl3/boot.html
   [41]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-apl3/boot.html
   [42]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-apl3/boot.html
   [43]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-apl3/boot.html
   [44]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-apl2/boot.html
   [45]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-apl2/boot.html
   [46]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-apl2/boot.html
   [47]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-apl2/boot.html
   [48]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-apl2/boot.html
   [49]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-apl1/boot.html
   [50]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-apl1/boot.html
   [51]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-apl1/boot.html

  

### IGT changes ###

#### Issues hit ####

  * igt@gem_ccs@ctrl-surf-copy:
    - shard-iclb:         NOTRUN -> [SKIP][52] ([i915#5327])
   [52]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-iclb8/igt@gem_ccs@ctrl-surf-copy.html

  * igt@gem_ccs@suspend-resume:
    - shard-kbl:          NOTRUN -> [SKIP][53] ([fdo#109271]) +27 similar issues
   [53]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-kbl6/igt@gem_ccs@suspend-resume.html

  * igt@gem_eio@unwedge-stress:
    - shard-snb:          NOTRUN -> [FAIL][54] ([i915#3354])
   [54]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-snb7/igt@gem_eio@unwedge-stress.html

  * igt@gem_exec_fair@basic-none-solo@rcs0:
    - shard-apl:          [PASS][55] -> [FAIL][56] ([i915#2842])
   [55]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11668/shard-apl8/igt@gem_exec_fair@basic-none-solo@rcs0.html
   [56]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-apl8/igt@gem_exec_fair@basic-none-solo@rcs0.html
    - shard-iclb:         NOTRUN -> [FAIL][57] ([i915#2842])
   [57]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-iclb7/igt@gem_exec_fair@basic-none-solo@rcs0.html

  * igt@gem_exec_fair@basic-none@rcs0:
    - shard-kbl:          [PASS][58] -> [FAIL][59] ([i915#2842]) +5 similar issues
   [58]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11668/shard-kbl4/igt@gem_exec_fair@basic-none@rcs0.html
   [59]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-kbl1/igt@gem_exec_fair@basic-none@rcs0.html

  * igt@gem_exec_fair@basic-pace@bcs0:
    - shard-iclb:         [PASS][60] -> [FAIL][61] ([i915#2842])
   [60]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11668/shard-iclb2/igt@gem_exec_fair@basic-pace@bcs0.html
   [61]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-iclb4/igt@gem_exec_fair@basic-pace@bcs0.html

  * igt@gem_exec_fair@basic-throttle@rcs0:
    - shard-iclb:         [PASS][62] -> [FAIL][63] ([i915#2849])
   [62]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11668/shard-iclb8/igt@gem_exec_fair@basic-throttle@rcs0.html
   [63]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-iclb5/igt@gem_exec_fair@basic-throttle@rcs0.html

  * igt@gem_exec_flush@basic-uc-prw-default:
    - shard-snb:          [PASS][64] -> [SKIP][65] ([fdo#109271]) +2 similar issues
   [64]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11668/shard-snb2/igt@gem_exec_flush@basic-uc-prw-default.html
   [65]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-snb6/igt@gem_exec_flush@basic-uc-prw-default.html

  * igt@gem_lmem_swapping@parallel-multi:
    - shard-apl:          NOTRUN -> [SKIP][66] ([fdo#109271] / [i915#4613])
   [66]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-apl2/igt@gem_lmem_swapping@parallel-multi.html

  * igt@gem_lmem_swapping@verify-random:
    - shard-skl:          NOTRUN -> [SKIP][67] ([fdo#109271] / [i915#4613]) +1 similar issue
   [67]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-skl1/igt@gem_lmem_swapping@verify-random.html

  * igt@gem_mmap_gtt@fault-concurrent-y:
    - shard-snb:          [PASS][68] -> [INCOMPLETE][69] ([i915#5161])
   [68]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11668/shard-snb4/igt@gem_mmap_gtt@fault-concurrent-y.html
   [69]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-snb7/igt@gem_mmap_gtt@fault-concurrent-y.html

  * igt@gem_pxp@protected-raw-src-copy-not-readible:
    - shard-iclb:         NOTRUN -> [SKIP][70] ([i915#4270])
   [70]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-iclb7/igt@gem_pxp@protected-raw-src-copy-not-readible.html

  * igt@gem_render_copy@y-tiled-mc-ccs-to-vebox-y-tiled:
    - shard-iclb:         NOTRUN -> [SKIP][71] ([i915#768])
   [71]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-iclb8/igt@gem_render_copy@y-tiled-mc-ccs-to-vebox-y-tiled.html

  * igt@gem_userptr_blits@dmabuf-sync:
    - shard-iclb:         NOTRUN -> [SKIP][72] ([i915#3323])
   [72]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-iclb8/igt@gem_userptr_blits@dmabuf-sync.html

  * igt@gem_userptr_blits@dmabuf-unsync:
    - shard-iclb:         NOTRUN -> [SKIP][73] ([i915#3297])
   [73]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-iclb8/igt@gem_userptr_blits@dmabuf-unsync.html

  * igt@gem_userptr_blits@vma-merge:
    - shard-skl:          NOTRUN -> [FAIL][74] ([i915#3318])
   [74]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-skl1/igt@gem_userptr_blits@vma-merge.html

  * igt@gen3_mixed_blits:
    - shard-iclb:         NOTRUN -> [SKIP][75] ([fdo#109289]) +1 similar issue
   [75]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-iclb8/igt@gen3_mixed_blits.html

  * igt@gen9_exec_parse@bb-oversize:
    - shard-tglb:         NOTRUN -> [SKIP][76] ([i915#2527] / [i915#2856])
   [76]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-tglb2/igt@gen9_exec_parse@bb-oversize.html

  * igt@gen9_exec_parse@bb-start-out:
    - shard-iclb:         NOTRUN -> [SKIP][77] ([i915#2856])
   [77]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-iclb8/igt@gen9_exec_parse@bb-start-out.html

  * igt@i915_pm_dc@dc6-psr:
    - shard-skl:          NOTRUN -> [FAIL][78] ([i915#454])
   [78]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-skl10/igt@i915_pm_dc@dc6-psr.html

  * igt@i915_pm_rc6_residency@rc6-fence:
    - shard-iclb:         NOTRUN -> [WARN][79] ([i915#2684])
   [79]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-iclb8/igt@i915_pm_rc6_residency@rc6-fence.html

  * igt@i915_pm_rpm@modeset-non-lpsp-stress:
    - shard-tglb:         NOTRUN -> [SKIP][80] ([fdo#111644] / [i915#1397] / [i915#2411])
   [80]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-tglb2/igt@i915_pm_rpm@modeset-non-lpsp-stress.html

  * igt@kms_atomic_transition@plane-all-modeset-transition-fencing:
    - shard-iclb:         NOTRUN -> [SKIP][81] ([i915#1769])
   [81]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-iclb7/igt@kms_atomic_transition@plane-all-modeset-transition-fencing.html

  * igt@kms_big_fb@4-tiled-16bpp-rotate-0:
    - shard-tglb:         NOTRUN -> [SKIP][82] ([i915#5286])
   [82]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-tglb2/igt@kms_big_fb@4-tiled-16bpp-rotate-0.html

  * igt@kms_big_fb@4-tiled-max-hw-stride-64bpp-rotate-0-hflip:
    - shard-iclb:         NOTRUN -> [SKIP][83] ([i915#5286]) +1 similar issue
   [83]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-iclb7/igt@kms_big_fb@4-tiled-max-hw-stride-64bpp-rotate-0-hflip.html

  * igt@kms_big_fb@linear-32bpp-rotate-90:
    - shard-iclb:         NOTRUN -> [SKIP][84] ([fdo#110725] / [fdo#111614]) +1 similar issue
   [84]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-iclb8/igt@kms_big_fb@linear-32bpp-rotate-90.html

  * igt@kms_big_fb@y-tiled-64bpp-rotate-270:
    - shard-tglb:         NOTRUN -> [SKIP][85] ([fdo#111614])
   [85]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-tglb2/igt@kms_big_fb@y-tiled-64bpp-rotate-270.html

  * igt@kms_big_fb@yf-tiled-16bpp-rotate-270:
    - shard-tglb:         NOTRUN -> [SKIP][86] ([fdo#111615])
   [86]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-tglb2/igt@kms_big_fb@yf-tiled-16bpp-rotate-270.html

  * igt@kms_ccs@pipe-a-ccs-on-another-bo-y_tiled_gen12_mc_ccs:
    - shard-tglb:         NOTRUN -> [SKIP][87] ([i915#3689] / [i915#3886])
   [87]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-tglb2/igt@kms_ccs@pipe-a-ccs-on-another-bo-y_tiled_gen12_mc_ccs.html

  * igt@kms_ccs@pipe-a-crc-primary-basic-y_tiled_gen12_rc_ccs_cc:
    - shard-apl:          NOTRUN -> [SKIP][88] ([fdo#109271] / [i915#3886]) +1 similar issue
   [88]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-apl3/igt@kms_ccs@pipe-a-crc-primary-basic-y_tiled_gen12_rc_ccs_cc.html

  * igt@kms_ccs@pipe-b-bad-pixel-format-y_tiled_gen12_rc_ccs_cc:
    - shard-kbl:          NOTRUN -> [SKIP][89] ([fdo#109271] / [i915#3886])
   [89]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-kbl6/igt@kms_ccs@pipe-b-bad-pixel-format-y_tiled_gen12_rc_ccs_cc.html

  * igt@kms_ccs@pipe-b-ccs-on-another-bo-yf_tiled_ccs:
    - shard-tglb:         NOTRUN -> [SKIP][90] ([fdo#111615] / [i915#3689])
   [90]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-tglb2/igt@kms_ccs@pipe-b-ccs-on-another-bo-yf_tiled_ccs.html

  * igt@kms_ccs@pipe-b-crc-primary-rotation-180-y_tiled_gen12_rc_ccs_cc:
    - shard-skl:          NOTRUN -> [SKIP][91] ([fdo#109271] / [i915#3886]) +5 similar issues
   [91]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-skl3/igt@kms_ccs@pipe-b-crc-primary-rotation-180-y_tiled_gen12_rc_ccs_cc.html

  * igt@kms_ccs@pipe-b-random-ccs-data-y_tiled_gen12_rc_ccs_cc:
    - shard-iclb:         NOTRUN -> [SKIP][92] ([fdo#109278] / [i915#3886]) +4 similar issues
   [92]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-iclb8/igt@kms_ccs@pipe-b-random-ccs-data-y_tiled_gen12_rc_ccs_cc.html

  * igt@kms_cdclk@mode-transition:
    - shard-apl:          NOTRUN -> [SKIP][93] ([fdo#109271]) +60 similar issues
   [93]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-apl7/igt@kms_cdclk@mode-transition.html

  * igt@kms_chamelium@hdmi-edid-read:
    - shard-tglb:         NOTRUN -> [SKIP][94] ([fdo#109284] / [fdo#111827]) +2 similar issues
   [94]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-tglb2/igt@kms_chamelium@hdmi-edid-read.html

  * igt@kms_chamelium@hdmi-hpd-storm-disable:
    - shard-skl:          NOTRUN -> [SKIP][95] ([fdo#109271] / [fdo#111827]) +11 similar issues
   [95]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-skl1/igt@kms_chamelium@hdmi-hpd-storm-disable.html

  * igt@kms_chamelium@vga-hpd:
    - shard-apl:          NOTRUN -> [SKIP][96] ([fdo#109271] / [fdo#111827]) +5 similar issues
   [96]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-apl2/igt@kms_chamelium@vga-hpd.html

  * igt@kms_color_chamelium@pipe-b-ctm-max:
    - shard-kbl:          NOTRUN -> [SKIP][97] ([fdo#109271] / [fdo#111827]) +1 similar issue
   [97]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-kbl6/igt@kms_color_chamelium@pipe-b-ctm-max.html

  * igt@kms_color_chamelium@pipe-c-ctm-0-5:
    - shard-iclb:         NOTRUN -> [SKIP][98] ([fdo#109284] / [fdo#111827]) +2 similar issues
   [98]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-iclb8/igt@kms_color_chamelium@pipe-c-ctm-0-5.html

  * igt@kms_color_chamelium@pipe-c-ctm-max:
    - shard-snb:          NOTRUN -> [SKIP][99] ([fdo#109271] / [fdo#111827])
   [99]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-snb7/igt@kms_color_chamelium@pipe-c-ctm-max.html

  * igt@kms_color_chamelium@pipe-d-degamma:
    - shard-iclb:         NOTRUN -> [SKIP][100] ([fdo#109278] / [fdo#109284] / [fdo#111827])
   [100]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-iclb7/igt@kms_color_chamelium@pipe-d-degamma.html

  * igt@kms_content_protection@legacy:
    - shard-apl:          NOTRUN -> [TIMEOUT][101] ([i915#1319])
   [101]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-apl3/igt@kms_content_protection@legacy.html

  * igt@kms_content_protection@uevent:
    - shard-kbl:          NOTRUN -> [FAIL][102] ([i915#2105])
   [102]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-kbl6/igt@kms_content_protection@uevent.html

  * igt@kms_cursor_crc@pipe-b-cursor-32x10-rapid-movement:
    - shard-iclb:         NOTRUN -> [SKIP][103] ([fdo#109278]) +14 similar issues
   [103]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-iclb7/igt@kms_cursor_crc@pipe-b-cursor-32x10-rapid-movement.html

  * igt@kms_cursor_crc@pipe-c-cursor-512x170-sliding:
    - shard-tglb:         NOTRUN -> [SKIP][104] ([fdo#109279] / [i915#3359] / [i915#5691])
   [104]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-tglb2/igt@kms_cursor_crc@pipe-c-cursor-512x170-sliding.html

  * igt@kms_cursor_crc@pipe-c-cursor-max-size-onscreen:
    - shard-tglb:         NOTRUN -> [SKIP][105] ([i915#3359]) +1 similar issue
   [105]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-tglb2/igt@kms_cursor_crc@pipe-c-cursor-max-size-onscreen.html

  * igt@kms_cursor_legacy@cursora-vs-flipb-toggle:
    - shard-iclb:         NOTRUN -> [SKIP][106] ([fdo#109274] / [fdo#109278])
   [106]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-iclb8/igt@kms_cursor_legacy@cursora-vs-flipb-toggle.html

  * igt@kms_cursor_legacy@flip-vs-cursor-atomic-transitions-varying-size:
    - shard-glk:          [PASS][107] -> [FAIL][108] ([i915#2346] / [i915#533])
   [107]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11668/shard-glk7/igt@kms_cursor_legacy@flip-vs-cursor-atomic-transitions-varying-size.html
   [108]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-glk7/igt@kms_cursor_legacy@flip-vs-cursor-atomic-transitions-varying-size.html

  * igt@kms_cursor_legacy@pipe-d-torture-move:
    - shard-skl:          NOTRUN -> [SKIP][109] ([fdo#109271]) +117 similar issues
   [109]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-skl1/igt@kms_cursor_legacy@pipe-d-torture-move.html

  * igt@kms_cursor_legacy@short-busy-flip-before-cursor-toggle:
    - shard-snb:          NOTRUN -> [SKIP][110] ([fdo#109271]) +58 similar issues
   [110]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-snb7/igt@kms_cursor_legacy@short-busy-flip-before-cursor-toggle.html

  * igt@kms_draw_crc@draw-method-xrgb2101010-blt-4tiled:
    - shard-iclb:         NOTRUN -> [SKIP][111] ([i915#5287])
   [111]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-iclb8/igt@kms_draw_crc@draw-method-xrgb2101010-blt-4tiled.html

  * igt@kms_flip@2x-absolute-wf_vblank:
    - shard-iclb:         NOTRUN -> [SKIP][112] ([fdo#109274])
   [112]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-iclb7/igt@kms_flip@2x-absolute-wf_vblank.html

  * igt@kms_flip@2x-flip-vs-fences:
    - shard-tglb:         NOTRUN -> [SKIP][113] ([fdo#109274] / [fdo#111825]) +2 similar issues
   [113]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-tglb2/igt@kms_flip@2x-flip-vs-fences.html

  * igt@kms_flip@flip-vs-expired-vblank@b-edp1:
    - shard-skl:          [PASS][114] -> [FAIL][115] ([i915#79])
   [114]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11668/shard-skl1/igt@kms_flip@flip-vs-expired-vblank@b-edp1.html
   [115]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-skl10/igt@kms_flip@flip-vs-expired-vblank@b-edp1.html

  * igt@kms_flip@flip-vs-suspend@a-dp1:
    - shard-kbl:          [PASS][116] -> [INCOMPLETE][117] ([i915#3614])
   [116]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11668/shard-kbl7/igt@kms_flip@flip-vs-suspend@a-dp1.html
   [117]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-kbl4/igt@kms_flip@flip-vs-suspend@a-dp1.html

  * igt@kms_flip_scaled_crc@flip-32bpp-ytile-to-32bpp-ytileccs-downscaling:
    - shard-tglb:         NOTRUN -> [SKIP][118] ([i915#2587])
   [118]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-tglb2/igt@kms_flip_scaled_crc@flip-32bpp-ytile-to-32bpp-ytileccs-downscaling.html
    - shard-iclb:         [PASS][119] -> [SKIP][120] ([i915#3701]) +1 similar issue
   [119]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11668/shard-iclb4/igt@kms_flip_scaled_crc@flip-32bpp-ytile-to-32bpp-ytileccs-downscaling.html
   [120]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-iclb2/igt@kms_flip_scaled_crc@flip-32bpp-ytile-to-32bpp-ytileccs-downscaling.html

  * igt@kms_frontbuffer_tracking@fbc-2p-primscrn-pri-indfb-draw-render:
    - shard-iclb:         NOTRUN -> [SKIP][121] ([fdo#109280]) +13 similar issues
   [121]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-iclb7/igt@kms_frontbuffer_tracking@fbc-2p-primscrn-pri-indfb-draw-render.html

  * igt@kms_frontbuffer_tracking@fbc-2p-primscrn-pri-shrfb-draw-render:
    - shard-tglb:         NOTRUN -> [SKIP][122] ([fdo#109280] / [fdo#111825]) +4 similar issues
   [122]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-tglb2/igt@kms_frontbuffer_tracking@fbc-2p-primscrn-pri-shrfb-draw-render.html

  * igt@kms_plane@plane-panning-bottom-right-suspend@pipe-b-planes:
    - shard-apl:          [PASS][123] -> [DMESG-WARN][124] ([i915#180]) +2 similar issues
   [123]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11668/shard-apl8/igt@kms_plane@plane-panning-bottom-right-suspend@pipe-b-planes.html
   [124]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-apl7/igt@kms_plane@plane-panning-bottom-right-suspend@pipe-b-planes.html

  * igt@kms_plane_alpha_blend@pipe-a-constant-alpha-min:
    - shard-skl:          NOTRUN -> [FAIL][125] ([fdo#108145] / [i915#265]) +1 similar issue
   [125]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-skl3/igt@kms_plane_alpha_blend@pipe-a-constant-alpha-min.html

  * igt@kms_plane_alpha_blend@pipe-c-coverage-7efc:
    - shard-skl:          [PASS][126] -> [FAIL][127] ([fdo#108145] / [i915#265]) +2 similar issues
   [126]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11668/shard-skl6/igt@kms_plane_alpha_blend@pipe-c-coverage-7efc.html
   [127]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-skl7/igt@kms_plane_alpha_blend@pipe-c-coverage-7efc.html

  * igt@kms_plane_lowres@pipe-c-tiling-none:
    - shard-tglb:         NOTRUN -> [SKIP][128] ([i915#3536])
   [128]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-tglb2/igt@kms_plane_lowres@pipe-c-tiling-none.html

  * igt@kms_plane_scaling@downscale-with-rotation-factor-0-25@pipe-c-edp-1-downscale-with-rotation:
    - shard-tglb:         NOTRUN -> [SKIP][129] ([i915#5176]) +3 similar issues
   [129]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-tglb2/igt@kms_plane_scaling@downscale-with-rotation-factor-0-25@pipe-c-edp-1-downscale-with-rotation.html

  * igt@kms_plane_scaling@planes-downscale-factor-0-5@pipe-a-edp-1-planes-downscale:
    - shard-iclb:         [PASS][130] -> [SKIP][131] ([i915#5235]) +2 similar issues
   [130]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11668/shard-iclb7/igt@kms_plane_scaling@planes-downscale-factor-0-5@pipe-a-edp-1-planes-downscale.html
   [131]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-iclb2/igt@kms_plane_scaling@planes-downscale-factor-0-5@pipe-a-edp-1-planes-downscale.html

  * igt@kms_psr2_su@page_flip-nv12:
    - shard-iclb:         NOTRUN -> [SKIP][132] ([fdo#109642] / [fdo#111068] / [i915#658])
   [132]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-iclb7/igt@kms_psr2_su@page_flip-nv12.html

  * igt@kms_psr2_su@page_flip-p010:
    - shard-skl:          NOTRUN -> [SKIP][133] ([fdo#109271] / [i915#658])
   [133]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-skl1/igt@kms_psr2_su@page_flip-p010.html

  * igt@kms_psr@psr2_cursor_render:
    - shard-iclb:         [PASS][134] -> [SKIP][135] ([fdo#109441])
   [134]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11668/shard-iclb2/igt@kms_psr@psr2_cursor_render.html
   [135]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-iclb4/igt@kms_psr@psr2_cursor_render.html

  * igt@kms_psr@psr2_sprite_mmap_gtt:
    - shard-tglb:         NOTRUN -> [FAIL][136] ([i915#132] / [i915#3467])
   [136]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-tglb2/igt@kms_psr@psr2_sprite_mmap_gtt.html

  * igt@kms_writeback@writeback-check-output:
    - shard-skl:          NOTRUN -> [SKIP][137] ([fdo#109271] / [i915#2437])
   [137]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-skl1/igt@kms_writeback@writeback-check-output.html

  * igt@kms_writeback@writeback-pixel-formats:
    - shard-iclb:         NOTRUN -> [SKIP][138] ([i915#2437]) +1 similar issue
   [138]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-iclb7/igt@kms_writeback@writeback-pixel-formats.html

  * igt@nouveau_crc@pipe-b-source-outp-inactive:
    - shard-iclb:         NOTRUN -> [SKIP][139] ([i915#2530])
   [139]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-iclb8/igt@nouveau_crc@pipe-b-source-outp-inactive.html

  * igt@perf@unprivileged-single-ctx-counters:
    - shard-tglb:         NOTRUN -> [SKIP][140] ([fdo#109289])
   [140]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-tglb2/igt@perf@unprivileged-single-ctx-counters.html

  * igt@prime_nv_api@i915_self_import_to_different_fd:
    - shard-tglb:         NOTRUN -> [SKIP][141] ([fdo#109291])
   [141]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-tglb2/igt@prime_nv_api@i915_self_import_to_different_fd.html

  * igt@prime_nv_api@nv_i915_reimport_twice_check_flink_name:
    - shard-iclb:         NOTRUN -> [SKIP][142] ([fdo#109291])
   [142]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-iclb7/igt@prime_nv_api@nv_i915_reimport_twice_check_flink_name.html

  * igt@prime_vgem@basic-userptr:
    - shard-iclb:         NOTRUN -> [SKIP][143] ([i915#3301])
   [143]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-iclb7/igt@prime_vgem@basic-userptr.html

  * igt@syncobj_timeline@invalid-transfer-non-existent-point:
    - shard-apl:          NOTRUN -> [DMESG-WARN][144] ([i915#5098])
   [144]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-apl3/igt@syncobj_timeline@invalid-transfer-non-existent-point.html
    - shard-skl:          NOTRUN -> [DMESG-WARN][145] ([i915#5098])
   [145]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-skl10/igt@syncobj_timeline@invalid-transfer-non-existent-point.html

  * igt@sysfs_clients@fair-1:
    - shard-iclb:         NOTRUN -> [SKIP][146] ([i915#2994])
   [146]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-iclb8/igt@sysfs_clients@fair-1.html

  * igt@sysfs_clients@sema-25:
    - shard-skl:          NOTRUN -> [SKIP][147] ([fdo#109271] / [i915#2994]) +1 similar issue
   [147]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-skl10/igt@sysfs_clients@sema-25.html
    - shard-apl:          NOTRUN -> [SKIP][148] ([fdo#109271] / [i915#2994])
   [148]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-apl3/igt@sysfs_clients@sema-25.html

  * igt@sysfs_clients@split-25:
    - shard-kbl:          NOTRUN -> [SKIP][149] ([fdo#109271] / [i915#2994])
   [149]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-kbl6/igt@sysfs_clients@split-25.html

  * igt@sysfs_heartbeat_interval@mixed@bcs0:
    - shard-skl:          [PASS][150] -> [WARN][151] ([i915#4055])
   [150]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11668/shard-skl8/igt@sysfs_heartbeat_interval@mixed@bcs0.html
   [151]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-skl1/igt@sysfs_heartbeat_interval@mixed@bcs0.html

  * igt@sysfs_heartbeat_interval@mixed@vcs0:
    - shard-skl:          [PASS][152] -> [FAIL][153] ([i915#1731])
   [152]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11668/shard-skl8/igt@sysfs_heartbeat_interval@mixed@vcs0.html
   [153]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-skl1/igt@sysfs_heartbeat_interval@mixed@vcs0.html

  
#### Possible fixes ####

  * igt@gem_ctx_isolation@preservation-s3@vecs0:
    - shard-skl:          [INCOMPLETE][154] ([i915#4939]) -> [PASS][155]
   [154]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11668/shard-skl4/igt@gem_ctx_isolation@preservation-s3@vecs0.html
   [155]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-skl1/igt@gem_ctx_isolation@preservation-s3@vecs0.html

  * igt@gem_eio@in-flight-contexts-1us:
    - {shard-tglu}:       [TIMEOUT][156] ([i915#3063]) -> [PASS][157]
   [156]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11668/shard-tglu-3/igt@gem_eio@in-flight-contexts-1us.html
   [157]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-tglu-4/igt@gem_eio@in-flight-contexts-1us.html

  * igt@gem_eio@unwedge-stress:
    - shard-iclb:         [TIMEOUT][158] ([i915#3070]) -> [PASS][159]
   [158]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11668/shard-iclb6/igt@gem_eio@unwedge-stress.html
   [159]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-iclb2/igt@gem_eio@unwedge-stress.html

  * igt@gem_exec_fair@basic-pace-solo@rcs0:
    - shard-tglb:         [FAIL][160] ([i915#2842]) -> [PASS][161]
   [160]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11668/shard-tglb5/igt@gem_exec_fair@basic-pace-solo@rcs0.html
   [161]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-tglb6/igt@gem_exec_fair@basic-pace-solo@rcs0.html

  * igt@gem_exec_fair@basic-pace@vecs0:
    - shard-kbl:          [FAIL][162] ([i915#2842]) -> [PASS][163] +1 similar issue
   [162]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11668/shard-kbl6/igt@gem_exec_fair@basic-pace@vecs0.html
   [163]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-kbl3/igt@gem_exec_fair@basic-pace@vecs0.html

  * igt@gem_exec_flush@basic-batch-kernel-default-uc:
    - shard-snb:          [SKIP][164] ([fdo#109271]) -> [PASS][165]
   [164]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11668/shard-snb6/igt@gem_exec_flush@basic-batch-kernel-default-uc.html
   [165]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-snb4/igt@gem_exec_flush@basic-batch-kernel-default-uc.html

  * igt@gen9_exec_parse@allowed-single:
    - shard-kbl:          [DMESG-WARN][166] ([i915#5566] / [i915#716]) -> [PASS][167]
   [166]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11668/shard-kbl4/igt@gen9_exec_parse@allowed-single.html
   [167]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-kbl6/igt@gen9_exec_parse@allowed-single.html

  * igt@i915_pm_rpm@cursor:
    - shard-iclb:         [SKIP][168] -> [PASS][169]
   [168]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11668/shard-iclb6/igt@i915_pm_rpm@cursor.html
   [169]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-iclb2/igt@i915_pm_rpm@cursor.html

  * igt@i915_selftest@live@gt_lrc:
    - shard-iclb:         [DMESG-WARN][170] ([i915#2867]) -> [PASS][171] +7 similar issues
   [170]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11668/shard-iclb6/igt@i915_selftest@live@gt_lrc.html
   [171]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-iclb2/igt@i915_selftest@live@gt_lrc.html

  * igt@i915_selftest@live@hangcheck:
    - shard-snb:          [INCOMPLETE][172] ([i915#3921]) -> [PASS][173]
   [172]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11668/shard-snb6/igt@i915_selftest@live@hangcheck.html
   [173]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-snb7/igt@i915_selftest@live@hangcheck.html
    - shard-tglb:         [DMESG-WARN][174] ([i915#5591]) -> [PASS][175]
   [174]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11668/shard-tglb8/igt@i915_selftest@live@hangcheck.html
   [175]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-tglb6/igt@i915_selftest@live@hangcheck.html

  * igt@i915_selftest@perf@engine_cs:
    - shard-tglb:         [DMESG-WARN][176] ([i915#2867]) -> [PASS][177] +2 similar issues
   [176]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11668/shard-tglb2/igt@i915_selftest@perf@engine_cs.html
   [177]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-tglb8/igt@i915_selftest@perf@engine_cs.html

  * igt@kms_cursor_legacy@cursor-vs-flip-toggle:
    - shard-iclb:         [FAIL][178] ([i915#5072]) -> [PASS][179]
   [178]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11668/shard-iclb7/igt@kms_cursor_legacy@cursor-vs-flip-toggle.html
   [179]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-iclb4/igt@kms_cursor_legacy@cursor-vs-flip-toggle.html

  * igt@kms_cursor_legacy@flip-vs-cursor-atomic-transitions:
    - shard-glk:          [FAIL][180] ([i915#2346]) -> [PASS][181]
   [180]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11668/shard-glk2/igt@kms_cursor_legacy@flip-vs-cursor-atomic-transitions.html
   [181]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-glk2/igt@kms_cursor_legacy@flip-vs-cursor-atomic-transitions.html

  * igt@kms_fbcon_fbt@fbc-suspend:
    - shard-apl:          [FAIL][182] ([i915#4767]) -> [PASS][183]
   [182]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11668/shard-apl1/igt@kms_fbcon_fbt@fbc-suspend.html
   [183]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-apl6/igt@kms_fbcon_fbt@fbc-suspend.html

  * igt@kms_flip@flip-vs-expired-vblank-interruptible@b-edp1:
    - shard-skl:          [FAIL][184] ([i915#79]) -> [PASS][185]
   [184]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11668/shard-skl8/igt@kms_flip@flip-vs-expired-vblank-interruptible@b-edp1.html
   [185]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-skl6/igt@kms_flip@flip-vs-expired-vblank-interruptible@b-edp1.html

  * igt@kms_flip@flip-vs-suspend@a-dp1:
    - shard-apl:          [DMESG-WARN][186] ([i915#180]) -> [PASS][187] +2 similar issues
   [186]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11668/shard-apl1/igt@kms_flip@flip-vs-suspend@a-dp1.html
   [187]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-apl7/igt@kms_flip@flip-vs-suspend@a-dp1.html

  * igt@kms_flip@flip-vs-suspend@a-edp1:
    - shard-skl:          [INCOMPLETE][188] ([i915#4839]) -> [PASS][189]
   [188]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11668/shard-skl9/igt@kms_flip@flip-vs-suspend@a-edp1.html
   [189]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-skl3/igt@kms_flip@flip-vs-suspend@a-edp1.html

  * igt@kms_flip_scaled_crc@flip-64bpp-ytile-to-16bpp-ytile-downscaling:
    - shard-iclb:         [SKIP][190] ([i915#3701]) -> [PASS][191]
   [190]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11668/shard-iclb2/igt@kms_flip_scaled_crc@flip-64bpp-ytile-to-16bpp-ytile-downscaling.html
   [191]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-iclb8/igt@kms_flip_scaled_crc@flip-64bpp-ytile-to-16bpp-ytile-downscaling.html

  * igt@kms_plane_alpha_blend@pipe-b-constant-alpha-min:
    - shard-skl:          [FAIL][192] ([fdo#108145] / [i915#265]) -> [PASS][193]
   [192]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11668/shard-skl8/igt@kms_plane_alpha_blend@pipe-b-constant-alpha-min.html
   [193]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-skl8/igt@kms_plane_alpha_blend@pipe-b-constant-alpha-min.html

  * igt@kms_psr2_su@page_flip-xrgb8888:
    - shard-iclb:         [SKIP][194] ([fdo#109642] / [fdo#111068] / [i915#658]) -> [PASS][195]
   [194]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11668/shard-iclb4/igt@kms_psr2_su@page_flip-xrgb8888.html
   [195]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-iclb2/igt@kms_psr2_su@page_flip-xrgb8888.html

  * igt@kms_psr@psr2_no_drrs:
    - shard-iclb:         [SKIP][196] ([fdo#109441]) -> [PASS][197] +2 similar issues
   [196]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11668/shard-iclb6/igt@kms_psr@psr2_no_drrs.html
   [197]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-iclb2/igt@kms_psr@psr2_no_drrs.html

  * igt@kms_vblank@pipe-a-ts-continuation-modeset-rpm:
    - shard-iclb:         [SKIP][198] ([fdo#109278]) -> [PASS][199]
   [198]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11668/shard-iclb6/igt@kms_vblank@pipe-a-ts-continuation-modeset-rpm.html
   [199]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-iclb2/igt@kms_vblank@pipe-a-ts-continuation-modeset-rpm.html

  * igt@perf@stress-open-close:
    - shard-apl:          [DMESG-FAIL][200] -> [PASS][201]
   [200]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11668/shard-apl1/igt@perf@stress-open-close.html
   [201]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-apl3/igt@perf@stress-open-close.html

  * igt@prime_self_import@reimport-vs-gem_close-race:
    - shard-tglb:         [FAIL][202] -> [PASS][203]
   [202]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11668/shard-tglb3/igt@prime_self_import@reimport-vs-gem_close-race.html
   [203]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-tglb2/igt@prime_self_import@reimport-vs-gem_close-race.html

  
#### Warnings ####

  * igt@gem_eio@unwedge-stress:
    - shard-tglb:         [FAIL][204] ([i915#5784]) -> [TIMEOUT][205] ([i915#3063])
   [204]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11668/shard-tglb8/igt@gem_eio@unwedge-stress.html
   [205]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-tglb6/igt@gem_eio@unwedge-stress.html

  * igt@gem_exec_balancer@parallel:
    - shard-iclb:         [SKIP][206] ([i915#4525]) -> [DMESG-WARN][207] ([i915#5614])
   [206]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11668/shard-iclb5/igt@gem_exec_balancer@parallel.html
   [207]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-iclb1/igt@gem_exec_balancer@parallel.html

  * igt@gem_exec_balancer@parallel-out-fence:
    - shard-iclb:         [DMESG-WARN][208] ([i915#5614]) -> [SKIP][209] ([i915#4525]) +1 similar issue
   [208]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11668/shard-iclb4/igt@gem_exec_balancer@parallel-out-fence.html
   [209]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-iclb7/igt@gem_exec_balancer@parallel-out-fence.html

  * igt@gem_exec_fair@basic-none-rrul@rcs0:
    - shard-iclb:         [FAIL][210] ([i915#2852]) -> [FAIL][211] ([i915#2842])
   [210]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11668/shard-iclb7/igt@gem_exec_fair@basic-none-rrul@rcs0.html
   [211]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-iclb4/igt@gem_exec_fair@basic-none-rrul@rcs0.html

  * igt@i915_pm_dc@dc3co-vpb-simulation:
    - shard-iclb:         [SKIP][212] ([i915#588]) -> [SKIP][213] ([i915#658])
   [212]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11668/shard-iclb2/igt@i915_pm_dc@dc3co-vpb-simulation.html
   [213]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-iclb4/igt@i915_pm_dc@dc3co-vpb-simulation.html

  * igt@i915_pm_rpm@modeset-non-lpsp:
    - shard-iclb:         [SKIP][214] -> [SKIP][215] ([fdo#110892])
   [214]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11668/shard-iclb6/igt@i915_pm_rpm@modeset-non-lpsp.html
   [215]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-iclb2/igt@i915_pm_rpm@modeset-non-lpsp.html

  * igt@kms_ccs@pipe-d-ccs-on-another-bo-y_tiled_gen12_rc_ccs_cc:
    - shard-skl:          [SKIP][216] ([fdo#109271] / [i915#1888]) -> [SKIP][217] ([fdo#109271])
   [216]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11668/shard-skl8/igt@kms_ccs@pipe-d-ccs-on-another-bo-y_tiled_gen12_rc_ccs_cc.html
   [217]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-skl6/igt@kms_ccs@pipe-d-ccs-on-another-bo-y_tiled_gen12_rc_ccs_cc.html

  * igt@kms_chamelium@dp-audio:
    - shard-skl:          [SKIP][218] ([fdo#109271] / [fdo#111827] / [i915#1888]) -> [SKIP][219] ([fdo#109271] / [fdo#111827])
   [218]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11668/shard-skl1/igt@kms_chamelium@dp-audio.html
   [219]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-skl4/igt@kms_chamelium@dp-audio.html

  * igt@kms_cursor_legacy@cursora-vs-flipb-varying-size:
    - shard-skl:          [SKIP][220] ([fdo#109271]) -> [SKIP][221] ([fdo#109271] / [i915#1888])
   [220]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11668/shard-skl4/igt@kms_cursor_legacy@cursora-vs-flipb-varying-size.html
   [221]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-skl10/igt@kms_cursor_legacy@cursora-vs-flipb-varying-size.html

  * igt@kms_psr2_sf@overlay-plane-update-continuous-sf:
    - shard-iclb:         [SKIP][222] ([i915#2920]) -> [SKIP][223] ([fdo#111068] / [i915#658])
   [222]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11668/shard-iclb2/igt@kms_psr2_sf@overlay-plane-update-continuous-sf.html
   [223]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-iclb8/igt@kms_psr2_sf@overlay-plane-update-continuous-sf.html

  * igt@kms_psr2_sf@primary-plane-update-sf-dmg-area:
    - shard-iclb:         [SKIP][224] ([fdo#111068] / [i915#658]) -> [SKIP][225] ([i915#2920])
   [224]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11668/shard-iclb6/igt@kms_psr2_sf@primary-plane-update-sf-dmg-area.html
   [225]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-iclb2/igt@kms_psr2_sf@primary-plane-update-sf-dmg-area.html

  * igt@runner@aborted:
    - shard-kbl:          ([FAIL][226], [FAIL][227], [FAIL][228], [FAIL][229], [FAIL][230], [FAIL][231], [FAIL][232], [FAIL][233], [FAIL][234], [FAIL][235]) ([fdo#109271] / [i915#3002] / [i915#4312] / [i915#5257] / [i915#716]) -> ([FAIL][236], [FAIL][237], [FAIL][238], [FAIL][239], [FAIL][240], [FAIL][241], [FAIL][242], [FAIL][243], [FAIL][244]) ([i915#3002] / [i915#4312] / [i915#5257])
   [226]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11668/shard-kbl6/igt@runner@aborted.html
   [227]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11668/shard-kbl6/igt@runner@aborted.html
   [228]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11668/shard-kbl4/igt@runner@aborted.html
   [229]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11668/shard-kbl4/igt@runner@aborted.html
   [230]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11668/shard-kbl3/igt@runner@aborted.html
   [231]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11668/shard-kbl3/igt@runner@aborted.html
   [232]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11668/shard-kbl3/igt@runner@aborted.html
   [233]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11668/shard-kbl1/igt@runner@aborted.html
   [234]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11668/shard-kbl1/igt@runner@aborted.html
   [235]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11668/shard-kbl3/igt@runner@aborted.html
   [236]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-kbl1/igt@runner@aborted.html
   [237]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-kbl3/igt@runner@aborted.html
   [238]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-kbl3/igt@runner@aborted.html
   [239]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-kbl1/igt@runner@aborted.html
   [240]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-kbl1/igt@runner@aborted.html
   [241]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-kbl6/igt@runner@aborted.html
   [242]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-kbl6/igt@runner@aborted.html
   [243]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-kbl4/igt@runner@aborted.html
   [244]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-kbl6/igt@runner@aborted.html
    - shard-apl:          ([FAIL][245], [FAIL][246], [FAIL][247], [FAIL][248], [FAIL][249], [FAIL][250]) ([i915#180] / [i915#3002] / [i915#4312] / [i915#5257]) -> ([FAIL][251], [FAIL][252], [FAIL][253], [FAIL][254], [FAIL][255], [FAIL][256], [FAIL][257]) ([fdo#109271] / [i915#180] / [i915#3002] / [i915#4312] / [i915#5257])
   [245]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11668/shard-apl1/igt@runner@aborted.html
   [246]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11668/shard-apl7/igt@runner@aborted.html
   [247]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11668/shard-apl4/igt@runner@aborted.html
   [248]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11668/shard-apl7/igt@runner@aborted.html
   [249]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11668/shard-apl3/igt@runner@aborted.html
   [250]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11668/shard-apl1/igt@runner@aborted.html
   [251]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-apl3/igt@runner@aborted.html
   [252]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-apl1/igt@runner@aborted.html
   [253]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-apl3/igt@runner@aborted.html
   [254]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-apl4/igt@runner@aborted.html
   [255]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-apl8/igt@runner@aborted.html
   [256]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-apl4/igt@runner@aborted.html
   [257]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-apl7/igt@runner@aborted.html
    - shard-skl:          ([FAIL][258], [FAIL][259], [FAIL][260]) ([i915#3002] / [i915#4312] / [i915#5257]) -> ([FAIL][261], [FAIL][262], [FAIL][263], [FAIL][264], [FAIL][265]) ([i915#2029] / [i915#3002] / [i915#4312] / [i915#5257])
   [258]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11668/shard-skl1/igt@runner@aborted.html
   [259]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11668/shard-skl10/igt@runner@aborted.html
   [260]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11668/shard-skl7/igt@runner@aborted.html
   [261]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-skl10/igt@runner@aborted.html
   [262]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-skl3/igt@runner@aborted.html
   [263]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-skl4/igt@runner@aborted.html
   [264]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-skl10/igt@runner@aborted.html
   [265]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/shard-skl6/igt@runner@aborted.html

  
  {name}: This element is suppressed. This means it is ignored when computing
          the status of the difference (SUCCESS, WARNING, or FAILURE).

  [fdo#108145]: https://bugs.freedesktop.org/show_bug.cgi?id=108145
  [fdo#109271]: https://bugs.freedesktop.org/show_bug.cgi?id=109271
  [fdo#109274]: https://bugs.freedesktop.org/show_bug.cgi?id=109274
  [fdo#109278]: https://bugs.freedesktop.org/show_bug.cgi?id=109278
  [fdo#109279]: https://bugs.freedesktop.org/show_bug.cgi?id=109279
  [fdo#109280]: https://bugs.freedesktop.org/show_bug.cgi?id=109280
  [fdo#109284]: https://bugs.freedesktop.org/show_bug.cgi?id=109284
  [fdo#109289]: https://bugs.freedesktop.org/show_bug.cgi?id=109289
  [fdo#109291]: https://bugs.freedesktop.org/show_bug.cgi?id=109291
  [fdo#109441]: https://bugs.freedesktop.org/show_bug.cgi?id=109441
  [fdo#109642]: https://bugs.freedesktop.org/show_bug.cgi?id=109642
  [fdo#110725]: https://bugs.freedesktop.org/show_bug.cgi?id=110725
  [fdo#110892]: https://bugs.freedesktop.org/show_bug.cgi?id=110892
  [fdo#111068]: https://bugs.freedesktop.org/show_bug.cgi?id=111068
  [fdo#111614]: https://bugs.freedesktop.org/show_bug.cgi?id=111614
  [fdo#111615]: https://bugs.freedesktop.org/show_bug.cgi?id=111615
  [fdo#111644]: https://bugs.freedesktop.org/show_bug.cgi?id=111644
  [fdo#111825]: https://bugs.freedesktop.org/show_bug.cgi?id=111825
  [fdo#111827]: https://bugs.freedesktop.org/show_bug.cgi?id=111827
  [i915#1319]: https://gitlab.freedesktop.org/drm/intel/issues/1319
  [i915#132]: https://gitlab.freedesktop.org/drm/intel/issues/132
  [i915#1397]: https://gitlab.freedesktop.org/drm/intel/issues/1397
  [i915#1731]: https://gitlab.freedesktop.org/drm/intel/issues/1731
  [i915#1769]: https://gitlab.freedesktop.org/drm/intel/issues/1769
  [i915#180]: https://gitlab.freedesktop.org/drm/intel/issues/180
  [i915#1888]: https://gitlab.freedesktop.org/drm/intel/issues/1888
  [i915#2029]: https://gitlab.freedesktop.org/drm/intel/issues/2029
  [i915#2105]: https://gitlab.freedesktop.org/drm/intel/issues/2105
  [i915#2346]: https://gitlab.freedesktop.org/drm/intel/issues/2346
  [i915#2411]: https://gitlab.freedesktop.org/drm/intel/issues/2411
  [i915#2437]: https://gitlab.freedesktop.org/drm/intel/issues/2437
  [i915#2527]: https://gitlab.freedesktop.org/drm/intel/issues/2527
  [i915#2530]: https://gitlab.freedesktop.org/drm/intel/issues/2530
  [i915#2587]: https://gitlab.freedesktop.org/drm/intel/issues/2587
  [i915#265]: https://gitlab.freedesktop.org/drm/intel/issues/265
  [i915#2681]: https://gitlab.freedesktop.org/drm/intel/issues/2681
  [i915#2684]: https://gitlab.freedesktop.org/drm/intel/issues/2684
  [i915#2842]: https://gitlab.freedesktop.org/drm/intel/issues/2842
  [i915#2849]: https://gitlab.freedesktop.org/drm/intel/issues/2849
  [i915#2852]: https://gitlab.freedesktop.org/drm/intel/issues/2852
  [i915#2856]: https://gitlab.freedesktop.org/drm/intel/issues/2856
  [i915#2867]: https://gitlab.freedesktop.org/drm/intel/issues/2867
  [i915#2920]: https://gitlab.freedesktop.org/drm/intel/issues/2920
  [i915#2994]: https://gitlab.freedesktop.org/drm/intel/issues/2994
  [i915#3002]: https://gitlab.freedesktop.org/drm/intel/issues/3002
  [i915#3063]: https://gitlab.freedesktop.org/drm/intel/issues/3063
  [i915#3070]: https://gitlab.freedesktop.org/drm/intel/issues/3070
  [i915#3297]: https://gitlab.freedesktop.org/drm/intel/issues/3297
  [i915#3301]: https://gitlab.freedesktop.org/drm/intel/issues/3301
  [i915#3318]: https://gitlab.freedesktop.org/drm/intel/issues/3318
  [i915#3323]: https://gitlab.freedesktop.org/drm/intel/issues/3323
  [i915#3354]: https://gitlab.freedesktop.org/drm/intel/issues/3354
  [i915#3359]: https://gitlab.freedesktop.org/drm/intel/issues/3359
  [i915#3467]: https://gitlab.freedesktop.org/drm/intel/issues/3467
  [i915#3536]: https://gitlab.freedesktop.org/drm/intel/issues/3536
  [i915#3591]: https://gitlab.freedesktop.org/drm/intel/issues/3591
  [i915#3614]: https://gitlab.freedesktop.org/drm/intel/issues/3614
  [i915#3689]: https://gitlab.freedesktop.org/drm/intel/issues/3689
  [i915#3701]: https://gitlab.freedesktop.org/drm/intel/issues/3701
  [i915#3886]: https://gitlab.freedesktop.org/drm/intel/issues/3886
  [i915#3921]: https://gitlab.freedesktop.org/drm/intel/issues/3921
  [i915#4055]: https://gitlab.freedesktop.org/drm/intel/issues/4055
  [i915#4270]: https://gitlab.freedesktop.org/drm/intel/issues/4270
  [i915#4281]: https://gitlab.freedesktop.org/drm/intel/issues/4281
  [i915#4312]: https://gitlab.freedesktop.org/drm/intel/issues/4312
  [i915#4386]: https://gitlab.freedesktop.org/drm/intel/issues/4386
  [i915#4525]: https://gitlab.freedesktop.org/drm/intel/issues/4525
  [i915#454]: https://gitlab.freedesktop.org/drm/intel/issues/454
  [i915#4613]: https://gitlab.freedesktop.org/drm/intel/issues/4613
  [i915#4767]: https://gitlab.freedesktop.org/drm/intel/issues/4767
  [i915#4839]: https://gitlab.freedesktop.org/drm/intel/issues/4839
  [i915#4939]: https://gitlab.freedesktop.org/drm/intel/issues/4939
  [i915#5072]: https://gitlab.freedesktop.org/drm/intel/issues/5072
  [i915#5098]: https://gitlab.freedesktop.org/drm/intel/issues/5098
  [i915#5161]: https://gitlab.freedesktop.org/drm/intel/issues/5161
  [i915#5176]: https://gitlab.freedesktop.org/drm/intel/issues/5176
  [i915#5235]: https://gitlab.freedesktop.org/drm/intel/issues/5235
  [i915#5257]: https://gitlab.freedesktop.org/drm/intel/issues/5257
  [i915#5286]: https://gitlab.freedesktop.org/drm/intel/issues/5286
  [i915#5287]: https://gitlab.freedesktop.org/drm/intel/issues/5287
  [i915#5327]: https://gitlab.freedesktop.org/drm/intel/issues/5327
  [i915#533]: https://gitlab.freedesktop.org/drm/intel/issues/533
  [i915#5566]: https://gitlab.freedesktop.org/drm/intel/issues/5566
  [i915#5591]: https://gitlab.freedesktop.org/drm/intel/issues/5591
  [i915#5614]: https://gitlab.freedesktop.org/drm/intel/issues/5614
  [i915#5691]: https://gitlab.freedesktop.org/drm/intel/issues/5691
  [i915#5784]: https://gitlab.freedesktop.org/drm/intel/issues/5784
  [i915#588]: https://gitlab.freedesktop.org/drm/intel/issues/588
  [i915#658]: https://gitlab.freedesktop.org/drm/intel/issues/658
  [i915#716]: https://gitlab.freedesktop.org/drm/intel/issues/716
  [i915#768]: https://gitlab.freedesktop.org/drm/intel/issues/768
  [i915#79]: https://gitlab.freedesktop.org/drm/intel/issues/79


Build changes
-------------

  * Linux: CI_DRM_11668 -> Patchwork_93447v3

  CI-20190529: 20190529
  CI_DRM_11668: 0aeb4ff42e2e9fd1dee49e6bb79cc81c8eafd3fc @ git://anongit.freedesktop.org/gfx-ci/linux
  IGT_6477: 70cfef35851891aeaa829f5e8dcb7fd43b454bde @ https://gitlab.freedesktop.org/drm/igt-gpu-tools.git
  Patchwork_93447v3: 0aeb4ff42e2e9fd1dee49e6bb79cc81c8eafd3fc @ git://anongit.freedesktop.org/gfx-ci/linux
  piglit_4509: fdc5a4ca11124ab8413c7988896eec4c97336694 @ git://anongit.freedesktop.org/piglit

== Logs ==

For more details see: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_93447v3/index.html

[-- Attachment #2: Type: text/html, Size: 66688 bytes --]

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [RFC v3 1/3] drm/doc/rfc: VM_BIND feature design document
  2022-05-17 18:32   ` Niranjana Vishwanathapura
@ 2022-05-19 22:52     ` Zanoni, Paulo R
  -1 siblings, 0 replies; 121+ messages in thread
From: Zanoni, Paulo R @ 2022-05-19 22:52 UTC (permalink / raw)
  To: dri-devel, Vetter, Daniel, Vishwanathapura, Niranjana, intel-gfx
  Cc: Brost, Matthew, Hellstrom, Thomas, Wilson, Chris P, jason,
	christian.koenig

On Tue, 2022-05-17 at 11:32 -0700, Niranjana Vishwanathapura wrote:
> VM_BIND design document with description of intended use cases.
> 
> v2: Add more documentation and format as per review comments
>     from Daniel.
> 
> Signed-off-by: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>
> ---
> 
> diff --git a/Documentation/gpu/rfc/i915_vm_bind.rst b/Documentation/gpu/rfc/i915_vm_bind.rst
> new file mode 100644
> index 000000000000..f1be560d313c
> --- /dev/null
> +++ b/Documentation/gpu/rfc/i915_vm_bind.rst
> @@ -0,0 +1,304 @@
> +==========================================
> +I915 VM_BIND feature design and use cases
> +==========================================
> +
> +VM_BIND feature
> +================
> +DRM_I915_GEM_VM_BIND/UNBIND ioctls allows UMD to bind/unbind GEM buffer
> +objects (BOs) or sections of a BOs at specified GPU virtual addresses on a
> +specified address space (VM). These mappings (also referred to as persistent
> +mappings) will be persistent across multiple GPU submissions (execbuff calls)
> +issued by the UMD, without user having to provide a list of all required
> +mappings during each submission (as required by older execbuff mode).
> +
> +VM_BIND/UNBIND ioctls will support 'in' and 'out' fences to allow userpace
> +to specify how the binding/unbinding should sync with other operations
> +like the GPU job submission. These fences will be timeline 'drm_syncobj's
> +for non-Compute contexts (See struct drm_i915_vm_bind_ext_timeline_fences).
> +For Compute contexts, they will be user/memory fences (See struct
> +drm_i915_vm_bind_ext_user_fence).
> +
> +VM_BIND feature is advertised to user via I915_PARAM_HAS_VM_BIND.
> +User has to opt-in for VM_BIND mode of binding for an address space (VM)
> +during VM creation time via I915_VM_CREATE_FLAGS_USE_VM_BIND extension.
> +
> +VM_BIND/UNBIND ioctl will immediately start binding/unbinding the mapping in an
> +async worker. The binding and unbinding will work like a special GPU engine.
> +The binding and unbinding operations are serialized and will wait on specified
> +input fences before the operation and will signal the output fences upon the
> +completion of the operation. Due to serialization, completion of an operation
> +will also indicate that all previous operations are also complete.
> +
> +VM_BIND features include:
> +
> +* Multiple Virtual Address (VA) mappings can map to the same physical pages
> +  of an object (aliasing).
> +* VA mapping can map to a partial section of the BO (partial binding).
> +* Support capture of persistent mappings in the dump upon GPU error.
> +* TLB is flushed upon unbind completion. Batching of TLB flushes in some
> +  use cases will be helpful.
> +* Asynchronous vm_bind and vm_unbind support with 'in' and 'out' fences.
> +* Support for userptr gem objects (no special uapi is required for this).
> +
> +Execbuff ioctl in VM_BIND mode
> +-------------------------------
> +The execbuff ioctl handling in VM_BIND mode differs significantly from the
> +older method. A VM in VM_BIND mode will not support older execbuff mode of
> +binding. In VM_BIND mode, execbuff ioctl will not accept any execlist. Hence,
> +no support for implicit sync. It is expected that the below work will be able
> +to support requirements of object dependency setting in all use cases:
> +
> +"dma-buf: Add an API for exporting sync files"
> +(https://lwn.net/Articles/859290/)

I would really like to have more details here. The link provided points
to new ioctls and we're not very familiar with those yet, so I think
you should really clarify the interaction between the new additions
here. Having some sample code would be really nice too.

For Mesa at least (and I believe for the other drivers too) we always
have a few exported buffers in every execbuf call, and we rely on the
implicit synchronization provided by execbuf to make sure everything
works. The execbuf ioctl also has some code to flush caches during
implicit synchronization AFAIR, so I would guess we rely on it too and
whatever else the Kernel does. Is that covered by the new ioctls?

In addition, as far as I remember, one of the big improvements of
vm_bind was that it would help reduce ioctl latency and cpu overhead.
But if making execbuf faster comes at the cost of requiring additional
ioctls calls for implicit synchronization, which is required on ever
execbuf call, then I wonder if we'll even get any faster at all.
Comparing old execbuf vs plain new execbuf without the new required
ioctls won't make sense.

But maybe I'm wrong and we won't need to call these new ioctls around
every single execbuf ioctl we submit? Again, more clarification and
some code examples here would be really nice. This is a big change on
an important part of the API, we should clarify the new expected usage.

> +
> +This also means, we need an execbuff extension to pass in the batch
> +buffer addresses (See struct drm_i915_gem_execbuffer_ext_batch_addresses).
> +
> +If at all execlist support in execbuff ioctl is deemed necessary for
> +implicit sync in certain use cases, then support can be added later.

IMHO we really need to sort this and check all the assumptions before
we commit to any interface. Again, implicit synchronization is
something we rely on during *every* execbuf ioctl for most workloads.


> +In VM_BIND mode, VA allocation is completely managed by the user instead of
> +the i915 driver. Hence all VA assignment, eviction are not applicable in
> +VM_BIND mode. Also, for determining object activeness, VM_BIND mode will not
> +be using the i915_vma active reference tracking. It will instead use dma-resv
> +object for that (See `VM_BIND dma_resv usage`_).
> +
> +So, a lot of existing code in the execbuff path like relocations, VA evictions,
> +vma lookup table, implicit sync, vma active reference tracking etc., are not
> +applicable in VM_BIND mode. Hence, the execbuff path needs to be cleaned up
> +by clearly separating out the functionalities where the VM_BIND mode differs
> +from older method and they should be moved to separate files.

I seem to recall some conversations where we were told a bunch of
ioctls would stop working or make no sense to call when using vm_bind.
Can we please get a complete list of those? Bonus points if the Kernel
starts telling us we just called something that makes no sense.

> +
> +VM_PRIVATE objects
> +-------------------
> +By default, BOs can be mapped on multiple VMs and can also be dma-buf
> +exported. Hence these BOs are referred to as Shared BOs.
> +During each execbuff submission, the request fence must be added to the
> +dma-resv fence list of all shared BOs mapped on the VM.
> +
> +VM_BIND feature introduces an optimization where user can create BO which
> +is private to a specified VM via I915_GEM_CREATE_EXT_VM_PRIVATE flag during
> +BO creation. Unlike Shared BOs, these VM private BOs can only be mapped on
> +the VM they are private to and can't be dma-buf exported.
> +All private BOs of a VM share the dma-resv object. Hence during each execbuff
> +submission, they need only one dma-resv fence list updated. Thus, the fast
> +path (where required mappings are already bound) submission latency is O(1)
> +w.r.t the number of VM private BOs.

I know we already discussed this, but just to document it publicly: the
ideal case for user space would be that every BO is created as private
but then we'd have an ioctl to convert it to non-private (without the
need to have a non-private->private interface).

An explanation on why we can't have an ioctl to mark as exported a
buffer that was previously vm_private would be really appreciated.

Thanks,
Paulo


> +
> +VM_BIND locking hirarchy
> +-------------------------
> +The locking design here supports the older (execlist based) execbuff mode, the
> +newer VM_BIND mode, the VM_BIND mode with GPU page faults and possible future
> +system allocator support (See `Shared Virtual Memory (SVM) support`_).
> +The older execbuff mode and the newer VM_BIND mode without page faults manages
> +residency of backing storage using dma_fence. The VM_BIND mode with page faults
> +and the system allocator support do not use any dma_fence at all.
> +
> +VM_BIND locking order is as below.
> +
> +1) Lock-A: A vm_bind mutex will protect vm_bind lists. This lock is taken in
> +   vm_bind/vm_unbind ioctl calls, in the execbuff path and while releasing the
> +   mapping.
> +
> +   In future, when GPU page faults are supported, we can potentially use a
> +   rwsem instead, so that multiple page fault handlers can take the read side
> +   lock to lookup the mapping and hence can run in parallel.
> +   The older execbuff mode of binding do not need this lock.
> +
> +2) Lock-B: The object's dma-resv lock will protect i915_vma state and needs to
> +   be held while binding/unbinding a vma in the async worker and while updating
> +   dma-resv fence list of an object. Note that private BOs of a VM will all
> +   share a dma-resv object.
> +
> +   The future system allocator support will use the HMM prescribed locking
> +   instead.
> +
> +3) Lock-C: Spinlock/s to protect some of the VM's lists like the list of
> +   invalidated vmas (due to eviction and userptr invalidation) etc.
> +
> +When GPU page faults are supported, the execbuff path do not take any of these
> +locks. There we will simply smash the new batch buffer address into the ring and
> +then tell the scheduler run that. The lock taking only happens from the page
> +fault handler, where we take lock-A in read mode, whichever lock-B we need to
> +find the backing storage (dma_resv lock for gem objects, and hmm/core mm for
> +system allocator) and some additional locks (lock-D) for taking care of page
> +table races. Page fault mode should not need to ever manipulate the vm lists,
> +so won't ever need lock-C.
> +
> +VM_BIND LRU handling
> +---------------------
> +We need to ensure VM_BIND mapped objects are properly LRU tagged to avoid
> +performance degradation. We will also need support for bulk LRU movement of
> +VM_BIND objects to avoid additional latencies in execbuff path.
> +
> +The page table pages are similar to VM_BIND mapped objects (See
> +`Evictable page table allocations`_) and are maintained per VM and needs to
> +be pinned in memory when VM is made active (ie., upon an execbuff call with
> +that VM). So, bulk LRU movement of page table pages is also needed.
> +
> +The i915 shrinker LRU has stopped being an LRU. So, it should also be moved
> +over to the ttm LRU in some fashion to make sure we once again have a reasonable
> +and consistent memory aging and reclaim architecture.
> +
> +VM_BIND dma_resv usage
> +-----------------------
> +Fences needs to be added to all VM_BIND mapped objects. During each execbuff
> +submission, they are added with DMA_RESV_USAGE_BOOKKEEP usage to prevent
> +over sync (See enum dma_resv_usage). One can override it with either
> +DMA_RESV_USAGE_READ or DMA_RESV_USAGE_WRITE usage during object dependency
> +setting (either through explicit or implicit mechanism).
> +
> +When vm_bind is called for a non-private object while the VM is already
> +active, the fences need to be copied from VM's shared dma-resv object
> +(common to all private objects of the VM) to this non-private object.
> +If this results in performance degradation, then some optimization will
> +be needed here. This is not a problem for VM's private objects as they use
> +shared dma-resv object which is always updated on each execbuff submission.
> +
> +Also, in VM_BIND mode, use dma-resv apis for determining object activeness
> +(See dma_resv_test_signaled() and dma_resv_wait_timeout()) and do not use the
> +older i915_vma active reference tracking which is deprecated. This should be
> +easier to get it working with the current TTM backend. We can remove the
> +i915_vma active reference tracking fully while supporting TTM backend for igfx.
> +
> +Evictable page table allocations
> +---------------------------------
> +Make pagetable allocations evictable and manage them similar to VM_BIND
> +mapped objects. Page table pages are similar to persistent mappings of a
> +VM (difference here are that the page table pages will not have an i915_vma
> +structure and after swapping pages back in, parent page link needs to be
> +updated).
> +
> +Mesa use case
> +--------------
> +VM_BIND can potentially reduce the CPU overhead in Mesa (both Vulkan and Iris),
> +hence improving performance of CPU-bound applications. It also allows us to
> +implement Vulkan's Sparse Resources. With increasing GPU hardware performance,
> +reducing CPU overhead becomes more impactful.
> +
> +
> +VM_BIND Compute support
> +========================
> +
> +User/Memory Fence
> +------------------
> +The idea is to take a user specified virtual address and install an interrupt
> +handler to wake up the current task when the memory location passes the user
> +supplied filter. User/Memory fence is a <address, value> pair. To signal the
> +user fence, specified value will be written at the specified virtual address
> +and wakeup the waiting process. User can wait on a user fence with the
> +gem_wait_user_fence ioctl.
> +
> +It also allows the user to emit their own MI_FLUSH/PIPE_CONTROL notify
> +interrupt within their batches after updating the value to have sub-batch
> +precision on the wakeup. Each batch can signal a user fence to indicate
> +the completion of next level batch. The completion of very first level batch
> +needs to be signaled by the command streamer. The user must provide the
> +user/memory fence for this via the DRM_I915_GEM_EXECBUFFER_EXT_USER_FENCE
> +extension of execbuff ioctl, so that KMD can setup the command streamer to
> +signal it.
> +
> +User/Memory fence can also be supplied to the kernel driver to signal/wake up
> +the user process after completion of an asynchronous operation.
> +
> +When VM_BIND ioctl was provided with a user/memory fence via the
> +I915_VM_BIND_EXT_USER_FENCE extension, it will be signaled upon the completion
> +of binding of that mapping. All async binds/unbinds are serialized, hence
> +signaling of user/memory fence also indicate the completion of all previous
> +binds/unbinds.
> +
> +This feature will be derived from the below original work:
> +https://patchwork.freedesktop.org/patch/349417/
> +
> +Long running Compute contexts
> +------------------------------
> +Usage of dma-fence expects that they complete in reasonable amount of time.
> +Compute on the other hand can be long running. Hence it is appropriate for
> +compute to use user/memory fence and dma-fence usage will be limited to
> +in-kernel consumption only. This requires an execbuff uapi extension to pass
> +in user fence (See struct drm_i915_vm_bind_ext_user_fence). Compute must opt-in
> +for this mechanism with I915_CONTEXT_CREATE_FLAGS_LONG_RUNNING flag during
> +context creation. The dma-fence based user interfaces like gem_wait ioctl and
> +execbuff out fence are not allowed on long running contexts. Implicit sync is
> +not valid as well and is anyway not supported in VM_BIND mode.
> +
> +Where GPU page faults are not available, kernel driver upon buffer invalidation
> +will initiate a suspend (preemption) of long running context with a dma-fence
> +attached to it. And upon completion of that suspend fence, finish the
> +invalidation, revalidate the BO and then resume the compute context. This is
> +done by having a per-context preempt fence (also called suspend fence) proxying
> +as i915_request fence. This suspend fence is enabled when someone tries to wait
> +on it, which then triggers the context preemption.
> +
> +As this support for context suspension using a preempt fence and the resume work
> +for the compute mode contexts can get tricky to get it right, it is better to
> +add this support in drm scheduler so that multiple drivers can make use of it.
> +That means, it will have a dependency on i915 drm scheduler conversion with GuC
> +scheduler backend. This should be fine, as the plan is to support compute mode
> +contexts only with GuC scheduler backend (at least initially). This is much
> +easier to support with VM_BIND mode compared to the current heavier execbuff
> +path resource attachment.
> +
> +Low Latency Submission
> +-----------------------
> +Allows compute UMD to directly submit GPU jobs instead of through execbuff
> +ioctl. This is made possible by VM_BIND is not being synchronized against
> +execbuff. VM_BIND allows bind/unbind of mappings required for the directly
> +submitted jobs.
> +
> +Other VM_BIND use cases
> +========================
> +
> +Debugger
> +---------
> +With debug event interface user space process (debugger) is able to keep track
> +of and act upon resources created by another process (debugged) and attached
> +to GPU via vm_bind interface.
> +
> +GPU page faults
> +----------------
> +GPU page faults when supported (in future), will only be supported in the
> +VM_BIND mode. While both the older execbuff mode and the newer VM_BIND mode of
> +binding will require using dma-fence to ensure residency, the GPU page faults
> +mode when supported, will not use any dma-fence as residency is purely managed
> +by installing and removing/invalidating page table entries.
> +
> +Page level hints settings
> +--------------------------
> +VM_BIND allows any hints setting per mapping instead of per BO.
> +Possible hints include read-only mapping, placement and atomicity.
> +Sub-BO level placement hint will be even more relevant with
> +upcoming GPU on-demand page fault support.
> +
> +Page level Cache/CLOS settings
> +-------------------------------
> +VM_BIND allows cache/CLOS settings per mapping instead of per BO.
> +
> +Shared Virtual Memory (SVM) support
> +------------------------------------
> +VM_BIND interface can be used to map system memory directly (without gem BO
> +abstraction) using the HMM interface. SVM is only supported with GPU page
> +faults enabled.
> +
> +
> +Broder i915 cleanups
> +=====================
> +Supporting this whole new vm_bind mode of binding which comes with its own
> +use cases to support and the locking requirements requires proper integration
> +with the existing i915 driver. This calls for some broader i915 driver
> +cleanups/simplifications for maintainability of the driver going forward.
> +Here are few things identified and are being looked into.
> +
> +- Remove vma lookup cache (eb->gem_context->handles_vma). VM_BIND feature
> +  do not use it and complexity it brings in is probably more than the
> +  performance advantage we get in legacy execbuff case.
> +- Remove vma->open_count counting
> +- Remove i915_vma active reference tracking. VM_BIND feature will not be using
> +  it. Instead use underlying BO's dma-resv fence list to determine if a i915_vma
> +  is active or not.
> +
> +
> +VM_BIND UAPI
> +=============
> +
> +.. kernel-doc:: Documentation/gpu/rfc/i915_vm_bind.h
> diff --git a/Documentation/gpu/rfc/index.rst b/Documentation/gpu/rfc/index.rst
> index 91e93a705230..7d10c36b268d 100644
> --- a/Documentation/gpu/rfc/index.rst
> +++ b/Documentation/gpu/rfc/index.rst
> @@ -23,3 +23,7 @@ host such documentation:
>  .. toctree::
>  
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
>      i915_scheduler.rst
> +
> +.. toctree::
> +
> +    i915_vm_bind.rst


^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [Intel-gfx] [RFC v3 1/3] drm/doc/rfc: VM_BIND feature design document
@ 2022-05-19 22:52     ` Zanoni, Paulo R
  0 siblings, 0 replies; 121+ messages in thread
From: Zanoni, Paulo R @ 2022-05-19 22:52 UTC (permalink / raw)
  To: dri-devel, Vetter, Daniel, Vishwanathapura, Niranjana, intel-gfx
  Cc: Hellstrom, Thomas, Wilson, Chris P, christian.koenig

On Tue, 2022-05-17 at 11:32 -0700, Niranjana Vishwanathapura wrote:
> VM_BIND design document with description of intended use cases.
> 
> v2: Add more documentation and format as per review comments
>     from Daniel.
> 
> Signed-off-by: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>
> ---
> 
> diff --git a/Documentation/gpu/rfc/i915_vm_bind.rst b/Documentation/gpu/rfc/i915_vm_bind.rst
> new file mode 100644
> index 000000000000..f1be560d313c
> --- /dev/null
> +++ b/Documentation/gpu/rfc/i915_vm_bind.rst
> @@ -0,0 +1,304 @@
> +==========================================
> +I915 VM_BIND feature design and use cases
> +==========================================
> +
> +VM_BIND feature
> +================
> +DRM_I915_GEM_VM_BIND/UNBIND ioctls allows UMD to bind/unbind GEM buffer
> +objects (BOs) or sections of a BOs at specified GPU virtual addresses on a
> +specified address space (VM). These mappings (also referred to as persistent
> +mappings) will be persistent across multiple GPU submissions (execbuff calls)
> +issued by the UMD, without user having to provide a list of all required
> +mappings during each submission (as required by older execbuff mode).
> +
> +VM_BIND/UNBIND ioctls will support 'in' and 'out' fences to allow userpace
> +to specify how the binding/unbinding should sync with other operations
> +like the GPU job submission. These fences will be timeline 'drm_syncobj's
> +for non-Compute contexts (See struct drm_i915_vm_bind_ext_timeline_fences).
> +For Compute contexts, they will be user/memory fences (See struct
> +drm_i915_vm_bind_ext_user_fence).
> +
> +VM_BIND feature is advertised to user via I915_PARAM_HAS_VM_BIND.
> +User has to opt-in for VM_BIND mode of binding for an address space (VM)
> +during VM creation time via I915_VM_CREATE_FLAGS_USE_VM_BIND extension.
> +
> +VM_BIND/UNBIND ioctl will immediately start binding/unbinding the mapping in an
> +async worker. The binding and unbinding will work like a special GPU engine.
> +The binding and unbinding operations are serialized and will wait on specified
> +input fences before the operation and will signal the output fences upon the
> +completion of the operation. Due to serialization, completion of an operation
> +will also indicate that all previous operations are also complete.
> +
> +VM_BIND features include:
> +
> +* Multiple Virtual Address (VA) mappings can map to the same physical pages
> +  of an object (aliasing).
> +* VA mapping can map to a partial section of the BO (partial binding).
> +* Support capture of persistent mappings in the dump upon GPU error.
> +* TLB is flushed upon unbind completion. Batching of TLB flushes in some
> +  use cases will be helpful.
> +* Asynchronous vm_bind and vm_unbind support with 'in' and 'out' fences.
> +* Support for userptr gem objects (no special uapi is required for this).
> +
> +Execbuff ioctl in VM_BIND mode
> +-------------------------------
> +The execbuff ioctl handling in VM_BIND mode differs significantly from the
> +older method. A VM in VM_BIND mode will not support older execbuff mode of
> +binding. In VM_BIND mode, execbuff ioctl will not accept any execlist. Hence,
> +no support for implicit sync. It is expected that the below work will be able
> +to support requirements of object dependency setting in all use cases:
> +
> +"dma-buf: Add an API for exporting sync files"
> +(https://lwn.net/Articles/859290/)

I would really like to have more details here. The link provided points
to new ioctls and we're not very familiar with those yet, so I think
you should really clarify the interaction between the new additions
here. Having some sample code would be really nice too.

For Mesa at least (and I believe for the other drivers too) we always
have a few exported buffers in every execbuf call, and we rely on the
implicit synchronization provided by execbuf to make sure everything
works. The execbuf ioctl also has some code to flush caches during
implicit synchronization AFAIR, so I would guess we rely on it too and
whatever else the Kernel does. Is that covered by the new ioctls?

In addition, as far as I remember, one of the big improvements of
vm_bind was that it would help reduce ioctl latency and cpu overhead.
But if making execbuf faster comes at the cost of requiring additional
ioctls calls for implicit synchronization, which is required on ever
execbuf call, then I wonder if we'll even get any faster at all.
Comparing old execbuf vs plain new execbuf without the new required
ioctls won't make sense.

But maybe I'm wrong and we won't need to call these new ioctls around
every single execbuf ioctl we submit? Again, more clarification and
some code examples here would be really nice. This is a big change on
an important part of the API, we should clarify the new expected usage.

> +
> +This also means, we need an execbuff extension to pass in the batch
> +buffer addresses (See struct drm_i915_gem_execbuffer_ext_batch_addresses).
> +
> +If at all execlist support in execbuff ioctl is deemed necessary for
> +implicit sync in certain use cases, then support can be added later.

IMHO we really need to sort this and check all the assumptions before
we commit to any interface. Again, implicit synchronization is
something we rely on during *every* execbuf ioctl for most workloads.


> +In VM_BIND mode, VA allocation is completely managed by the user instead of
> +the i915 driver. Hence all VA assignment, eviction are not applicable in
> +VM_BIND mode. Also, for determining object activeness, VM_BIND mode will not
> +be using the i915_vma active reference tracking. It will instead use dma-resv
> +object for that (See `VM_BIND dma_resv usage`_).
> +
> +So, a lot of existing code in the execbuff path like relocations, VA evictions,
> +vma lookup table, implicit sync, vma active reference tracking etc., are not
> +applicable in VM_BIND mode. Hence, the execbuff path needs to be cleaned up
> +by clearly separating out the functionalities where the VM_BIND mode differs
> +from older method and they should be moved to separate files.

I seem to recall some conversations where we were told a bunch of
ioctls would stop working or make no sense to call when using vm_bind.
Can we please get a complete list of those? Bonus points if the Kernel
starts telling us we just called something that makes no sense.

> +
> +VM_PRIVATE objects
> +-------------------
> +By default, BOs can be mapped on multiple VMs and can also be dma-buf
> +exported. Hence these BOs are referred to as Shared BOs.
> +During each execbuff submission, the request fence must be added to the
> +dma-resv fence list of all shared BOs mapped on the VM.
> +
> +VM_BIND feature introduces an optimization where user can create BO which
> +is private to a specified VM via I915_GEM_CREATE_EXT_VM_PRIVATE flag during
> +BO creation. Unlike Shared BOs, these VM private BOs can only be mapped on
> +the VM they are private to and can't be dma-buf exported.
> +All private BOs of a VM share the dma-resv object. Hence during each execbuff
> +submission, they need only one dma-resv fence list updated. Thus, the fast
> +path (where required mappings are already bound) submission latency is O(1)
> +w.r.t the number of VM private BOs.

I know we already discussed this, but just to document it publicly: the
ideal case for user space would be that every BO is created as private
but then we'd have an ioctl to convert it to non-private (without the
need to have a non-private->private interface).

An explanation on why we can't have an ioctl to mark as exported a
buffer that was previously vm_private would be really appreciated.

Thanks,
Paulo


> +
> +VM_BIND locking hirarchy
> +-------------------------
> +The locking design here supports the older (execlist based) execbuff mode, the
> +newer VM_BIND mode, the VM_BIND mode with GPU page faults and possible future
> +system allocator support (See `Shared Virtual Memory (SVM) support`_).
> +The older execbuff mode and the newer VM_BIND mode without page faults manages
> +residency of backing storage using dma_fence. The VM_BIND mode with page faults
> +and the system allocator support do not use any dma_fence at all.
> +
> +VM_BIND locking order is as below.
> +
> +1) Lock-A: A vm_bind mutex will protect vm_bind lists. This lock is taken in
> +   vm_bind/vm_unbind ioctl calls, in the execbuff path and while releasing the
> +   mapping.
> +
> +   In future, when GPU page faults are supported, we can potentially use a
> +   rwsem instead, so that multiple page fault handlers can take the read side
> +   lock to lookup the mapping and hence can run in parallel.
> +   The older execbuff mode of binding do not need this lock.
> +
> +2) Lock-B: The object's dma-resv lock will protect i915_vma state and needs to
> +   be held while binding/unbinding a vma in the async worker and while updating
> +   dma-resv fence list of an object. Note that private BOs of a VM will all
> +   share a dma-resv object.
> +
> +   The future system allocator support will use the HMM prescribed locking
> +   instead.
> +
> +3) Lock-C: Spinlock/s to protect some of the VM's lists like the list of
> +   invalidated vmas (due to eviction and userptr invalidation) etc.
> +
> +When GPU page faults are supported, the execbuff path do not take any of these
> +locks. There we will simply smash the new batch buffer address into the ring and
> +then tell the scheduler run that. The lock taking only happens from the page
> +fault handler, where we take lock-A in read mode, whichever lock-B we need to
> +find the backing storage (dma_resv lock for gem objects, and hmm/core mm for
> +system allocator) and some additional locks (lock-D) for taking care of page
> +table races. Page fault mode should not need to ever manipulate the vm lists,
> +so won't ever need lock-C.
> +
> +VM_BIND LRU handling
> +---------------------
> +We need to ensure VM_BIND mapped objects are properly LRU tagged to avoid
> +performance degradation. We will also need support for bulk LRU movement of
> +VM_BIND objects to avoid additional latencies in execbuff path.
> +
> +The page table pages are similar to VM_BIND mapped objects (See
> +`Evictable page table allocations`_) and are maintained per VM and needs to
> +be pinned in memory when VM is made active (ie., upon an execbuff call with
> +that VM). So, bulk LRU movement of page table pages is also needed.
> +
> +The i915 shrinker LRU has stopped being an LRU. So, it should also be moved
> +over to the ttm LRU in some fashion to make sure we once again have a reasonable
> +and consistent memory aging and reclaim architecture.
> +
> +VM_BIND dma_resv usage
> +-----------------------
> +Fences needs to be added to all VM_BIND mapped objects. During each execbuff
> +submission, they are added with DMA_RESV_USAGE_BOOKKEEP usage to prevent
> +over sync (See enum dma_resv_usage). One can override it with either
> +DMA_RESV_USAGE_READ or DMA_RESV_USAGE_WRITE usage during object dependency
> +setting (either through explicit or implicit mechanism).
> +
> +When vm_bind is called for a non-private object while the VM is already
> +active, the fences need to be copied from VM's shared dma-resv object
> +(common to all private objects of the VM) to this non-private object.
> +If this results in performance degradation, then some optimization will
> +be needed here. This is not a problem for VM's private objects as they use
> +shared dma-resv object which is always updated on each execbuff submission.
> +
> +Also, in VM_BIND mode, use dma-resv apis for determining object activeness
> +(See dma_resv_test_signaled() and dma_resv_wait_timeout()) and do not use the
> +older i915_vma active reference tracking which is deprecated. This should be
> +easier to get it working with the current TTM backend. We can remove the
> +i915_vma active reference tracking fully while supporting TTM backend for igfx.
> +
> +Evictable page table allocations
> +---------------------------------
> +Make pagetable allocations evictable and manage them similar to VM_BIND
> +mapped objects. Page table pages are similar to persistent mappings of a
> +VM (difference here are that the page table pages will not have an i915_vma
> +structure and after swapping pages back in, parent page link needs to be
> +updated).
> +
> +Mesa use case
> +--------------
> +VM_BIND can potentially reduce the CPU overhead in Mesa (both Vulkan and Iris),
> +hence improving performance of CPU-bound applications. It also allows us to
> +implement Vulkan's Sparse Resources. With increasing GPU hardware performance,
> +reducing CPU overhead becomes more impactful.
> +
> +
> +VM_BIND Compute support
> +========================
> +
> +User/Memory Fence
> +------------------
> +The idea is to take a user specified virtual address and install an interrupt
> +handler to wake up the current task when the memory location passes the user
> +supplied filter. User/Memory fence is a <address, value> pair. To signal the
> +user fence, specified value will be written at the specified virtual address
> +and wakeup the waiting process. User can wait on a user fence with the
> +gem_wait_user_fence ioctl.
> +
> +It also allows the user to emit their own MI_FLUSH/PIPE_CONTROL notify
> +interrupt within their batches after updating the value to have sub-batch
> +precision on the wakeup. Each batch can signal a user fence to indicate
> +the completion of next level batch. The completion of very first level batch
> +needs to be signaled by the command streamer. The user must provide the
> +user/memory fence for this via the DRM_I915_GEM_EXECBUFFER_EXT_USER_FENCE
> +extension of execbuff ioctl, so that KMD can setup the command streamer to
> +signal it.
> +
> +User/Memory fence can also be supplied to the kernel driver to signal/wake up
> +the user process after completion of an asynchronous operation.
> +
> +When VM_BIND ioctl was provided with a user/memory fence via the
> +I915_VM_BIND_EXT_USER_FENCE extension, it will be signaled upon the completion
> +of binding of that mapping. All async binds/unbinds are serialized, hence
> +signaling of user/memory fence also indicate the completion of all previous
> +binds/unbinds.
> +
> +This feature will be derived from the below original work:
> +https://patchwork.freedesktop.org/patch/349417/
> +
> +Long running Compute contexts
> +------------------------------
> +Usage of dma-fence expects that they complete in reasonable amount of time.
> +Compute on the other hand can be long running. Hence it is appropriate for
> +compute to use user/memory fence and dma-fence usage will be limited to
> +in-kernel consumption only. This requires an execbuff uapi extension to pass
> +in user fence (See struct drm_i915_vm_bind_ext_user_fence). Compute must opt-in
> +for this mechanism with I915_CONTEXT_CREATE_FLAGS_LONG_RUNNING flag during
> +context creation. The dma-fence based user interfaces like gem_wait ioctl and
> +execbuff out fence are not allowed on long running contexts. Implicit sync is
> +not valid as well and is anyway not supported in VM_BIND mode.
> +
> +Where GPU page faults are not available, kernel driver upon buffer invalidation
> +will initiate a suspend (preemption) of long running context with a dma-fence
> +attached to it. And upon completion of that suspend fence, finish the
> +invalidation, revalidate the BO and then resume the compute context. This is
> +done by having a per-context preempt fence (also called suspend fence) proxying
> +as i915_request fence. This suspend fence is enabled when someone tries to wait
> +on it, which then triggers the context preemption.
> +
> +As this support for context suspension using a preempt fence and the resume work
> +for the compute mode contexts can get tricky to get it right, it is better to
> +add this support in drm scheduler so that multiple drivers can make use of it.
> +That means, it will have a dependency on i915 drm scheduler conversion with GuC
> +scheduler backend. This should be fine, as the plan is to support compute mode
> +contexts only with GuC scheduler backend (at least initially). This is much
> +easier to support with VM_BIND mode compared to the current heavier execbuff
> +path resource attachment.
> +
> +Low Latency Submission
> +-----------------------
> +Allows compute UMD to directly submit GPU jobs instead of through execbuff
> +ioctl. This is made possible by VM_BIND is not being synchronized against
> +execbuff. VM_BIND allows bind/unbind of mappings required for the directly
> +submitted jobs.
> +
> +Other VM_BIND use cases
> +========================
> +
> +Debugger
> +---------
> +With debug event interface user space process (debugger) is able to keep track
> +of and act upon resources created by another process (debugged) and attached
> +to GPU via vm_bind interface.
> +
> +GPU page faults
> +----------------
> +GPU page faults when supported (in future), will only be supported in the
> +VM_BIND mode. While both the older execbuff mode and the newer VM_BIND mode of
> +binding will require using dma-fence to ensure residency, the GPU page faults
> +mode when supported, will not use any dma-fence as residency is purely managed
> +by installing and removing/invalidating page table entries.
> +
> +Page level hints settings
> +--------------------------
> +VM_BIND allows any hints setting per mapping instead of per BO.
> +Possible hints include read-only mapping, placement and atomicity.
> +Sub-BO level placement hint will be even more relevant with
> +upcoming GPU on-demand page fault support.
> +
> +Page level Cache/CLOS settings
> +-------------------------------
> +VM_BIND allows cache/CLOS settings per mapping instead of per BO.
> +
> +Shared Virtual Memory (SVM) support
> +------------------------------------
> +VM_BIND interface can be used to map system memory directly (without gem BO
> +abstraction) using the HMM interface. SVM is only supported with GPU page
> +faults enabled.
> +
> +
> +Broder i915 cleanups
> +=====================
> +Supporting this whole new vm_bind mode of binding which comes with its own
> +use cases to support and the locking requirements requires proper integration
> +with the existing i915 driver. This calls for some broader i915 driver
> +cleanups/simplifications for maintainability of the driver going forward.
> +Here are few things identified and are being looked into.
> +
> +- Remove vma lookup cache (eb->gem_context->handles_vma). VM_BIND feature
> +  do not use it and complexity it brings in is probably more than the
> +  performance advantage we get in legacy execbuff case.
> +- Remove vma->open_count counting
> +- Remove i915_vma active reference tracking. VM_BIND feature will not be using
> +  it. Instead use underlying BO's dma-resv fence list to determine if a i915_vma
> +  is active or not.
> +
> +
> +VM_BIND UAPI
> +=============
> +
> +.. kernel-doc:: Documentation/gpu/rfc/i915_vm_bind.h
> diff --git a/Documentation/gpu/rfc/index.rst b/Documentation/gpu/rfc/index.rst
> index 91e93a705230..7d10c36b268d 100644
> --- a/Documentation/gpu/rfc/index.rst
> +++ b/Documentation/gpu/rfc/index.rst
> @@ -23,3 +23,7 @@ host such documentation:
>  .. toctree::
>  
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
>      i915_scheduler.rst
> +
> +.. toctree::
> +
> +    i915_vm_bind.rst


^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [Intel-gfx] [RFC v3 3/3] drm/doc/rfc: VM_BIND uapi definition
  2022-05-17 18:32   ` Niranjana Vishwanathapura
  (?)
@ 2022-05-19 23:07   ` Zanoni, Paulo R
  2022-05-23 19:19     ` Niranjana Vishwanathapura
  -1 siblings, 1 reply; 121+ messages in thread
From: Zanoni, Paulo R @ 2022-05-19 23:07 UTC (permalink / raw)
  To: dri-devel, Vetter, Daniel, Vishwanathapura, Niranjana, intel-gfx
  Cc: Hellstrom, Thomas, christian.koenig, Wilson, Chris P

On Tue, 2022-05-17 at 11:32 -0700, Niranjana Vishwanathapura wrote:
> VM_BIND and related uapi definitions
> 
> v2: Ensure proper kernel-doc formatting with cross references.
>     Also add new uapi and documentation as per review comments
>     from Daniel.
> 
> Signed-off-by: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>
> ---
>  Documentation/gpu/rfc/i915_vm_bind.h | 399 +++++++++++++++++++++++++++
>  1 file changed, 399 insertions(+)
>  create mode 100644 Documentation/gpu/rfc/i915_vm_bind.h
> 
> diff --git a/Documentation/gpu/rfc/i915_vm_bind.h b/Documentation/gpu/rfc/i915_vm_bind.h
> new file mode 100644
> index 000000000000..589c0a009107
> --- /dev/null
> +++ b/Documentation/gpu/rfc/i915_vm_bind.h
> @@ -0,0 +1,399 @@
> +/* SPDX-License-Identifier: MIT */
> +/*
> + * Copyright © 2022 Intel Corporation
> + */
> +
> +/**
> + * DOC: I915_PARAM_HAS_VM_BIND
> + *
> + * VM_BIND feature availability.
> + * See typedef drm_i915_getparam_t param.
> + */
> +#define I915_PARAM_HAS_VM_BIND		57
> +
> +/**
> + * DOC: I915_VM_CREATE_FLAGS_USE_VM_BIND
> + *
> + * Flag to opt-in for VM_BIND mode of binding during VM creation.
> + * See struct drm_i915_gem_vm_control flags.
> + *
> + * A VM in VM_BIND mode will not support the older execbuff mode of binding.
> + * In VM_BIND mode, execbuff ioctl will not accept any execlist (ie., the
> + * &drm_i915_gem_execbuffer2.buffer_count must be 0).
> + * Also, &drm_i915_gem_execbuffer2.batch_start_offset and
> + * &drm_i915_gem_execbuffer2.batch_len must be 0.
> + * DRM_I915_GEM_EXECBUFFER_EXT_BATCH_ADDRESSES extension must be provided
> + * to pass in the batch buffer addresses.
> + *
> + * Additionally, I915_EXEC_NO_RELOC, I915_EXEC_HANDLE_LUT and
> + * I915_EXEC_BATCH_FIRST of &drm_i915_gem_execbuffer2.flags must be 0
> + * (not used) in VM_BIND mode. I915_EXEC_USE_EXTENSIONS flag must always be
> + * set (See struct drm_i915_gem_execbuffer_ext_batch_addresses).
> + * The buffers_ptr, buffer_count, batch_start_offset and batch_len fields
> + * of struct drm_i915_gem_execbuffer2 are also not used and must be 0.
> + */

From that description, it seems we have:

struct drm_i915_gem_execbuffer2 {
	__u64 buffers_ptr;		-> must be 0 (new)
	__u32 buffer_count;		-> must be 0 (new)
	__u32 batch_start_offset;	-> must be 0 (new)
	__u32 batch_len;		-> must be 0 (new)
	__u32 DR1;			-> must be 0 (old)
	__u32 DR4;			-> must be 0 (old)
	__u32 num_cliprects; (fences)	-> must be 0 since using extensions
	__u64 cliprects_ptr; (fences, extensions) -> contains an actual pointer!
	__u64 flags;			-> some flags must be 0 (new)
	__u64 rsvd1; (context info)	-> repurposed field (old)
	__u64 rsvd2;			-> unused
};

Based on that, why can't we just get drm_i915_gem_execbuffer3 instead
of adding even more complexity to an already abused interface? While
the Vulkan-like extension thing is really nice, I don't think what
we're doing here is extending the ioctl usage, we're completely
changing how the base struct should be interpreted based on how the VM
was created (which is an entirely different ioctl).

From Rusty Russel's API Design grading, drm_i915_gem_execbuffer2 is
already at -6 without these changes. I think after vm_bind we'll need
to create a -11 entry just to deal with this ioctl.


+#define I915_VM_CREATE_FLAGS_USE_VM_BIND	(1 << 0)
+
+/**
+ * DOC: I915_CONTEXT_CREATE_FLAGS_LONG_RUNNING
+ *
+ * Flag to declare context as long running.
+ * See struct drm_i915_gem_context_create_ext flags.
+ *
+ * Usage of dma-fence expects that they complete in reasonable amount of time.
+ * Compute on the other hand can be long running. Hence it is not appropriate
+ * for compute contexts to export request completion dma-fence to user.
+ * The dma-fence usage will be limited to in-kernel consumption only.
+ * Compute contexts need to use user/memory fence.
+ *
+ * So, long running contexts do not support output fences. Hence,
+ * I915_EXEC_FENCE_OUT (See &drm_i915_gem_execbuffer2.flags and
+ * I915_EXEC_FENCE_SIGNAL (See &drm_i915_gem_exec_fence.flags) are expected
+ * to be not used.
+ *
+ * DRM_I915_GEM_WAIT ioctl call is also not supported for objects mapped
+ * to long running contexts.
+ */
+#define I915_CONTEXT_CREATE_FLAGS_LONG_RUNNING   (1u << 2)
+
+/* VM_BIND related ioctls */
+#define DRM_I915_GEM_VM_BIND		0x3d
+#define DRM_I915_GEM_VM_UNBIND		0x3e
+#define DRM_I915_GEM_WAIT_USER_FENCE	0x3f
+
+#define DRM_IOCTL_I915_GEM_VM_BIND		DRM_IOWR(DRM_COMMAND_BASE + DRM_I915_GEM_VM_BIND, struct drm_i915_gem_vm_bind)
+#define DRM_IOCTL_I915_GEM_VM_UNBIND		DRM_IOWR(DRM_COMMAND_BASE + DRM_I915_GEM_VM_UNBIND, struct drm_i915_gem_vm_bind)
+#define DRM_IOCTL_I915_GEM_WAIT_USER_FENCE	DRM_IOWR(DRM_COMMAND_BASE + DRM_I915_GEM_WAIT_USER_FENCE, struct drm_i915_gem_wait_user_fence)
+
+/**
+ * struct drm_i915_gem_vm_bind - VA to object mapping to bind.
+ *
+ * This structure is passed to VM_BIND ioctl and specifies the mapping of GPU
+ * virtual address (VA) range to the section of an object that should be bound
+ * in the device page table of the specified address space (VM).
+ * The VA range specified must be unique (ie., not currently bound) and can
+ * be mapped to whole object or a section of the object (partial binding).
+ * Multiple VA mappings can be created to the same section of the object
+ * (aliasing).
+ */
+struct drm_i915_gem_vm_bind {
+	/** @vm_id: VM (address space) id to bind */
+	__u32 vm_id;
+
+	/** @handle: Object handle */
+	__u32 handle;
+
+	/** @start: Virtual Address start to bind */
+	__u64 start;
+
+	/** @offset: Offset in object to bind */
+	__u64 offset;
+
+	/** @length: Length of mapping to bind */
+	__u64 length;
+
+	/**
+	 * @flags: Supported flags are,
+	 *
+	 * I915_GEM_VM_BIND_READONLY:
+	 * Mapping is read-only.
+	 *
+	 * I915_GEM_VM_BIND_CAPTURE:
+	 * Capture this mapping in the dump upon GPU error.
+	 */
+	__u64 flags;
+#define I915_GEM_VM_BIND_READONLY    (1 << 0)
+#define I915_GEM_VM_BIND_CAPTURE     (1 << 1)
+
+	/** @extensions: 0-terminated chain of extensions for this mapping. */
+	__u64 extensions;
+};
+
+/**
+ * struct drm_i915_gem_vm_unbind - VA to object mapping to unbind.
+ *
+ * This structure is passed to VM_UNBIND ioctl and specifies the GPU virtual
+ * address (VA) range that should be unbound from the device page table of the
+ * specified address space (VM). The specified VA range must match one of the
+ * mappings created with the VM_BIND ioctl. TLB is flushed upon unbind
+ * completion.
+ */
+struct drm_i915_gem_vm_unbind {
+	/** @vm_id: VM (address space) id to bind */
+	__u32 vm_id;
+
+	/** @rsvd: Reserved for future use; must be zero. */
+	__u32 rsvd;
+
+	/** @start: Virtual Address start to unbind */
+	__u64 start;
+
+	/** @length: Length of mapping to unbind */
+	__u64 length;
+
+	/** @flags: reserved for future usage, currently MBZ */
+	__u64 flags;
+
+	/** @extensions: 0-terminated chain of extensions for this mapping. */
+	__u64 extensions;
+};
+
+/**
+ * struct drm_i915_vm_bind_fence - An input or output fence for the vm_bind
+ * or the vm_unbind work.
+ *
+ * The vm_bind or vm_unbind aync worker will wait for input fence to signal
+ * before starting the binding or unbinding.
+ *
+ * The vm_bind or vm_unbind async worker will signal the returned output fence
+ * after the completion of binding or unbinding.
+ */
+struct drm_i915_vm_bind_fence {
+	/** @handle: User's handle for a drm_syncobj to wait on or signal. */
+	__u32 handle;
+
+	/**
+	 * @flags: Supported flags are,
+	 *
+	 * I915_VM_BIND_FENCE_WAIT:
+	 * Wait for the input fence before binding/unbinding
+	 *
+	 * I915_VM_BIND_FENCE_SIGNAL:
+	 * Return bind/unbind completion fence as output
+	 */
+	__u32 flags;
+#define I915_VM_BIND_FENCE_WAIT            (1<<0)
+#define I915_VM_BIND_FENCE_SIGNAL          (1<<1)
+#define __I915_VM_BIND_FENCE_UNKNOWN_FLAGS (-(I915_VM_BIND_FENCE_SIGNAL << 1))
+};
+
+/**
+ * struct drm_i915_vm_bind_ext_timeline_fences - Timeline fences for vm_bind
+ * and vm_unbind.
+ *
+ * This structure describes an array of timeline drm_syncobj and associated
+ * points for timeline variants of drm_syncobj. These timeline 'drm_syncobj's
+ * can be input or output fences (See struct drm_i915_vm_bind_fence).
+ */
+struct drm_i915_vm_bind_ext_timeline_fences {
+#define I915_VM_BIND_EXT_timeline_FENCES	0
+	/** @base: Extension link. See struct i915_user_extension. */
+	struct i915_user_extension base;
+
+	/**
+	 * @fence_count: Number of elements in the @handles_ptr & @value_ptr
+	 * arrays.
+	 */
+	__u64 fence_count;
+
+	/**
+	 * @handles_ptr: Pointer to an array of struct drm_i915_vm_bind_fence
+	 * of length @fence_count.
+	 */
+	__u64 handles_ptr;
+
+	/**
+	 * @values_ptr: Pointer to an array of u64 values of length
+	 * @fence_count.
+	 * Values must be 0 for a binary drm_syncobj. A Value of 0 for a
+	 * timeline drm_syncobj is invalid as it turns a drm_syncobj into a
+	 * binary one.
+	 */
+	__u64 values_ptr;
+};
+
+/**
+ * struct drm_i915_vm_bind_user_fence - An input or output user fence for the
+ * vm_bind or the vm_unbind work.
+ *
+ * The vm_bind or vm_unbind aync worker will wait for the input fence (value at
+ * @addr to become equal to @val) before starting the binding or unbinding.
+ *
+ * The vm_bind or vm_unbind async worker will signal the output fence after
+ * the completion of binding or unbinding by writing @val to memory location at
+ * @addr
+ */
+struct drm_i915_vm_bind_user_fence {
+	/** @addr: User/Memory fence qword aligned process virtual address */
+	__u64 addr;
+
+	/** @val: User/Memory fence value to be written after bind completion */
+	__u64 val;
+
+	/**
+	 * @flags: Supported flags are,
+	 *
+	 * I915_VM_BIND_USER_FENCE_WAIT:
+	 * Wait for the input fence before binding/unbinding
+	 *
+	 * I915_VM_BIND_USER_FENCE_SIGNAL:
+	 * Return bind/unbind completion fence as output
+	 */
+	__u32 flags;
+#define I915_VM_BIND_USER_FENCE_WAIT            (1<<0)
+#define I915_VM_BIND_USER_FENCE_SIGNAL          (1<<1)
+#define __I915_VM_BIND_USER_FENCE_UNKNOWN_FLAGS \
+	(-(I915_VM_BIND_USER_FENCE_SIGNAL << 1))
+};
+
+/**
+ * struct drm_i915_vm_bind_ext_user_fence - User/memory fences for vm_bind
+ * and vm_unbind.
+ *
+ * These user fences can be input or output fences
+ * (See struct drm_i915_vm_bind_user_fence).
+ */
+struct drm_i915_vm_bind_ext_user_fence {
+#define I915_VM_BIND_EXT_USER_FENCES	1
+	/** @base: Extension link. See struct i915_user_extension. */
+	struct i915_user_extension base;
+
+	/** @fence_count: Number of elements in the @user_fence_ptr array. */
+	__u64 fence_count;
+
+	/**
+	 * @user_fence_ptr: Pointer to an array of
+	 * struct drm_i915_vm_bind_user_fence of length @fence_count.
+	 */
+	__u64 user_fence_ptr;
+};
+
+/**
+ * struct drm_i915_gem_execbuffer_ext_batch_addresses - Array of batch buffer
+ * gpu virtual addresses.
+ *
+ * In the execbuff ioctl (See struct drm_i915_gem_execbuffer2), this extension
+ * must always be appended in the VM_BIND mode and it will be an error to
+ * append this extension in older non-VM_BIND mode.
+ */
+struct drm_i915_gem_execbuffer_ext_batch_addresses {
+#define DRM_I915_GEM_EXECBUFFER_EXT_BATCH_ADDRESSES	1
+	/** @base: Extension link. See struct i915_user_extension. */
+	struct i915_user_extension base;
+
+	/** @count: Number of addresses in the addr array. */
+	__u32 count;
+
+	/** @addr: An array of batch gpu virtual addresses. */
+	__u64 addr[0];
+};
+
+/**
+ * struct drm_i915_gem_execbuffer_ext_user_fence - First level batch completion
+ * signaling extension.
+ *
+ * This extension allows user to attach a user fence (@addr, @value pair) to an
+ * execbuf to be signaled by the command streamer after the completion of first
+ * level batch, by writing the @value at specified @addr and triggering an
+ * interrupt.
+ * User can either poll for this user fence to signal or can also wait on it
+ * with i915_gem_wait_user_fence ioctl.
+ * This is very much usefaul for long running contexts where waiting on dma-fence
+ * by user (like i915_gem_wait ioctl) is not supported.
+ */
+struct drm_i915_gem_execbuffer_ext_user_fence {
+#define DRM_I915_GEM_EXECBUFFER_EXT_USER_FENCE		2
+	/** @base: Extension link. See struct i915_user_extension. */
+	struct i915_user_extension base;
+
+	/**
+	 * @addr: User/Memory fence qword aligned GPU virtual address.
+	 *
+	 * Address has to be a valid GPU virtual address at the time of
+	 * first level batch completion.
+	 */
+	__u64 addr;
+
+	/**
+	 * @value: User/Memory fence Value to be written to above address
+	 * after first level batch completes.
+	 */
+	__u64 value;
+
+	/** @rsvd: Reserved for future extensions, MBZ */
+	__u64 rsvd;
+};
+
+/**
+ * struct drm_i915_gem_create_ext_vm_private - Extension to make the object
+ * private to the specified VM.
+ *
+ * See struct drm_i915_gem_create_ext.
+ */
+struct drm_i915_gem_create_ext_vm_private {
+#define I915_GEM_CREATE_EXT_VM_PRIVATE		2
+	/** @base: Extension link. See struct i915_user_extension. */
+	struct i915_user_extension base;
+
+	/** @vm_id: Id of the VM to which the object is private */
+	__u32 vm_id;
+};
+
+/**
+ * struct drm_i915_gem_wait_user_fence - Wait on user/memory fence.
+ *
+ * User/Memory fence can be woken up either by:
+ *
+ * 1. GPU context indicated by @ctx_id, or,
+ * 2. Kerrnel driver async worker upon I915_UFENCE_WAIT_SOFT.
+ *    @ctx_id is ignored when this flag is set.
+ *
+ * Wakeup condition is,
+ * ``((*addr & mask) op (value & mask))``
+ *
+ * See :ref:`Documentation/driver-api/dma-buf.rst <indefinite_dma_fences>`
+ */
+struct drm_i915_gem_wait_user_fence {
+	/** @extensions: Zero-terminated chain of extensions. */
+	__u64 extensions;
+
+	/** @addr: User/Memory fence address */
+	__u64 addr;
+
+	/** @ctx_id: Id of the Context which will signal the fence. */
+	__u32 ctx_id;
+
+	/** @op: Wakeup condition operator */
+	__u16 op;
+#define I915_UFENCE_WAIT_EQ      0
+#define I915_UFENCE_WAIT_NEQ     1
+#define I915_UFENCE_WAIT_GT      2
+#define I915_UFENCE_WAIT_GTE     3
+#define I915_UFENCE_WAIT_LT      4
+#define I915_UFENCE_WAIT_LTE     5
+#define I915_UFENCE_WAIT_BEFORE  6
+#define I915_UFENCE_WAIT_AFTER   7
+
+	/**
+	 * @flags: Supported flags are,
+	 *
+	 * I915_UFENCE_WAIT_SOFT:
+	 *
+	 * To be woken up by i915 driver async worker (not by GPU).
+	 *
+	 * I915_UFENCE_WAIT_ABSTIME:
+	 *
+	 * Wait timeout specified as absolute time.
+	 */
+	__u16 flags;
+#define I915_UFENCE_WAIT_SOFT    0x1
+#define I915_UFENCE_WAIT_ABSTIME 0x2
+
+	/** @value: Wakeup value */
+	__u64 value;
+
+	/** @mask: Wakeup mask */
+	__u64 mask;
+#define I915_UFENCE_WAIT_U8     0xffu
+#define I915_UFENCE_WAIT_U16    0xffffu
+#define I915_UFENCE_WAIT_U32    0xfffffffful
+#define I915_UFENCE_WAIT_U64    0xffffffffffffffffull
+
+	/**
+	 * @timeout: Wait timeout in nanoseconds.
+	 *
+	 * If I915_UFENCE_WAIT_ABSTIME flag is set, then time timeout is the
+	 * absolute time in nsec.
+	 */
+	__s64 timeout;
+};


^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [RFC v3 1/3] drm/doc/rfc: VM_BIND feature design document
  2022-05-19 22:52     ` [Intel-gfx] " Zanoni, Paulo R
@ 2022-05-23 19:05       ` Niranjana Vishwanathapura
  -1 siblings, 0 replies; 121+ messages in thread
From: Niranjana Vishwanathapura @ 2022-05-23 19:05 UTC (permalink / raw)
  To: Zanoni, Paulo R
  Cc: Brost, Matthew, intel-gfx, dri-devel, Hellstrom, Thomas, Wilson,
	Chris P, jason, Vetter, Daniel, christian.koenig

On Thu, May 19, 2022 at 03:52:01PM -0700, Zanoni, Paulo R wrote:
>On Tue, 2022-05-17 at 11:32 -0700, Niranjana Vishwanathapura wrote:
>> VM_BIND design document with description of intended use cases.
>>
>> v2: Add more documentation and format as per review comments
>>     from Daniel.
>>
>> Signed-off-by: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>
>> ---
>>
>> diff --git a/Documentation/gpu/rfc/i915_vm_bind.rst b/Documentation/gpu/rfc/i915_vm_bind.rst
>> new file mode 100644
>> index 000000000000..f1be560d313c
>> --- /dev/null
>> +++ b/Documentation/gpu/rfc/i915_vm_bind.rst
>> @@ -0,0 +1,304 @@
>> +==========================================
>> +I915 VM_BIND feature design and use cases
>> +==========================================
>> +
>> +VM_BIND feature
>> +================
>> +DRM_I915_GEM_VM_BIND/UNBIND ioctls allows UMD to bind/unbind GEM buffer
>> +objects (BOs) or sections of a BOs at specified GPU virtual addresses on a
>> +specified address space (VM). These mappings (also referred to as persistent
>> +mappings) will be persistent across multiple GPU submissions (execbuff calls)
>> +issued by the UMD, without user having to provide a list of all required
>> +mappings during each submission (as required by older execbuff mode).
>> +
>> +VM_BIND/UNBIND ioctls will support 'in' and 'out' fences to allow userpace
>> +to specify how the binding/unbinding should sync with other operations
>> +like the GPU job submission. These fences will be timeline 'drm_syncobj's
>> +for non-Compute contexts (See struct drm_i915_vm_bind_ext_timeline_fences).
>> +For Compute contexts, they will be user/memory fences (See struct
>> +drm_i915_vm_bind_ext_user_fence).
>> +
>> +VM_BIND feature is advertised to user via I915_PARAM_HAS_VM_BIND.
>> +User has to opt-in for VM_BIND mode of binding for an address space (VM)
>> +during VM creation time via I915_VM_CREATE_FLAGS_USE_VM_BIND extension.
>> +
>> +VM_BIND/UNBIND ioctl will immediately start binding/unbinding the mapping in an
>> +async worker. The binding and unbinding will work like a special GPU engine.
>> +The binding and unbinding operations are serialized and will wait on specified
>> +input fences before the operation and will signal the output fences upon the
>> +completion of the operation. Due to serialization, completion of an operation
>> +will also indicate that all previous operations are also complete.
>> +
>> +VM_BIND features include:
>> +
>> +* Multiple Virtual Address (VA) mappings can map to the same physical pages
>> +  of an object (aliasing).
>> +* VA mapping can map to a partial section of the BO (partial binding).
>> +* Support capture of persistent mappings in the dump upon GPU error.
>> +* TLB is flushed upon unbind completion. Batching of TLB flushes in some
>> +  use cases will be helpful.
>> +* Asynchronous vm_bind and vm_unbind support with 'in' and 'out' fences.
>> +* Support for userptr gem objects (no special uapi is required for this).
>> +
>> +Execbuff ioctl in VM_BIND mode
>> +-------------------------------
>> +The execbuff ioctl handling in VM_BIND mode differs significantly from the
>> +older method. A VM in VM_BIND mode will not support older execbuff mode of
>> +binding. In VM_BIND mode, execbuff ioctl will not accept any execlist. Hence,
>> +no support for implicit sync. It is expected that the below work will be able
>> +to support requirements of object dependency setting in all use cases:
>> +
>> +"dma-buf: Add an API for exporting sync files"
>> +(https://lwn.net/Articles/859290/)
>
>I would really like to have more details here. The link provided points
>to new ioctls and we're not very familiar with those yet, so I think
>you should really clarify the interaction between the new additions
>here. Having some sample code would be really nice too.
>
>For Mesa at least (and I believe for the other drivers too) we always
>have a few exported buffers in every execbuf call, and we rely on the
>implicit synchronization provided by execbuf to make sure everything
>works. The execbuf ioctl also has some code to flush caches during
>implicit synchronization AFAIR, so I would guess we rely on it too and
>whatever else the Kernel does. Is that covered by the new ioctls?
>
>In addition, as far as I remember, one of the big improvements of
>vm_bind was that it would help reduce ioctl latency and cpu overhead.
>But if making execbuf faster comes at the cost of requiring additional
>ioctls calls for implicit synchronization, which is required on ever
>execbuf call, then I wonder if we'll even get any faster at all.
>Comparing old execbuf vs plain new execbuf without the new required
>ioctls won't make sense.
>
>But maybe I'm wrong and we won't need to call these new ioctls around
>every single execbuf ioctl we submit? Again, more clarification and
>some code examples here would be really nice. This is a big change on
>an important part of the API, we should clarify the new expected usage.
>

Thanks Paulo for the comments.

In VM_BIND mode, the only reason we would need execlist support in
execbuff path is for implicit synchronization. And AFAIK, this work
from Jason is expected replace implict synchronization with new ioctls.
Hence, VM_BIND mode will not be needing execlist support at all.

Based on comments from Daniel and my offline sync with Jason, this
new mechanism from Jason is expected work for vl. For gl, there is a
question of whether it will be performant or not. But it is worth trying
that first. If it is not performant for gl, then only we can consider
adding implicit sync support back for VM_BIND mode.

Daniel, Jason, Ken, any thoughts you can add here?

>> +
>> +This also means, we need an execbuff extension to pass in the batch
>> +buffer addresses (See struct drm_i915_gem_execbuffer_ext_batch_addresses).
>> +
>> +If at all execlist support in execbuff ioctl is deemed necessary for
>> +implicit sync in certain use cases, then support can be added later.
>
>IMHO we really need to sort this and check all the assumptions before
>we commit to any interface. Again, implicit synchronization is
>something we rely on during *every* execbuf ioctl for most workloads.
>

Daniel's earlier feedback was that it is worth Mesa trying this new
mechanism for gl and see it that works. We want to avoid supporting
execlist support for implicit sync in vm_bind mode from the beginning
if it is going to be deemed not necessary.

>
>> +In VM_BIND mode, VA allocation is completely managed by the user instead of
>> +the i915 driver. Hence all VA assignment, eviction are not applicable in
>> +VM_BIND mode. Also, for determining object activeness, VM_BIND mode will not
>> +be using the i915_vma active reference tracking. It will instead use dma-resv
>> +object for that (See `VM_BIND dma_resv usage`_).
>> +
>> +So, a lot of existing code in the execbuff path like relocations, VA evictions,
>> +vma lookup table, implicit sync, vma active reference tracking etc., are not
>> +applicable in VM_BIND mode. Hence, the execbuff path needs to be cleaned up
>> +by clearly separating out the functionalities where the VM_BIND mode differs
>> +from older method and they should be moved to separate files.
>
>I seem to recall some conversations where we were told a bunch of
>ioctls would stop working or make no sense to call when using vm_bind.
>Can we please get a complete list of those? Bonus points if the Kernel
>starts telling us we just called something that makes no sense.
>

Which ioctls you are talking about here?
We do not support GEM_WAIT ioctls, but that is only for compute mode (which is
already documented in this patch).

>> +
>> +VM_PRIVATE objects
>> +-------------------
>> +By default, BOs can be mapped on multiple VMs and can also be dma-buf
>> +exported. Hence these BOs are referred to as Shared BOs.
>> +During each execbuff submission, the request fence must be added to the
>> +dma-resv fence list of all shared BOs mapped on the VM.
>> +
>> +VM_BIND feature introduces an optimization where user can create BO which
>> +is private to a specified VM via I915_GEM_CREATE_EXT_VM_PRIVATE flag during
>> +BO creation. Unlike Shared BOs, these VM private BOs can only be mapped on
>> +the VM they are private to and can't be dma-buf exported.
>> +All private BOs of a VM share the dma-resv object. Hence during each execbuff
>> +submission, they need only one dma-resv fence list updated. Thus, the fast
>> +path (where required mappings are already bound) submission latency is O(1)
>> +w.r.t the number of VM private BOs.
>
>I know we already discussed this, but just to document it publicly: the
>ideal case for user space would be that every BO is created as private
>but then we'd have an ioctl to convert it to non-private (without the
>need to have a non-private->private interface).
>
>An explanation on why we can't have an ioctl to mark as exported a
>buffer that was previously vm_private would be really appreciated.
>

Ok, I can some notes on that.
The reason being the fact that this require changing the dma-resv object
for gem object, hence the object locking also. This will add complications
as we have to sync with any pending operations. It might be easier for
UMDs to do it themselves by copying the object contexts to a new object.

Niranjana

>Thanks,
>Paulo
>
>
>> +
>> +VM_BIND locking hirarchy
>> +-------------------------
>> +The locking design here supports the older (execlist based) execbuff mode, the
>> +newer VM_BIND mode, the VM_BIND mode with GPU page faults and possible future
>> +system allocator support (See `Shared Virtual Memory (SVM) support`_).
>> +The older execbuff mode and the newer VM_BIND mode without page faults manages
>> +residency of backing storage using dma_fence. The VM_BIND mode with page faults
>> +and the system allocator support do not use any dma_fence at all.
>> +
>> +VM_BIND locking order is as below.
>> +
>> +1) Lock-A: A vm_bind mutex will protect vm_bind lists. This lock is taken in
>> +   vm_bind/vm_unbind ioctl calls, in the execbuff path and while releasing the
>> +   mapping.
>> +
>> +   In future, when GPU page faults are supported, we can potentially use a
>> +   rwsem instead, so that multiple page fault handlers can take the read side
>> +   lock to lookup the mapping and hence can run in parallel.
>> +   The older execbuff mode of binding do not need this lock.
>> +
>> +2) Lock-B: The object's dma-resv lock will protect i915_vma state and needs to
>> +   be held while binding/unbinding a vma in the async worker and while updating
>> +   dma-resv fence list of an object. Note that private BOs of a VM will all
>> +   share a dma-resv object.
>> +
>> +   The future system allocator support will use the HMM prescribed locking
>> +   instead.
>> +
>> +3) Lock-C: Spinlock/s to protect some of the VM's lists like the list of
>> +   invalidated vmas (due to eviction and userptr invalidation) etc.
>> +
>> +When GPU page faults are supported, the execbuff path do not take any of these
>> +locks. There we will simply smash the new batch buffer address into the ring and
>> +then tell the scheduler run that. The lock taking only happens from the page
>> +fault handler, where we take lock-A in read mode, whichever lock-B we need to
>> +find the backing storage (dma_resv lock for gem objects, and hmm/core mm for
>> +system allocator) and some additional locks (lock-D) for taking care of page
>> +table races. Page fault mode should not need to ever manipulate the vm lists,
>> +so won't ever need lock-C.
>> +
>> +VM_BIND LRU handling
>> +---------------------
>> +We need to ensure VM_BIND mapped objects are properly LRU tagged to avoid
>> +performance degradation. We will also need support for bulk LRU movement of
>> +VM_BIND objects to avoid additional latencies in execbuff path.
>> +
>> +The page table pages are similar to VM_BIND mapped objects (See
>> +`Evictable page table allocations`_) and are maintained per VM and needs to
>> +be pinned in memory when VM is made active (ie., upon an execbuff call with
>> +that VM). So, bulk LRU movement of page table pages is also needed.
>> +
>> +The i915 shrinker LRU has stopped being an LRU. So, it should also be moved
>> +over to the ttm LRU in some fashion to make sure we once again have a reasonable
>> +and consistent memory aging and reclaim architecture.
>> +
>> +VM_BIND dma_resv usage
>> +-----------------------
>> +Fences needs to be added to all VM_BIND mapped objects. During each execbuff
>> +submission, they are added with DMA_RESV_USAGE_BOOKKEEP usage to prevent
>> +over sync (See enum dma_resv_usage). One can override it with either
>> +DMA_RESV_USAGE_READ or DMA_RESV_USAGE_WRITE usage during object dependency
>> +setting (either through explicit or implicit mechanism).
>> +
>> +When vm_bind is called for a non-private object while the VM is already
>> +active, the fences need to be copied from VM's shared dma-resv object
>> +(common to all private objects of the VM) to this non-private object.
>> +If this results in performance degradation, then some optimization will
>> +be needed here. This is not a problem for VM's private objects as they use
>> +shared dma-resv object which is always updated on each execbuff submission.
>> +
>> +Also, in VM_BIND mode, use dma-resv apis for determining object activeness
>> +(See dma_resv_test_signaled() and dma_resv_wait_timeout()) and do not use the
>> +older i915_vma active reference tracking which is deprecated. This should be
>> +easier to get it working with the current TTM backend. We can remove the
>> +i915_vma active reference tracking fully while supporting TTM backend for igfx.
>> +
>> +Evictable page table allocations
>> +---------------------------------
>> +Make pagetable allocations evictable and manage them similar to VM_BIND
>> +mapped objects. Page table pages are similar to persistent mappings of a
>> +VM (difference here are that the page table pages will not have an i915_vma
>> +structure and after swapping pages back in, parent page link needs to be
>> +updated).
>> +
>> +Mesa use case
>> +--------------
>> +VM_BIND can potentially reduce the CPU overhead in Mesa (both Vulkan and Iris),
>> +hence improving performance of CPU-bound applications. It also allows us to
>> +implement Vulkan's Sparse Resources. With increasing GPU hardware performance,
>> +reducing CPU overhead becomes more impactful.
>> +
>> +
>> +VM_BIND Compute support
>> +========================
>> +
>> +User/Memory Fence
>> +------------------
>> +The idea is to take a user specified virtual address and install an interrupt
>> +handler to wake up the current task when the memory location passes the user
>> +supplied filter. User/Memory fence is a <address, value> pair. To signal the
>> +user fence, specified value will be written at the specified virtual address
>> +and wakeup the waiting process. User can wait on a user fence with the
>> +gem_wait_user_fence ioctl.
>> +
>> +It also allows the user to emit their own MI_FLUSH/PIPE_CONTROL notify
>> +interrupt within their batches after updating the value to have sub-batch
>> +precision on the wakeup. Each batch can signal a user fence to indicate
>> +the completion of next level batch. The completion of very first level batch
>> +needs to be signaled by the command streamer. The user must provide the
>> +user/memory fence for this via the DRM_I915_GEM_EXECBUFFER_EXT_USER_FENCE
>> +extension of execbuff ioctl, so that KMD can setup the command streamer to
>> +signal it.
>> +
>> +User/Memory fence can also be supplied to the kernel driver to signal/wake up
>> +the user process after completion of an asynchronous operation.
>> +
>> +When VM_BIND ioctl was provided with a user/memory fence via the
>> +I915_VM_BIND_EXT_USER_FENCE extension, it will be signaled upon the completion
>> +of binding of that mapping. All async binds/unbinds are serialized, hence
>> +signaling of user/memory fence also indicate the completion of all previous
>> +binds/unbinds.
>> +
>> +This feature will be derived from the below original work:
>> +https://patchwork.freedesktop.org/patch/349417/
>> +
>> +Long running Compute contexts
>> +------------------------------
>> +Usage of dma-fence expects that they complete in reasonable amount of time.
>> +Compute on the other hand can be long running. Hence it is appropriate for
>> +compute to use user/memory fence and dma-fence usage will be limited to
>> +in-kernel consumption only. This requires an execbuff uapi extension to pass
>> +in user fence (See struct drm_i915_vm_bind_ext_user_fence). Compute must opt-in
>> +for this mechanism with I915_CONTEXT_CREATE_FLAGS_LONG_RUNNING flag during
>> +context creation. The dma-fence based user interfaces like gem_wait ioctl and
>> +execbuff out fence are not allowed on long running contexts. Implicit sync is
>> +not valid as well and is anyway not supported in VM_BIND mode.
>> +
>> +Where GPU page faults are not available, kernel driver upon buffer invalidation
>> +will initiate a suspend (preemption) of long running context with a dma-fence
>> +attached to it. And upon completion of that suspend fence, finish the
>> +invalidation, revalidate the BO and then resume the compute context. This is
>> +done by having a per-context preempt fence (also called suspend fence) proxying
>> +as i915_request fence. This suspend fence is enabled when someone tries to wait
>> +on it, which then triggers the context preemption.
>> +
>> +As this support for context suspension using a preempt fence and the resume work
>> +for the compute mode contexts can get tricky to get it right, it is better to
>> +add this support in drm scheduler so that multiple drivers can make use of it.
>> +That means, it will have a dependency on i915 drm scheduler conversion with GuC
>> +scheduler backend. This should be fine, as the plan is to support compute mode
>> +contexts only with GuC scheduler backend (at least initially). This is much
>> +easier to support with VM_BIND mode compared to the current heavier execbuff
>> +path resource attachment.
>> +
>> +Low Latency Submission
>> +-----------------------
>> +Allows compute UMD to directly submit GPU jobs instead of through execbuff
>> +ioctl. This is made possible by VM_BIND is not being synchronized against
>> +execbuff. VM_BIND allows bind/unbind of mappings required for the directly
>> +submitted jobs.
>> +
>> +Other VM_BIND use cases
>> +========================
>> +
>> +Debugger
>> +---------
>> +With debug event interface user space process (debugger) is able to keep track
>> +of and act upon resources created by another process (debugged) and attached
>> +to GPU via vm_bind interface.
>> +
>> +GPU page faults
>> +----------------
>> +GPU page faults when supported (in future), will only be supported in the
>> +VM_BIND mode. While both the older execbuff mode and the newer VM_BIND mode of
>> +binding will require using dma-fence to ensure residency, the GPU page faults
>> +mode when supported, will not use any dma-fence as residency is purely managed
>> +by installing and removing/invalidating page table entries.
>> +
>> +Page level hints settings
>> +--------------------------
>> +VM_BIND allows any hints setting per mapping instead of per BO.
>> +Possible hints include read-only mapping, placement and atomicity.
>> +Sub-BO level placement hint will be even more relevant with
>> +upcoming GPU on-demand page fault support.
>> +
>> +Page level Cache/CLOS settings
>> +-------------------------------
>> +VM_BIND allows cache/CLOS settings per mapping instead of per BO.
>> +
>> +Shared Virtual Memory (SVM) support
>> +------------------------------------
>> +VM_BIND interface can be used to map system memory directly (without gem BO
>> +abstraction) using the HMM interface. SVM is only supported with GPU page
>> +faults enabled.
>> +
>> +
>> +Broder i915 cleanups
>> +=====================
>> +Supporting this whole new vm_bind mode of binding which comes with its own
>> +use cases to support and the locking requirements requires proper integration
>> +with the existing i915 driver. This calls for some broader i915 driver
>> +cleanups/simplifications for maintainability of the driver going forward.
>> +Here are few things identified and are being looked into.
>> +
>> +- Remove vma lookup cache (eb->gem_context->handles_vma). VM_BIND feature
>> +  do not use it and complexity it brings in is probably more than the
>> +  performance advantage we get in legacy execbuff case.
>> +- Remove vma->open_count counting
>> +- Remove i915_vma active reference tracking. VM_BIND feature will not be using
>> +  it. Instead use underlying BO's dma-resv fence list to determine if a i915_vma
>> +  is active or not.
>> +
>> +
>> +VM_BIND UAPI
>> +=============
>> +
>> +.. kernel-doc:: Documentation/gpu/rfc/i915_vm_bind.h
>> diff --git a/Documentation/gpu/rfc/index.rst b/Documentation/gpu/rfc/index.rst
>> index 91e93a705230..7d10c36b268d 100644
>> --- a/Documentation/gpu/rfc/index.rst
>> +++ b/Documentation/gpu/rfc/index.rst
>> @@ -23,3 +23,7 @@ host such documentation:
>>  .. toctree::
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>      i915_scheduler.rst
>> +
>> +.. toctree::
>> +
>> +    i915_vm_bind.rst
>

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [Intel-gfx] [RFC v3 1/3] drm/doc/rfc: VM_BIND feature design document
@ 2022-05-23 19:05       ` Niranjana Vishwanathapura
  0 siblings, 0 replies; 121+ messages in thread
From: Niranjana Vishwanathapura @ 2022-05-23 19:05 UTC (permalink / raw)
  To: Zanoni, Paulo R
  Cc: intel-gfx, dri-devel, Hellstrom, Thomas, Wilson, Chris P, Vetter,
	Daniel, christian.koenig

On Thu, May 19, 2022 at 03:52:01PM -0700, Zanoni, Paulo R wrote:
>On Tue, 2022-05-17 at 11:32 -0700, Niranjana Vishwanathapura wrote:
>> VM_BIND design document with description of intended use cases.
>>
>> v2: Add more documentation and format as per review comments
>>     from Daniel.
>>
>> Signed-off-by: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>
>> ---
>>
>> diff --git a/Documentation/gpu/rfc/i915_vm_bind.rst b/Documentation/gpu/rfc/i915_vm_bind.rst
>> new file mode 100644
>> index 000000000000..f1be560d313c
>> --- /dev/null
>> +++ b/Documentation/gpu/rfc/i915_vm_bind.rst
>> @@ -0,0 +1,304 @@
>> +==========================================
>> +I915 VM_BIND feature design and use cases
>> +==========================================
>> +
>> +VM_BIND feature
>> +================
>> +DRM_I915_GEM_VM_BIND/UNBIND ioctls allows UMD to bind/unbind GEM buffer
>> +objects (BOs) or sections of a BOs at specified GPU virtual addresses on a
>> +specified address space (VM). These mappings (also referred to as persistent
>> +mappings) will be persistent across multiple GPU submissions (execbuff calls)
>> +issued by the UMD, without user having to provide a list of all required
>> +mappings during each submission (as required by older execbuff mode).
>> +
>> +VM_BIND/UNBIND ioctls will support 'in' and 'out' fences to allow userpace
>> +to specify how the binding/unbinding should sync with other operations
>> +like the GPU job submission. These fences will be timeline 'drm_syncobj's
>> +for non-Compute contexts (See struct drm_i915_vm_bind_ext_timeline_fences).
>> +For Compute contexts, they will be user/memory fences (See struct
>> +drm_i915_vm_bind_ext_user_fence).
>> +
>> +VM_BIND feature is advertised to user via I915_PARAM_HAS_VM_BIND.
>> +User has to opt-in for VM_BIND mode of binding for an address space (VM)
>> +during VM creation time via I915_VM_CREATE_FLAGS_USE_VM_BIND extension.
>> +
>> +VM_BIND/UNBIND ioctl will immediately start binding/unbinding the mapping in an
>> +async worker. The binding and unbinding will work like a special GPU engine.
>> +The binding and unbinding operations are serialized and will wait on specified
>> +input fences before the operation and will signal the output fences upon the
>> +completion of the operation. Due to serialization, completion of an operation
>> +will also indicate that all previous operations are also complete.
>> +
>> +VM_BIND features include:
>> +
>> +* Multiple Virtual Address (VA) mappings can map to the same physical pages
>> +  of an object (aliasing).
>> +* VA mapping can map to a partial section of the BO (partial binding).
>> +* Support capture of persistent mappings in the dump upon GPU error.
>> +* TLB is flushed upon unbind completion. Batching of TLB flushes in some
>> +  use cases will be helpful.
>> +* Asynchronous vm_bind and vm_unbind support with 'in' and 'out' fences.
>> +* Support for userptr gem objects (no special uapi is required for this).
>> +
>> +Execbuff ioctl in VM_BIND mode
>> +-------------------------------
>> +The execbuff ioctl handling in VM_BIND mode differs significantly from the
>> +older method. A VM in VM_BIND mode will not support older execbuff mode of
>> +binding. In VM_BIND mode, execbuff ioctl will not accept any execlist. Hence,
>> +no support for implicit sync. It is expected that the below work will be able
>> +to support requirements of object dependency setting in all use cases:
>> +
>> +"dma-buf: Add an API for exporting sync files"
>> +(https://lwn.net/Articles/859290/)
>
>I would really like to have more details here. The link provided points
>to new ioctls and we're not very familiar with those yet, so I think
>you should really clarify the interaction between the new additions
>here. Having some sample code would be really nice too.
>
>For Mesa at least (and I believe for the other drivers too) we always
>have a few exported buffers in every execbuf call, and we rely on the
>implicit synchronization provided by execbuf to make sure everything
>works. The execbuf ioctl also has some code to flush caches during
>implicit synchronization AFAIR, so I would guess we rely on it too and
>whatever else the Kernel does. Is that covered by the new ioctls?
>
>In addition, as far as I remember, one of the big improvements of
>vm_bind was that it would help reduce ioctl latency and cpu overhead.
>But if making execbuf faster comes at the cost of requiring additional
>ioctls calls for implicit synchronization, which is required on ever
>execbuf call, then I wonder if we'll even get any faster at all.
>Comparing old execbuf vs plain new execbuf without the new required
>ioctls won't make sense.
>
>But maybe I'm wrong and we won't need to call these new ioctls around
>every single execbuf ioctl we submit? Again, more clarification and
>some code examples here would be really nice. This is a big change on
>an important part of the API, we should clarify the new expected usage.
>

Thanks Paulo for the comments.

In VM_BIND mode, the only reason we would need execlist support in
execbuff path is for implicit synchronization. And AFAIK, this work
from Jason is expected replace implict synchronization with new ioctls.
Hence, VM_BIND mode will not be needing execlist support at all.

Based on comments from Daniel and my offline sync with Jason, this
new mechanism from Jason is expected work for vl. For gl, there is a
question of whether it will be performant or not. But it is worth trying
that first. If it is not performant for gl, then only we can consider
adding implicit sync support back for VM_BIND mode.

Daniel, Jason, Ken, any thoughts you can add here?

>> +
>> +This also means, we need an execbuff extension to pass in the batch
>> +buffer addresses (See struct drm_i915_gem_execbuffer_ext_batch_addresses).
>> +
>> +If at all execlist support in execbuff ioctl is deemed necessary for
>> +implicit sync in certain use cases, then support can be added later.
>
>IMHO we really need to sort this and check all the assumptions before
>we commit to any interface. Again, implicit synchronization is
>something we rely on during *every* execbuf ioctl for most workloads.
>

Daniel's earlier feedback was that it is worth Mesa trying this new
mechanism for gl and see it that works. We want to avoid supporting
execlist support for implicit sync in vm_bind mode from the beginning
if it is going to be deemed not necessary.

>
>> +In VM_BIND mode, VA allocation is completely managed by the user instead of
>> +the i915 driver. Hence all VA assignment, eviction are not applicable in
>> +VM_BIND mode. Also, for determining object activeness, VM_BIND mode will not
>> +be using the i915_vma active reference tracking. It will instead use dma-resv
>> +object for that (See `VM_BIND dma_resv usage`_).
>> +
>> +So, a lot of existing code in the execbuff path like relocations, VA evictions,
>> +vma lookup table, implicit sync, vma active reference tracking etc., are not
>> +applicable in VM_BIND mode. Hence, the execbuff path needs to be cleaned up
>> +by clearly separating out the functionalities where the VM_BIND mode differs
>> +from older method and they should be moved to separate files.
>
>I seem to recall some conversations where we were told a bunch of
>ioctls would stop working or make no sense to call when using vm_bind.
>Can we please get a complete list of those? Bonus points if the Kernel
>starts telling us we just called something that makes no sense.
>

Which ioctls you are talking about here?
We do not support GEM_WAIT ioctls, but that is only for compute mode (which is
already documented in this patch).

>> +
>> +VM_PRIVATE objects
>> +-------------------
>> +By default, BOs can be mapped on multiple VMs and can also be dma-buf
>> +exported. Hence these BOs are referred to as Shared BOs.
>> +During each execbuff submission, the request fence must be added to the
>> +dma-resv fence list of all shared BOs mapped on the VM.
>> +
>> +VM_BIND feature introduces an optimization where user can create BO which
>> +is private to a specified VM via I915_GEM_CREATE_EXT_VM_PRIVATE flag during
>> +BO creation. Unlike Shared BOs, these VM private BOs can only be mapped on
>> +the VM they are private to and can't be dma-buf exported.
>> +All private BOs of a VM share the dma-resv object. Hence during each execbuff
>> +submission, they need only one dma-resv fence list updated. Thus, the fast
>> +path (where required mappings are already bound) submission latency is O(1)
>> +w.r.t the number of VM private BOs.
>
>I know we already discussed this, but just to document it publicly: the
>ideal case for user space would be that every BO is created as private
>but then we'd have an ioctl to convert it to non-private (without the
>need to have a non-private->private interface).
>
>An explanation on why we can't have an ioctl to mark as exported a
>buffer that was previously vm_private would be really appreciated.
>

Ok, I can some notes on that.
The reason being the fact that this require changing the dma-resv object
for gem object, hence the object locking also. This will add complications
as we have to sync with any pending operations. It might be easier for
UMDs to do it themselves by copying the object contexts to a new object.

Niranjana

>Thanks,
>Paulo
>
>
>> +
>> +VM_BIND locking hirarchy
>> +-------------------------
>> +The locking design here supports the older (execlist based) execbuff mode, the
>> +newer VM_BIND mode, the VM_BIND mode with GPU page faults and possible future
>> +system allocator support (See `Shared Virtual Memory (SVM) support`_).
>> +The older execbuff mode and the newer VM_BIND mode without page faults manages
>> +residency of backing storage using dma_fence. The VM_BIND mode with page faults
>> +and the system allocator support do not use any dma_fence at all.
>> +
>> +VM_BIND locking order is as below.
>> +
>> +1) Lock-A: A vm_bind mutex will protect vm_bind lists. This lock is taken in
>> +   vm_bind/vm_unbind ioctl calls, in the execbuff path and while releasing the
>> +   mapping.
>> +
>> +   In future, when GPU page faults are supported, we can potentially use a
>> +   rwsem instead, so that multiple page fault handlers can take the read side
>> +   lock to lookup the mapping and hence can run in parallel.
>> +   The older execbuff mode of binding do not need this lock.
>> +
>> +2) Lock-B: The object's dma-resv lock will protect i915_vma state and needs to
>> +   be held while binding/unbinding a vma in the async worker and while updating
>> +   dma-resv fence list of an object. Note that private BOs of a VM will all
>> +   share a dma-resv object.
>> +
>> +   The future system allocator support will use the HMM prescribed locking
>> +   instead.
>> +
>> +3) Lock-C: Spinlock/s to protect some of the VM's lists like the list of
>> +   invalidated vmas (due to eviction and userptr invalidation) etc.
>> +
>> +When GPU page faults are supported, the execbuff path do not take any of these
>> +locks. There we will simply smash the new batch buffer address into the ring and
>> +then tell the scheduler run that. The lock taking only happens from the page
>> +fault handler, where we take lock-A in read mode, whichever lock-B we need to
>> +find the backing storage (dma_resv lock for gem objects, and hmm/core mm for
>> +system allocator) and some additional locks (lock-D) for taking care of page
>> +table races. Page fault mode should not need to ever manipulate the vm lists,
>> +so won't ever need lock-C.
>> +
>> +VM_BIND LRU handling
>> +---------------------
>> +We need to ensure VM_BIND mapped objects are properly LRU tagged to avoid
>> +performance degradation. We will also need support for bulk LRU movement of
>> +VM_BIND objects to avoid additional latencies in execbuff path.
>> +
>> +The page table pages are similar to VM_BIND mapped objects (See
>> +`Evictable page table allocations`_) and are maintained per VM and needs to
>> +be pinned in memory when VM is made active (ie., upon an execbuff call with
>> +that VM). So, bulk LRU movement of page table pages is also needed.
>> +
>> +The i915 shrinker LRU has stopped being an LRU. So, it should also be moved
>> +over to the ttm LRU in some fashion to make sure we once again have a reasonable
>> +and consistent memory aging and reclaim architecture.
>> +
>> +VM_BIND dma_resv usage
>> +-----------------------
>> +Fences needs to be added to all VM_BIND mapped objects. During each execbuff
>> +submission, they are added with DMA_RESV_USAGE_BOOKKEEP usage to prevent
>> +over sync (See enum dma_resv_usage). One can override it with either
>> +DMA_RESV_USAGE_READ or DMA_RESV_USAGE_WRITE usage during object dependency
>> +setting (either through explicit or implicit mechanism).
>> +
>> +When vm_bind is called for a non-private object while the VM is already
>> +active, the fences need to be copied from VM's shared dma-resv object
>> +(common to all private objects of the VM) to this non-private object.
>> +If this results in performance degradation, then some optimization will
>> +be needed here. This is not a problem for VM's private objects as they use
>> +shared dma-resv object which is always updated on each execbuff submission.
>> +
>> +Also, in VM_BIND mode, use dma-resv apis for determining object activeness
>> +(See dma_resv_test_signaled() and dma_resv_wait_timeout()) and do not use the
>> +older i915_vma active reference tracking which is deprecated. This should be
>> +easier to get it working with the current TTM backend. We can remove the
>> +i915_vma active reference tracking fully while supporting TTM backend for igfx.
>> +
>> +Evictable page table allocations
>> +---------------------------------
>> +Make pagetable allocations evictable and manage them similar to VM_BIND
>> +mapped objects. Page table pages are similar to persistent mappings of a
>> +VM (difference here are that the page table pages will not have an i915_vma
>> +structure and after swapping pages back in, parent page link needs to be
>> +updated).
>> +
>> +Mesa use case
>> +--------------
>> +VM_BIND can potentially reduce the CPU overhead in Mesa (both Vulkan and Iris),
>> +hence improving performance of CPU-bound applications. It also allows us to
>> +implement Vulkan's Sparse Resources. With increasing GPU hardware performance,
>> +reducing CPU overhead becomes more impactful.
>> +
>> +
>> +VM_BIND Compute support
>> +========================
>> +
>> +User/Memory Fence
>> +------------------
>> +The idea is to take a user specified virtual address and install an interrupt
>> +handler to wake up the current task when the memory location passes the user
>> +supplied filter. User/Memory fence is a <address, value> pair. To signal the
>> +user fence, specified value will be written at the specified virtual address
>> +and wakeup the waiting process. User can wait on a user fence with the
>> +gem_wait_user_fence ioctl.
>> +
>> +It also allows the user to emit their own MI_FLUSH/PIPE_CONTROL notify
>> +interrupt within their batches after updating the value to have sub-batch
>> +precision on the wakeup. Each batch can signal a user fence to indicate
>> +the completion of next level batch. The completion of very first level batch
>> +needs to be signaled by the command streamer. The user must provide the
>> +user/memory fence for this via the DRM_I915_GEM_EXECBUFFER_EXT_USER_FENCE
>> +extension of execbuff ioctl, so that KMD can setup the command streamer to
>> +signal it.
>> +
>> +User/Memory fence can also be supplied to the kernel driver to signal/wake up
>> +the user process after completion of an asynchronous operation.
>> +
>> +When VM_BIND ioctl was provided with a user/memory fence via the
>> +I915_VM_BIND_EXT_USER_FENCE extension, it will be signaled upon the completion
>> +of binding of that mapping. All async binds/unbinds are serialized, hence
>> +signaling of user/memory fence also indicate the completion of all previous
>> +binds/unbinds.
>> +
>> +This feature will be derived from the below original work:
>> +https://patchwork.freedesktop.org/patch/349417/
>> +
>> +Long running Compute contexts
>> +------------------------------
>> +Usage of dma-fence expects that they complete in reasonable amount of time.
>> +Compute on the other hand can be long running. Hence it is appropriate for
>> +compute to use user/memory fence and dma-fence usage will be limited to
>> +in-kernel consumption only. This requires an execbuff uapi extension to pass
>> +in user fence (See struct drm_i915_vm_bind_ext_user_fence). Compute must opt-in
>> +for this mechanism with I915_CONTEXT_CREATE_FLAGS_LONG_RUNNING flag during
>> +context creation. The dma-fence based user interfaces like gem_wait ioctl and
>> +execbuff out fence are not allowed on long running contexts. Implicit sync is
>> +not valid as well and is anyway not supported in VM_BIND mode.
>> +
>> +Where GPU page faults are not available, kernel driver upon buffer invalidation
>> +will initiate a suspend (preemption) of long running context with a dma-fence
>> +attached to it. And upon completion of that suspend fence, finish the
>> +invalidation, revalidate the BO and then resume the compute context. This is
>> +done by having a per-context preempt fence (also called suspend fence) proxying
>> +as i915_request fence. This suspend fence is enabled when someone tries to wait
>> +on it, which then triggers the context preemption.
>> +
>> +As this support for context suspension using a preempt fence and the resume work
>> +for the compute mode contexts can get tricky to get it right, it is better to
>> +add this support in drm scheduler so that multiple drivers can make use of it.
>> +That means, it will have a dependency on i915 drm scheduler conversion with GuC
>> +scheduler backend. This should be fine, as the plan is to support compute mode
>> +contexts only with GuC scheduler backend (at least initially). This is much
>> +easier to support with VM_BIND mode compared to the current heavier execbuff
>> +path resource attachment.
>> +
>> +Low Latency Submission
>> +-----------------------
>> +Allows compute UMD to directly submit GPU jobs instead of through execbuff
>> +ioctl. This is made possible by VM_BIND is not being synchronized against
>> +execbuff. VM_BIND allows bind/unbind of mappings required for the directly
>> +submitted jobs.
>> +
>> +Other VM_BIND use cases
>> +========================
>> +
>> +Debugger
>> +---------
>> +With debug event interface user space process (debugger) is able to keep track
>> +of and act upon resources created by another process (debugged) and attached
>> +to GPU via vm_bind interface.
>> +
>> +GPU page faults
>> +----------------
>> +GPU page faults when supported (in future), will only be supported in the
>> +VM_BIND mode. While both the older execbuff mode and the newer VM_BIND mode of
>> +binding will require using dma-fence to ensure residency, the GPU page faults
>> +mode when supported, will not use any dma-fence as residency is purely managed
>> +by installing and removing/invalidating page table entries.
>> +
>> +Page level hints settings
>> +--------------------------
>> +VM_BIND allows any hints setting per mapping instead of per BO.
>> +Possible hints include read-only mapping, placement and atomicity.
>> +Sub-BO level placement hint will be even more relevant with
>> +upcoming GPU on-demand page fault support.
>> +
>> +Page level Cache/CLOS settings
>> +-------------------------------
>> +VM_BIND allows cache/CLOS settings per mapping instead of per BO.
>> +
>> +Shared Virtual Memory (SVM) support
>> +------------------------------------
>> +VM_BIND interface can be used to map system memory directly (without gem BO
>> +abstraction) using the HMM interface. SVM is only supported with GPU page
>> +faults enabled.
>> +
>> +
>> +Broder i915 cleanups
>> +=====================
>> +Supporting this whole new vm_bind mode of binding which comes with its own
>> +use cases to support and the locking requirements requires proper integration
>> +with the existing i915 driver. This calls for some broader i915 driver
>> +cleanups/simplifications for maintainability of the driver going forward.
>> +Here are few things identified and are being looked into.
>> +
>> +- Remove vma lookup cache (eb->gem_context->handles_vma). VM_BIND feature
>> +  do not use it and complexity it brings in is probably more than the
>> +  performance advantage we get in legacy execbuff case.
>> +- Remove vma->open_count counting
>> +- Remove i915_vma active reference tracking. VM_BIND feature will not be using
>> +  it. Instead use underlying BO's dma-resv fence list to determine if a i915_vma
>> +  is active or not.
>> +
>> +
>> +VM_BIND UAPI
>> +=============
>> +
>> +.. kernel-doc:: Documentation/gpu/rfc/i915_vm_bind.h
>> diff --git a/Documentation/gpu/rfc/index.rst b/Documentation/gpu/rfc/index.rst
>> index 91e93a705230..7d10c36b268d 100644
>> --- a/Documentation/gpu/rfc/index.rst
>> +++ b/Documentation/gpu/rfc/index.rst
>> @@ -23,3 +23,7 @@ host such documentation:
>>  .. toctree::
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>      i915_scheduler.rst
>> +
>> +.. toctree::
>> +
>> +    i915_vm_bind.rst
>

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [RFC v3 1/3] drm/doc/rfc: VM_BIND feature design document
  2022-05-23 19:05       ` [Intel-gfx] " Niranjana Vishwanathapura
@ 2022-05-23 19:08         ` Niranjana Vishwanathapura
  -1 siblings, 0 replies; 121+ messages in thread
From: Niranjana Vishwanathapura @ 2022-05-23 19:08 UTC (permalink / raw)
  To: Zanoni, Paulo R
  Cc: Brost, Matthew, Kenneth Graunke, intel-gfx, Wilson, Chris P,
	Hellstrom, Thomas, dri-devel, jason, Vetter, Daniel,
	christian.koenig

On Mon, May 23, 2022 at 12:05:05PM -0700, Niranjana Vishwanathapura wrote:
>On Thu, May 19, 2022 at 03:52:01PM -0700, Zanoni, Paulo R wrote:
>>On Tue, 2022-05-17 at 11:32 -0700, Niranjana Vishwanathapura wrote:
>>>VM_BIND design document with description of intended use cases.
>>>
>>>v2: Add more documentation and format as per review comments
>>>    from Daniel.
>>>
>>>Signed-off-by: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>
>>>---
>>>
>>>diff --git a/Documentation/gpu/rfc/i915_vm_bind.rst b/Documentation/gpu/rfc/i915_vm_bind.rst
>>>new file mode 100644
>>>index 000000000000..f1be560d313c
>>>--- /dev/null
>>>+++ b/Documentation/gpu/rfc/i915_vm_bind.rst
>>>@@ -0,0 +1,304 @@
>>>+==========================================
>>>+I915 VM_BIND feature design and use cases
>>>+==========================================
>>>+
>>>+VM_BIND feature
>>>+================
>>>+DRM_I915_GEM_VM_BIND/UNBIND ioctls allows UMD to bind/unbind GEM buffer
>>>+objects (BOs) or sections of a BOs at specified GPU virtual addresses on a
>>>+specified address space (VM). These mappings (also referred to as persistent
>>>+mappings) will be persistent across multiple GPU submissions (execbuff calls)
>>>+issued by the UMD, without user having to provide a list of all required
>>>+mappings during each submission (as required by older execbuff mode).
>>>+
>>>+VM_BIND/UNBIND ioctls will support 'in' and 'out' fences to allow userpace
>>>+to specify how the binding/unbinding should sync with other operations
>>>+like the GPU job submission. These fences will be timeline 'drm_syncobj's
>>>+for non-Compute contexts (See struct drm_i915_vm_bind_ext_timeline_fences).
>>>+For Compute contexts, they will be user/memory fences (See struct
>>>+drm_i915_vm_bind_ext_user_fence).
>>>+
>>>+VM_BIND feature is advertised to user via I915_PARAM_HAS_VM_BIND.
>>>+User has to opt-in for VM_BIND mode of binding for an address space (VM)
>>>+during VM creation time via I915_VM_CREATE_FLAGS_USE_VM_BIND extension.
>>>+
>>>+VM_BIND/UNBIND ioctl will immediately start binding/unbinding the mapping in an
>>>+async worker. The binding and unbinding will work like a special GPU engine.
>>>+The binding and unbinding operations are serialized and will wait on specified
>>>+input fences before the operation and will signal the output fences upon the
>>>+completion of the operation. Due to serialization, completion of an operation
>>>+will also indicate that all previous operations are also complete.
>>>+
>>>+VM_BIND features include:
>>>+
>>>+* Multiple Virtual Address (VA) mappings can map to the same physical pages
>>>+  of an object (aliasing).
>>>+* VA mapping can map to a partial section of the BO (partial binding).
>>>+* Support capture of persistent mappings in the dump upon GPU error.
>>>+* TLB is flushed upon unbind completion. Batching of TLB flushes in some
>>>+  use cases will be helpful.
>>>+* Asynchronous vm_bind and vm_unbind support with 'in' and 'out' fences.
>>>+* Support for userptr gem objects (no special uapi is required for this).
>>>+
>>>+Execbuff ioctl in VM_BIND mode
>>>+-------------------------------
>>>+The execbuff ioctl handling in VM_BIND mode differs significantly from the
>>>+older method. A VM in VM_BIND mode will not support older execbuff mode of
>>>+binding. In VM_BIND mode, execbuff ioctl will not accept any execlist. Hence,
>>>+no support for implicit sync. It is expected that the below work will be able
>>>+to support requirements of object dependency setting in all use cases:
>>>+
>>>+"dma-buf: Add an API for exporting sync files"
>>>+(https://lwn.net/Articles/859290/)
>>
>>I would really like to have more details here. The link provided points
>>to new ioctls and we're not very familiar with those yet, so I think
>>you should really clarify the interaction between the new additions
>>here. Having some sample code would be really nice too.
>>
>>For Mesa at least (and I believe for the other drivers too) we always
>>have a few exported buffers in every execbuf call, and we rely on the
>>implicit synchronization provided by execbuf to make sure everything
>>works. The execbuf ioctl also has some code to flush caches during
>>implicit synchronization AFAIR, so I would guess we rely on it too and
>>whatever else the Kernel does. Is that covered by the new ioctls?
>>
>>In addition, as far as I remember, one of the big improvements of
>>vm_bind was that it would help reduce ioctl latency and cpu overhead.
>>But if making execbuf faster comes at the cost of requiring additional
>>ioctls calls for implicit synchronization, which is required on ever
>>execbuf call, then I wonder if we'll even get any faster at all.
>>Comparing old execbuf vs plain new execbuf without the new required
>>ioctls won't make sense.
>>
>>But maybe I'm wrong and we won't need to call these new ioctls around
>>every single execbuf ioctl we submit? Again, more clarification and
>>some code examples here would be really nice. This is a big change on
>>an important part of the API, we should clarify the new expected usage.
>>
>
>Thanks Paulo for the comments.
>
>In VM_BIND mode, the only reason we would need execlist support in
>execbuff path is for implicit synchronization. And AFAIK, this work
>from Jason is expected replace implict synchronization with new ioctls.
>Hence, VM_BIND mode will not be needing execlist support at all.
>
>Based on comments from Daniel and my offline sync with Jason, this
>new mechanism from Jason is expected work for vl. For gl, there is a
>question of whether it will be performant or not. But it is worth trying
>that first. If it is not performant for gl, then only we can consider
>adding implicit sync support back for VM_BIND mode.
>
>Daniel, Jason, Ken, any thoughts you can add here?

CC'ing Ken.

>
>>>+
>>>+This also means, we need an execbuff extension to pass in the batch
>>>+buffer addresses (See struct drm_i915_gem_execbuffer_ext_batch_addresses).
>>>+
>>>+If at all execlist support in execbuff ioctl is deemed necessary for
>>>+implicit sync in certain use cases, then support can be added later.
>>
>>IMHO we really need to sort this and check all the assumptions before
>>we commit to any interface. Again, implicit synchronization is
>>something we rely on during *every* execbuf ioctl for most workloads.
>>
>
>Daniel's earlier feedback was that it is worth Mesa trying this new
>mechanism for gl and see it that works. We want to avoid supporting
>execlist support for implicit sync in vm_bind mode from the beginning
>if it is going to be deemed not necessary.
>
>>
>>>+In VM_BIND mode, VA allocation is completely managed by the user instead of
>>>+the i915 driver. Hence all VA assignment, eviction are not applicable in
>>>+VM_BIND mode. Also, for determining object activeness, VM_BIND mode will not
>>>+be using the i915_vma active reference tracking. It will instead use dma-resv
>>>+object for that (See `VM_BIND dma_resv usage`_).
>>>+
>>>+So, a lot of existing code in the execbuff path like relocations, VA evictions,
>>>+vma lookup table, implicit sync, vma active reference tracking etc., are not
>>>+applicable in VM_BIND mode. Hence, the execbuff path needs to be cleaned up
>>>+by clearly separating out the functionalities where the VM_BIND mode differs
>>>+from older method and they should be moved to separate files.
>>
>>I seem to recall some conversations where we were told a bunch of
>>ioctls would stop working or make no sense to call when using vm_bind.
>>Can we please get a complete list of those? Bonus points if the Kernel
>>starts telling us we just called something that makes no sense.
>>
>
>Which ioctls you are talking about here?
>We do not support GEM_WAIT ioctls, but that is only for compute mode (which is
>already documented in this patch).
>
>>>+
>>>+VM_PRIVATE objects
>>>+-------------------
>>>+By default, BOs can be mapped on multiple VMs and can also be dma-buf
>>>+exported. Hence these BOs are referred to as Shared BOs.
>>>+During each execbuff submission, the request fence must be added to the
>>>+dma-resv fence list of all shared BOs mapped on the VM.
>>>+
>>>+VM_BIND feature introduces an optimization where user can create BO which
>>>+is private to a specified VM via I915_GEM_CREATE_EXT_VM_PRIVATE flag during
>>>+BO creation. Unlike Shared BOs, these VM private BOs can only be mapped on
>>>+the VM they are private to and can't be dma-buf exported.
>>>+All private BOs of a VM share the dma-resv object. Hence during each execbuff
>>>+submission, they need only one dma-resv fence list updated. Thus, the fast
>>>+path (where required mappings are already bound) submission latency is O(1)
>>>+w.r.t the number of VM private BOs.
>>
>>I know we already discussed this, but just to document it publicly: the
>>ideal case for user space would be that every BO is created as private
>>but then we'd have an ioctl to convert it to non-private (without the
>>need to have a non-private->private interface).
>>
>>An explanation on why we can't have an ioctl to mark as exported a
>>buffer that was previously vm_private would be really appreciated.
>>
>
>Ok, I can some notes on that.
>The reason being the fact that this require changing the dma-resv object
>for gem object, hence the object locking also. This will add complications
>as we have to sync with any pending operations. It might be easier for
>UMDs to do it themselves by copying the object contexts to a new object.
>
>Niranjana
>
>>Thanks,
>>Paulo
>>
>>
>>>+
>>>+VM_BIND locking hirarchy
>>>+-------------------------
>>>+The locking design here supports the older (execlist based) execbuff mode, the
>>>+newer VM_BIND mode, the VM_BIND mode with GPU page faults and possible future
>>>+system allocator support (See `Shared Virtual Memory (SVM) support`_).
>>>+The older execbuff mode and the newer VM_BIND mode without page faults manages
>>>+residency of backing storage using dma_fence. The VM_BIND mode with page faults
>>>+and the system allocator support do not use any dma_fence at all.
>>>+
>>>+VM_BIND locking order is as below.
>>>+
>>>+1) Lock-A: A vm_bind mutex will protect vm_bind lists. This lock is taken in
>>>+   vm_bind/vm_unbind ioctl calls, in the execbuff path and while releasing the
>>>+   mapping.
>>>+
>>>+   In future, when GPU page faults are supported, we can potentially use a
>>>+   rwsem instead, so that multiple page fault handlers can take the read side
>>>+   lock to lookup the mapping and hence can run in parallel.
>>>+   The older execbuff mode of binding do not need this lock.
>>>+
>>>+2) Lock-B: The object's dma-resv lock will protect i915_vma state and needs to
>>>+   be held while binding/unbinding a vma in the async worker and while updating
>>>+   dma-resv fence list of an object. Note that private BOs of a VM will all
>>>+   share a dma-resv object.
>>>+
>>>+   The future system allocator support will use the HMM prescribed locking
>>>+   instead.
>>>+
>>>+3) Lock-C: Spinlock/s to protect some of the VM's lists like the list of
>>>+   invalidated vmas (due to eviction and userptr invalidation) etc.
>>>+
>>>+When GPU page faults are supported, the execbuff path do not take any of these
>>>+locks. There we will simply smash the new batch buffer address into the ring and
>>>+then tell the scheduler run that. The lock taking only happens from the page
>>>+fault handler, where we take lock-A in read mode, whichever lock-B we need to
>>>+find the backing storage (dma_resv lock for gem objects, and hmm/core mm for
>>>+system allocator) and some additional locks (lock-D) for taking care of page
>>>+table races. Page fault mode should not need to ever manipulate the vm lists,
>>>+so won't ever need lock-C.
>>>+
>>>+VM_BIND LRU handling
>>>+---------------------
>>>+We need to ensure VM_BIND mapped objects are properly LRU tagged to avoid
>>>+performance degradation. We will also need support for bulk LRU movement of
>>>+VM_BIND objects to avoid additional latencies in execbuff path.
>>>+
>>>+The page table pages are similar to VM_BIND mapped objects (See
>>>+`Evictable page table allocations`_) and are maintained per VM and needs to
>>>+be pinned in memory when VM is made active (ie., upon an execbuff call with
>>>+that VM). So, bulk LRU movement of page table pages is also needed.
>>>+
>>>+The i915 shrinker LRU has stopped being an LRU. So, it should also be moved
>>>+over to the ttm LRU in some fashion to make sure we once again have a reasonable
>>>+and consistent memory aging and reclaim architecture.
>>>+
>>>+VM_BIND dma_resv usage
>>>+-----------------------
>>>+Fences needs to be added to all VM_BIND mapped objects. During each execbuff
>>>+submission, they are added with DMA_RESV_USAGE_BOOKKEEP usage to prevent
>>>+over sync (See enum dma_resv_usage). One can override it with either
>>>+DMA_RESV_USAGE_READ or DMA_RESV_USAGE_WRITE usage during object dependency
>>>+setting (either through explicit or implicit mechanism).
>>>+
>>>+When vm_bind is called for a non-private object while the VM is already
>>>+active, the fences need to be copied from VM's shared dma-resv object
>>>+(common to all private objects of the VM) to this non-private object.
>>>+If this results in performance degradation, then some optimization will
>>>+be needed here. This is not a problem for VM's private objects as they use
>>>+shared dma-resv object which is always updated on each execbuff submission.
>>>+
>>>+Also, in VM_BIND mode, use dma-resv apis for determining object activeness
>>>+(See dma_resv_test_signaled() and dma_resv_wait_timeout()) and do not use the
>>>+older i915_vma active reference tracking which is deprecated. This should be
>>>+easier to get it working with the current TTM backend. We can remove the
>>>+i915_vma active reference tracking fully while supporting TTM backend for igfx.
>>>+
>>>+Evictable page table allocations
>>>+---------------------------------
>>>+Make pagetable allocations evictable and manage them similar to VM_BIND
>>>+mapped objects. Page table pages are similar to persistent mappings of a
>>>+VM (difference here are that the page table pages will not have an i915_vma
>>>+structure and after swapping pages back in, parent page link needs to be
>>>+updated).
>>>+
>>>+Mesa use case
>>>+--------------
>>>+VM_BIND can potentially reduce the CPU overhead in Mesa (both Vulkan and Iris),
>>>+hence improving performance of CPU-bound applications. It also allows us to
>>>+implement Vulkan's Sparse Resources. With increasing GPU hardware performance,
>>>+reducing CPU overhead becomes more impactful.
>>>+
>>>+
>>>+VM_BIND Compute support
>>>+========================
>>>+
>>>+User/Memory Fence
>>>+------------------
>>>+The idea is to take a user specified virtual address and install an interrupt
>>>+handler to wake up the current task when the memory location passes the user
>>>+supplied filter. User/Memory fence is a <address, value> pair. To signal the
>>>+user fence, specified value will be written at the specified virtual address
>>>+and wakeup the waiting process. User can wait on a user fence with the
>>>+gem_wait_user_fence ioctl.
>>>+
>>>+It also allows the user to emit their own MI_FLUSH/PIPE_CONTROL notify
>>>+interrupt within their batches after updating the value to have sub-batch
>>>+precision on the wakeup. Each batch can signal a user fence to indicate
>>>+the completion of next level batch. The completion of very first level batch
>>>+needs to be signaled by the command streamer. The user must provide the
>>>+user/memory fence for this via the DRM_I915_GEM_EXECBUFFER_EXT_USER_FENCE
>>>+extension of execbuff ioctl, so that KMD can setup the command streamer to
>>>+signal it.
>>>+
>>>+User/Memory fence can also be supplied to the kernel driver to signal/wake up
>>>+the user process after completion of an asynchronous operation.
>>>+
>>>+When VM_BIND ioctl was provided with a user/memory fence via the
>>>+I915_VM_BIND_EXT_USER_FENCE extension, it will be signaled upon the completion
>>>+of binding of that mapping. All async binds/unbinds are serialized, hence
>>>+signaling of user/memory fence also indicate the completion of all previous
>>>+binds/unbinds.
>>>+
>>>+This feature will be derived from the below original work:
>>>+https://patchwork.freedesktop.org/patch/349417/
>>>+
>>>+Long running Compute contexts
>>>+------------------------------
>>>+Usage of dma-fence expects that they complete in reasonable amount of time.
>>>+Compute on the other hand can be long running. Hence it is appropriate for
>>>+compute to use user/memory fence and dma-fence usage will be limited to
>>>+in-kernel consumption only. This requires an execbuff uapi extension to pass
>>>+in user fence (See struct drm_i915_vm_bind_ext_user_fence). Compute must opt-in
>>>+for this mechanism with I915_CONTEXT_CREATE_FLAGS_LONG_RUNNING flag during
>>>+context creation. The dma-fence based user interfaces like gem_wait ioctl and
>>>+execbuff out fence are not allowed on long running contexts. Implicit sync is
>>>+not valid as well and is anyway not supported in VM_BIND mode.
>>>+
>>>+Where GPU page faults are not available, kernel driver upon buffer invalidation
>>>+will initiate a suspend (preemption) of long running context with a dma-fence
>>>+attached to it. And upon completion of that suspend fence, finish the
>>>+invalidation, revalidate the BO and then resume the compute context. This is
>>>+done by having a per-context preempt fence (also called suspend fence) proxying
>>>+as i915_request fence. This suspend fence is enabled when someone tries to wait
>>>+on it, which then triggers the context preemption.
>>>+
>>>+As this support for context suspension using a preempt fence and the resume work
>>>+for the compute mode contexts can get tricky to get it right, it is better to
>>>+add this support in drm scheduler so that multiple drivers can make use of it.
>>>+That means, it will have a dependency on i915 drm scheduler conversion with GuC
>>>+scheduler backend. This should be fine, as the plan is to support compute mode
>>>+contexts only with GuC scheduler backend (at least initially). This is much
>>>+easier to support with VM_BIND mode compared to the current heavier execbuff
>>>+path resource attachment.
>>>+
>>>+Low Latency Submission
>>>+-----------------------
>>>+Allows compute UMD to directly submit GPU jobs instead of through execbuff
>>>+ioctl. This is made possible by VM_BIND is not being synchronized against
>>>+execbuff. VM_BIND allows bind/unbind of mappings required for the directly
>>>+submitted jobs.
>>>+
>>>+Other VM_BIND use cases
>>>+========================
>>>+
>>>+Debugger
>>>+---------
>>>+With debug event interface user space process (debugger) is able to keep track
>>>+of and act upon resources created by another process (debugged) and attached
>>>+to GPU via vm_bind interface.
>>>+
>>>+GPU page faults
>>>+----------------
>>>+GPU page faults when supported (in future), will only be supported in the
>>>+VM_BIND mode. While both the older execbuff mode and the newer VM_BIND mode of
>>>+binding will require using dma-fence to ensure residency, the GPU page faults
>>>+mode when supported, will not use any dma-fence as residency is purely managed
>>>+by installing and removing/invalidating page table entries.
>>>+
>>>+Page level hints settings
>>>+--------------------------
>>>+VM_BIND allows any hints setting per mapping instead of per BO.
>>>+Possible hints include read-only mapping, placement and atomicity.
>>>+Sub-BO level placement hint will be even more relevant with
>>>+upcoming GPU on-demand page fault support.
>>>+
>>>+Page level Cache/CLOS settings
>>>+-------------------------------
>>>+VM_BIND allows cache/CLOS settings per mapping instead of per BO.
>>>+
>>>+Shared Virtual Memory (SVM) support
>>>+------------------------------------
>>>+VM_BIND interface can be used to map system memory directly (without gem BO
>>>+abstraction) using the HMM interface. SVM is only supported with GPU page
>>>+faults enabled.
>>>+
>>>+
>>>+Broder i915 cleanups
>>>+=====================
>>>+Supporting this whole new vm_bind mode of binding which comes with its own
>>>+use cases to support and the locking requirements requires proper integration
>>>+with the existing i915 driver. This calls for some broader i915 driver
>>>+cleanups/simplifications for maintainability of the driver going forward.
>>>+Here are few things identified and are being looked into.
>>>+
>>>+- Remove vma lookup cache (eb->gem_context->handles_vma). VM_BIND feature
>>>+  do not use it and complexity it brings in is probably more than the
>>>+  performance advantage we get in legacy execbuff case.
>>>+- Remove vma->open_count counting
>>>+- Remove i915_vma active reference tracking. VM_BIND feature will not be using
>>>+  it. Instead use underlying BO's dma-resv fence list to determine if a i915_vma
>>>+  is active or not.
>>>+
>>>+
>>>+VM_BIND UAPI
>>>+=============
>>>+
>>>+.. kernel-doc:: Documentation/gpu/rfc/i915_vm_bind.h
>>>diff --git a/Documentation/gpu/rfc/index.rst b/Documentation/gpu/rfc/index.rst
>>>index 91e93a705230..7d10c36b268d 100644
>>>--- a/Documentation/gpu/rfc/index.rst
>>>+++ b/Documentation/gpu/rfc/index.rst
>>>@@ -23,3 +23,7 @@ host such documentation:
>>> .. toctree::
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>     i915_scheduler.rst
>>>+
>>>+.. toctree::
>>>+
>>>+    i915_vm_bind.rst
>>

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [Intel-gfx] [RFC v3 1/3] drm/doc/rfc: VM_BIND feature design document
@ 2022-05-23 19:08         ` Niranjana Vishwanathapura
  0 siblings, 0 replies; 121+ messages in thread
From: Niranjana Vishwanathapura @ 2022-05-23 19:08 UTC (permalink / raw)
  To: Zanoni, Paulo R
  Cc: Kenneth Graunke, intel-gfx, Wilson, Chris P, Hellstrom, Thomas,
	dri-devel, Vetter, Daniel, christian.koenig

On Mon, May 23, 2022 at 12:05:05PM -0700, Niranjana Vishwanathapura wrote:
>On Thu, May 19, 2022 at 03:52:01PM -0700, Zanoni, Paulo R wrote:
>>On Tue, 2022-05-17 at 11:32 -0700, Niranjana Vishwanathapura wrote:
>>>VM_BIND design document with description of intended use cases.
>>>
>>>v2: Add more documentation and format as per review comments
>>>    from Daniel.
>>>
>>>Signed-off-by: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>
>>>---
>>>
>>>diff --git a/Documentation/gpu/rfc/i915_vm_bind.rst b/Documentation/gpu/rfc/i915_vm_bind.rst
>>>new file mode 100644
>>>index 000000000000..f1be560d313c
>>>--- /dev/null
>>>+++ b/Documentation/gpu/rfc/i915_vm_bind.rst
>>>@@ -0,0 +1,304 @@
>>>+==========================================
>>>+I915 VM_BIND feature design and use cases
>>>+==========================================
>>>+
>>>+VM_BIND feature
>>>+================
>>>+DRM_I915_GEM_VM_BIND/UNBIND ioctls allows UMD to bind/unbind GEM buffer
>>>+objects (BOs) or sections of a BOs at specified GPU virtual addresses on a
>>>+specified address space (VM). These mappings (also referred to as persistent
>>>+mappings) will be persistent across multiple GPU submissions (execbuff calls)
>>>+issued by the UMD, without user having to provide a list of all required
>>>+mappings during each submission (as required by older execbuff mode).
>>>+
>>>+VM_BIND/UNBIND ioctls will support 'in' and 'out' fences to allow userpace
>>>+to specify how the binding/unbinding should sync with other operations
>>>+like the GPU job submission. These fences will be timeline 'drm_syncobj's
>>>+for non-Compute contexts (See struct drm_i915_vm_bind_ext_timeline_fences).
>>>+For Compute contexts, they will be user/memory fences (See struct
>>>+drm_i915_vm_bind_ext_user_fence).
>>>+
>>>+VM_BIND feature is advertised to user via I915_PARAM_HAS_VM_BIND.
>>>+User has to opt-in for VM_BIND mode of binding for an address space (VM)
>>>+during VM creation time via I915_VM_CREATE_FLAGS_USE_VM_BIND extension.
>>>+
>>>+VM_BIND/UNBIND ioctl will immediately start binding/unbinding the mapping in an
>>>+async worker. The binding and unbinding will work like a special GPU engine.
>>>+The binding and unbinding operations are serialized and will wait on specified
>>>+input fences before the operation and will signal the output fences upon the
>>>+completion of the operation. Due to serialization, completion of an operation
>>>+will also indicate that all previous operations are also complete.
>>>+
>>>+VM_BIND features include:
>>>+
>>>+* Multiple Virtual Address (VA) mappings can map to the same physical pages
>>>+  of an object (aliasing).
>>>+* VA mapping can map to a partial section of the BO (partial binding).
>>>+* Support capture of persistent mappings in the dump upon GPU error.
>>>+* TLB is flushed upon unbind completion. Batching of TLB flushes in some
>>>+  use cases will be helpful.
>>>+* Asynchronous vm_bind and vm_unbind support with 'in' and 'out' fences.
>>>+* Support for userptr gem objects (no special uapi is required for this).
>>>+
>>>+Execbuff ioctl in VM_BIND mode
>>>+-------------------------------
>>>+The execbuff ioctl handling in VM_BIND mode differs significantly from the
>>>+older method. A VM in VM_BIND mode will not support older execbuff mode of
>>>+binding. In VM_BIND mode, execbuff ioctl will not accept any execlist. Hence,
>>>+no support for implicit sync. It is expected that the below work will be able
>>>+to support requirements of object dependency setting in all use cases:
>>>+
>>>+"dma-buf: Add an API for exporting sync files"
>>>+(https://lwn.net/Articles/859290/)
>>
>>I would really like to have more details here. The link provided points
>>to new ioctls and we're not very familiar with those yet, so I think
>>you should really clarify the interaction between the new additions
>>here. Having some sample code would be really nice too.
>>
>>For Mesa at least (and I believe for the other drivers too) we always
>>have a few exported buffers in every execbuf call, and we rely on the
>>implicit synchronization provided by execbuf to make sure everything
>>works. The execbuf ioctl also has some code to flush caches during
>>implicit synchronization AFAIR, so I would guess we rely on it too and
>>whatever else the Kernel does. Is that covered by the new ioctls?
>>
>>In addition, as far as I remember, one of the big improvements of
>>vm_bind was that it would help reduce ioctl latency and cpu overhead.
>>But if making execbuf faster comes at the cost of requiring additional
>>ioctls calls for implicit synchronization, which is required on ever
>>execbuf call, then I wonder if we'll even get any faster at all.
>>Comparing old execbuf vs plain new execbuf without the new required
>>ioctls won't make sense.
>>
>>But maybe I'm wrong and we won't need to call these new ioctls around
>>every single execbuf ioctl we submit? Again, more clarification and
>>some code examples here would be really nice. This is a big change on
>>an important part of the API, we should clarify the new expected usage.
>>
>
>Thanks Paulo for the comments.
>
>In VM_BIND mode, the only reason we would need execlist support in
>execbuff path is for implicit synchronization. And AFAIK, this work
>from Jason is expected replace implict synchronization with new ioctls.
>Hence, VM_BIND mode will not be needing execlist support at all.
>
>Based on comments from Daniel and my offline sync with Jason, this
>new mechanism from Jason is expected work for vl. For gl, there is a
>question of whether it will be performant or not. But it is worth trying
>that first. If it is not performant for gl, then only we can consider
>adding implicit sync support back for VM_BIND mode.
>
>Daniel, Jason, Ken, any thoughts you can add here?

CC'ing Ken.

>
>>>+
>>>+This also means, we need an execbuff extension to pass in the batch
>>>+buffer addresses (See struct drm_i915_gem_execbuffer_ext_batch_addresses).
>>>+
>>>+If at all execlist support in execbuff ioctl is deemed necessary for
>>>+implicit sync in certain use cases, then support can be added later.
>>
>>IMHO we really need to sort this and check all the assumptions before
>>we commit to any interface. Again, implicit synchronization is
>>something we rely on during *every* execbuf ioctl for most workloads.
>>
>
>Daniel's earlier feedback was that it is worth Mesa trying this new
>mechanism for gl and see it that works. We want to avoid supporting
>execlist support for implicit sync in vm_bind mode from the beginning
>if it is going to be deemed not necessary.
>
>>
>>>+In VM_BIND mode, VA allocation is completely managed by the user instead of
>>>+the i915 driver. Hence all VA assignment, eviction are not applicable in
>>>+VM_BIND mode. Also, for determining object activeness, VM_BIND mode will not
>>>+be using the i915_vma active reference tracking. It will instead use dma-resv
>>>+object for that (See `VM_BIND dma_resv usage`_).
>>>+
>>>+So, a lot of existing code in the execbuff path like relocations, VA evictions,
>>>+vma lookup table, implicit sync, vma active reference tracking etc., are not
>>>+applicable in VM_BIND mode. Hence, the execbuff path needs to be cleaned up
>>>+by clearly separating out the functionalities where the VM_BIND mode differs
>>>+from older method and they should be moved to separate files.
>>
>>I seem to recall some conversations where we were told a bunch of
>>ioctls would stop working or make no sense to call when using vm_bind.
>>Can we please get a complete list of those? Bonus points if the Kernel
>>starts telling us we just called something that makes no sense.
>>
>
>Which ioctls you are talking about here?
>We do not support GEM_WAIT ioctls, but that is only for compute mode (which is
>already documented in this patch).
>
>>>+
>>>+VM_PRIVATE objects
>>>+-------------------
>>>+By default, BOs can be mapped on multiple VMs and can also be dma-buf
>>>+exported. Hence these BOs are referred to as Shared BOs.
>>>+During each execbuff submission, the request fence must be added to the
>>>+dma-resv fence list of all shared BOs mapped on the VM.
>>>+
>>>+VM_BIND feature introduces an optimization where user can create BO which
>>>+is private to a specified VM via I915_GEM_CREATE_EXT_VM_PRIVATE flag during
>>>+BO creation. Unlike Shared BOs, these VM private BOs can only be mapped on
>>>+the VM they are private to and can't be dma-buf exported.
>>>+All private BOs of a VM share the dma-resv object. Hence during each execbuff
>>>+submission, they need only one dma-resv fence list updated. Thus, the fast
>>>+path (where required mappings are already bound) submission latency is O(1)
>>>+w.r.t the number of VM private BOs.
>>
>>I know we already discussed this, but just to document it publicly: the
>>ideal case for user space would be that every BO is created as private
>>but then we'd have an ioctl to convert it to non-private (without the
>>need to have a non-private->private interface).
>>
>>An explanation on why we can't have an ioctl to mark as exported a
>>buffer that was previously vm_private would be really appreciated.
>>
>
>Ok, I can some notes on that.
>The reason being the fact that this require changing the dma-resv object
>for gem object, hence the object locking also. This will add complications
>as we have to sync with any pending operations. It might be easier for
>UMDs to do it themselves by copying the object contexts to a new object.
>
>Niranjana
>
>>Thanks,
>>Paulo
>>
>>
>>>+
>>>+VM_BIND locking hirarchy
>>>+-------------------------
>>>+The locking design here supports the older (execlist based) execbuff mode, the
>>>+newer VM_BIND mode, the VM_BIND mode with GPU page faults and possible future
>>>+system allocator support (See `Shared Virtual Memory (SVM) support`_).
>>>+The older execbuff mode and the newer VM_BIND mode without page faults manages
>>>+residency of backing storage using dma_fence. The VM_BIND mode with page faults
>>>+and the system allocator support do not use any dma_fence at all.
>>>+
>>>+VM_BIND locking order is as below.
>>>+
>>>+1) Lock-A: A vm_bind mutex will protect vm_bind lists. This lock is taken in
>>>+   vm_bind/vm_unbind ioctl calls, in the execbuff path and while releasing the
>>>+   mapping.
>>>+
>>>+   In future, when GPU page faults are supported, we can potentially use a
>>>+   rwsem instead, so that multiple page fault handlers can take the read side
>>>+   lock to lookup the mapping and hence can run in parallel.
>>>+   The older execbuff mode of binding do not need this lock.
>>>+
>>>+2) Lock-B: The object's dma-resv lock will protect i915_vma state and needs to
>>>+   be held while binding/unbinding a vma in the async worker and while updating
>>>+   dma-resv fence list of an object. Note that private BOs of a VM will all
>>>+   share a dma-resv object.
>>>+
>>>+   The future system allocator support will use the HMM prescribed locking
>>>+   instead.
>>>+
>>>+3) Lock-C: Spinlock/s to protect some of the VM's lists like the list of
>>>+   invalidated vmas (due to eviction and userptr invalidation) etc.
>>>+
>>>+When GPU page faults are supported, the execbuff path do not take any of these
>>>+locks. There we will simply smash the new batch buffer address into the ring and
>>>+then tell the scheduler run that. The lock taking only happens from the page
>>>+fault handler, where we take lock-A in read mode, whichever lock-B we need to
>>>+find the backing storage (dma_resv lock for gem objects, and hmm/core mm for
>>>+system allocator) and some additional locks (lock-D) for taking care of page
>>>+table races. Page fault mode should not need to ever manipulate the vm lists,
>>>+so won't ever need lock-C.
>>>+
>>>+VM_BIND LRU handling
>>>+---------------------
>>>+We need to ensure VM_BIND mapped objects are properly LRU tagged to avoid
>>>+performance degradation. We will also need support for bulk LRU movement of
>>>+VM_BIND objects to avoid additional latencies in execbuff path.
>>>+
>>>+The page table pages are similar to VM_BIND mapped objects (See
>>>+`Evictable page table allocations`_) and are maintained per VM and needs to
>>>+be pinned in memory when VM is made active (ie., upon an execbuff call with
>>>+that VM). So, bulk LRU movement of page table pages is also needed.
>>>+
>>>+The i915 shrinker LRU has stopped being an LRU. So, it should also be moved
>>>+over to the ttm LRU in some fashion to make sure we once again have a reasonable
>>>+and consistent memory aging and reclaim architecture.
>>>+
>>>+VM_BIND dma_resv usage
>>>+-----------------------
>>>+Fences needs to be added to all VM_BIND mapped objects. During each execbuff
>>>+submission, they are added with DMA_RESV_USAGE_BOOKKEEP usage to prevent
>>>+over sync (See enum dma_resv_usage). One can override it with either
>>>+DMA_RESV_USAGE_READ or DMA_RESV_USAGE_WRITE usage during object dependency
>>>+setting (either through explicit or implicit mechanism).
>>>+
>>>+When vm_bind is called for a non-private object while the VM is already
>>>+active, the fences need to be copied from VM's shared dma-resv object
>>>+(common to all private objects of the VM) to this non-private object.
>>>+If this results in performance degradation, then some optimization will
>>>+be needed here. This is not a problem for VM's private objects as they use
>>>+shared dma-resv object which is always updated on each execbuff submission.
>>>+
>>>+Also, in VM_BIND mode, use dma-resv apis for determining object activeness
>>>+(See dma_resv_test_signaled() and dma_resv_wait_timeout()) and do not use the
>>>+older i915_vma active reference tracking which is deprecated. This should be
>>>+easier to get it working with the current TTM backend. We can remove the
>>>+i915_vma active reference tracking fully while supporting TTM backend for igfx.
>>>+
>>>+Evictable page table allocations
>>>+---------------------------------
>>>+Make pagetable allocations evictable and manage them similar to VM_BIND
>>>+mapped objects. Page table pages are similar to persistent mappings of a
>>>+VM (difference here are that the page table pages will not have an i915_vma
>>>+structure and after swapping pages back in, parent page link needs to be
>>>+updated).
>>>+
>>>+Mesa use case
>>>+--------------
>>>+VM_BIND can potentially reduce the CPU overhead in Mesa (both Vulkan and Iris),
>>>+hence improving performance of CPU-bound applications. It also allows us to
>>>+implement Vulkan's Sparse Resources. With increasing GPU hardware performance,
>>>+reducing CPU overhead becomes more impactful.
>>>+
>>>+
>>>+VM_BIND Compute support
>>>+========================
>>>+
>>>+User/Memory Fence
>>>+------------------
>>>+The idea is to take a user specified virtual address and install an interrupt
>>>+handler to wake up the current task when the memory location passes the user
>>>+supplied filter. User/Memory fence is a <address, value> pair. To signal the
>>>+user fence, specified value will be written at the specified virtual address
>>>+and wakeup the waiting process. User can wait on a user fence with the
>>>+gem_wait_user_fence ioctl.
>>>+
>>>+It also allows the user to emit their own MI_FLUSH/PIPE_CONTROL notify
>>>+interrupt within their batches after updating the value to have sub-batch
>>>+precision on the wakeup. Each batch can signal a user fence to indicate
>>>+the completion of next level batch. The completion of very first level batch
>>>+needs to be signaled by the command streamer. The user must provide the
>>>+user/memory fence for this via the DRM_I915_GEM_EXECBUFFER_EXT_USER_FENCE
>>>+extension of execbuff ioctl, so that KMD can setup the command streamer to
>>>+signal it.
>>>+
>>>+User/Memory fence can also be supplied to the kernel driver to signal/wake up
>>>+the user process after completion of an asynchronous operation.
>>>+
>>>+When VM_BIND ioctl was provided with a user/memory fence via the
>>>+I915_VM_BIND_EXT_USER_FENCE extension, it will be signaled upon the completion
>>>+of binding of that mapping. All async binds/unbinds are serialized, hence
>>>+signaling of user/memory fence also indicate the completion of all previous
>>>+binds/unbinds.
>>>+
>>>+This feature will be derived from the below original work:
>>>+https://patchwork.freedesktop.org/patch/349417/
>>>+
>>>+Long running Compute contexts
>>>+------------------------------
>>>+Usage of dma-fence expects that they complete in reasonable amount of time.
>>>+Compute on the other hand can be long running. Hence it is appropriate for
>>>+compute to use user/memory fence and dma-fence usage will be limited to
>>>+in-kernel consumption only. This requires an execbuff uapi extension to pass
>>>+in user fence (See struct drm_i915_vm_bind_ext_user_fence). Compute must opt-in
>>>+for this mechanism with I915_CONTEXT_CREATE_FLAGS_LONG_RUNNING flag during
>>>+context creation. The dma-fence based user interfaces like gem_wait ioctl and
>>>+execbuff out fence are not allowed on long running contexts. Implicit sync is
>>>+not valid as well and is anyway not supported in VM_BIND mode.
>>>+
>>>+Where GPU page faults are not available, kernel driver upon buffer invalidation
>>>+will initiate a suspend (preemption) of long running context with a dma-fence
>>>+attached to it. And upon completion of that suspend fence, finish the
>>>+invalidation, revalidate the BO and then resume the compute context. This is
>>>+done by having a per-context preempt fence (also called suspend fence) proxying
>>>+as i915_request fence. This suspend fence is enabled when someone tries to wait
>>>+on it, which then triggers the context preemption.
>>>+
>>>+As this support for context suspension using a preempt fence and the resume work
>>>+for the compute mode contexts can get tricky to get it right, it is better to
>>>+add this support in drm scheduler so that multiple drivers can make use of it.
>>>+That means, it will have a dependency on i915 drm scheduler conversion with GuC
>>>+scheduler backend. This should be fine, as the plan is to support compute mode
>>>+contexts only with GuC scheduler backend (at least initially). This is much
>>>+easier to support with VM_BIND mode compared to the current heavier execbuff
>>>+path resource attachment.
>>>+
>>>+Low Latency Submission
>>>+-----------------------
>>>+Allows compute UMD to directly submit GPU jobs instead of through execbuff
>>>+ioctl. This is made possible by VM_BIND is not being synchronized against
>>>+execbuff. VM_BIND allows bind/unbind of mappings required for the directly
>>>+submitted jobs.
>>>+
>>>+Other VM_BIND use cases
>>>+========================
>>>+
>>>+Debugger
>>>+---------
>>>+With debug event interface user space process (debugger) is able to keep track
>>>+of and act upon resources created by another process (debugged) and attached
>>>+to GPU via vm_bind interface.
>>>+
>>>+GPU page faults
>>>+----------------
>>>+GPU page faults when supported (in future), will only be supported in the
>>>+VM_BIND mode. While both the older execbuff mode and the newer VM_BIND mode of
>>>+binding will require using dma-fence to ensure residency, the GPU page faults
>>>+mode when supported, will not use any dma-fence as residency is purely managed
>>>+by installing and removing/invalidating page table entries.
>>>+
>>>+Page level hints settings
>>>+--------------------------
>>>+VM_BIND allows any hints setting per mapping instead of per BO.
>>>+Possible hints include read-only mapping, placement and atomicity.
>>>+Sub-BO level placement hint will be even more relevant with
>>>+upcoming GPU on-demand page fault support.
>>>+
>>>+Page level Cache/CLOS settings
>>>+-------------------------------
>>>+VM_BIND allows cache/CLOS settings per mapping instead of per BO.
>>>+
>>>+Shared Virtual Memory (SVM) support
>>>+------------------------------------
>>>+VM_BIND interface can be used to map system memory directly (without gem BO
>>>+abstraction) using the HMM interface. SVM is only supported with GPU page
>>>+faults enabled.
>>>+
>>>+
>>>+Broder i915 cleanups
>>>+=====================
>>>+Supporting this whole new vm_bind mode of binding which comes with its own
>>>+use cases to support and the locking requirements requires proper integration
>>>+with the existing i915 driver. This calls for some broader i915 driver
>>>+cleanups/simplifications for maintainability of the driver going forward.
>>>+Here are few things identified and are being looked into.
>>>+
>>>+- Remove vma lookup cache (eb->gem_context->handles_vma). VM_BIND feature
>>>+  do not use it and complexity it brings in is probably more than the
>>>+  performance advantage we get in legacy execbuff case.
>>>+- Remove vma->open_count counting
>>>+- Remove i915_vma active reference tracking. VM_BIND feature will not be using
>>>+  it. Instead use underlying BO's dma-resv fence list to determine if a i915_vma
>>>+  is active or not.
>>>+
>>>+
>>>+VM_BIND UAPI
>>>+=============
>>>+
>>>+.. kernel-doc:: Documentation/gpu/rfc/i915_vm_bind.h
>>>diff --git a/Documentation/gpu/rfc/index.rst b/Documentation/gpu/rfc/index.rst
>>>index 91e93a705230..7d10c36b268d 100644
>>>--- a/Documentation/gpu/rfc/index.rst
>>>+++ b/Documentation/gpu/rfc/index.rst
>>>@@ -23,3 +23,7 @@ host such documentation:
>>> .. toctree::
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>     i915_scheduler.rst
>>>+
>>>+.. toctree::
>>>+
>>>+    i915_vm_bind.rst
>>

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [Intel-gfx] [RFC v3 3/3] drm/doc/rfc: VM_BIND uapi definition
  2022-05-19 23:07   ` [Intel-gfx] " Zanoni, Paulo R
@ 2022-05-23 19:19     ` Niranjana Vishwanathapura
  2022-06-01  9:02       ` Dave Airlie
  0 siblings, 1 reply; 121+ messages in thread
From: Niranjana Vishwanathapura @ 2022-05-23 19:19 UTC (permalink / raw)
  To: Zanoni, Paulo R
  Cc: intel-gfx, dri-devel, Hellstrom, Thomas, Wilson, Chris P, Vetter,
	Daniel, christian.koenig

On Thu, May 19, 2022 at 04:07:30PM -0700, Zanoni, Paulo R wrote:
>On Tue, 2022-05-17 at 11:32 -0700, Niranjana Vishwanathapura wrote:
>> VM_BIND and related uapi definitions
>>
>> v2: Ensure proper kernel-doc formatting with cross references.
>>     Also add new uapi and documentation as per review comments
>>     from Daniel.
>>
>> Signed-off-by: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>
>> ---
>>  Documentation/gpu/rfc/i915_vm_bind.h | 399 +++++++++++++++++++++++++++
>>  1 file changed, 399 insertions(+)
>>  create mode 100644 Documentation/gpu/rfc/i915_vm_bind.h
>>
>> diff --git a/Documentation/gpu/rfc/i915_vm_bind.h b/Documentation/gpu/rfc/i915_vm_bind.h
>> new file mode 100644
>> index 000000000000..589c0a009107
>> --- /dev/null
>> +++ b/Documentation/gpu/rfc/i915_vm_bind.h
>> @@ -0,0 +1,399 @@
>> +/* SPDX-License-Identifier: MIT */
>> +/*
>> + * Copyright © 2022 Intel Corporation
>> + */
>> +
>> +/**
>> + * DOC: I915_PARAM_HAS_VM_BIND
>> + *
>> + * VM_BIND feature availability.
>> + * See typedef drm_i915_getparam_t param.
>> + */
>> +#define I915_PARAM_HAS_VM_BIND               57
>> +
>> +/**
>> + * DOC: I915_VM_CREATE_FLAGS_USE_VM_BIND
>> + *
>> + * Flag to opt-in for VM_BIND mode of binding during VM creation.
>> + * See struct drm_i915_gem_vm_control flags.
>> + *
>> + * A VM in VM_BIND mode will not support the older execbuff mode of binding.
>> + * In VM_BIND mode, execbuff ioctl will not accept any execlist (ie., the
>> + * &drm_i915_gem_execbuffer2.buffer_count must be 0).
>> + * Also, &drm_i915_gem_execbuffer2.batch_start_offset and
>> + * &drm_i915_gem_execbuffer2.batch_len must be 0.
>> + * DRM_I915_GEM_EXECBUFFER_EXT_BATCH_ADDRESSES extension must be provided
>> + * to pass in the batch buffer addresses.
>> + *
>> + * Additionally, I915_EXEC_NO_RELOC, I915_EXEC_HANDLE_LUT and
>> + * I915_EXEC_BATCH_FIRST of &drm_i915_gem_execbuffer2.flags must be 0
>> + * (not used) in VM_BIND mode. I915_EXEC_USE_EXTENSIONS flag must always be
>> + * set (See struct drm_i915_gem_execbuffer_ext_batch_addresses).
>> + * The buffers_ptr, buffer_count, batch_start_offset and batch_len fields
>> + * of struct drm_i915_gem_execbuffer2 are also not used and must be 0.
>> + */
>
>From that description, it seems we have:
>
>struct drm_i915_gem_execbuffer2 {
>        __u64 buffers_ptr;              -> must be 0 (new)
>        __u32 buffer_count;             -> must be 0 (new)
>        __u32 batch_start_offset;       -> must be 0 (new)
>        __u32 batch_len;                -> must be 0 (new)
>        __u32 DR1;                      -> must be 0 (old)
>        __u32 DR4;                      -> must be 0 (old)
>        __u32 num_cliprects; (fences)   -> must be 0 since using extensions
>        __u64 cliprects_ptr; (fences, extensions) -> contains an actual pointer!
>        __u64 flags;                    -> some flags must be 0 (new)
>        __u64 rsvd1; (context info)     -> repurposed field (old)
>        __u64 rsvd2;                    -> unused
>};
>
>Based on that, why can't we just get drm_i915_gem_execbuffer3 instead
>of adding even more complexity to an already abused interface? While
>the Vulkan-like extension thing is really nice, I don't think what
>we're doing here is extending the ioctl usage, we're completely
>changing how the base struct should be interpreted based on how the VM
>was created (which is an entirely different ioctl).
>
>From Rusty Russel's API Design grading, drm_i915_gem_execbuffer2 is
>already at -6 without these changes. I think after vm_bind we'll need
>to create a -11 entry just to deal with this ioctl.
>

The only change here is removing the execlist support for VM_BIND
mode (other than natual extensions).
Adding a new execbuffer3 was considered, but I think we need to be careful
with that as that goes beyond the VM_BIND support, including any future
requirements (as we don't want an execbuffer4 after VM_BIND).

Niranjana

>
>+#define I915_VM_CREATE_FLAGS_USE_VM_BIND       (1 << 0)
>+
>+/**
>+ * DOC: I915_CONTEXT_CREATE_FLAGS_LONG_RUNNING
>+ *
>+ * Flag to declare context as long running.
>+ * See struct drm_i915_gem_context_create_ext flags.
>+ *
>+ * Usage of dma-fence expects that they complete in reasonable amount of time.
>+ * Compute on the other hand can be long running. Hence it is not appropriate
>+ * for compute contexts to export request completion dma-fence to user.
>+ * The dma-fence usage will be limited to in-kernel consumption only.
>+ * Compute contexts need to use user/memory fence.
>+ *
>+ * So, long running contexts do not support output fences. Hence,
>+ * I915_EXEC_FENCE_OUT (See &drm_i915_gem_execbuffer2.flags and
>+ * I915_EXEC_FENCE_SIGNAL (See &drm_i915_gem_exec_fence.flags) are expected
>+ * to be not used.
>+ *
>+ * DRM_I915_GEM_WAIT ioctl call is also not supported for objects mapped
>+ * to long running contexts.
>+ */
>+#define I915_CONTEXT_CREATE_FLAGS_LONG_RUNNING   (1u << 2)
>+
>+/* VM_BIND related ioctls */
>+#define DRM_I915_GEM_VM_BIND           0x3d
>+#define DRM_I915_GEM_VM_UNBIND         0x3e
>+#define DRM_I915_GEM_WAIT_USER_FENCE   0x3f
>+
>+#define DRM_IOCTL_I915_GEM_VM_BIND             DRM_IOWR(DRM_COMMAND_BASE + DRM_I915_GEM_VM_BIND, struct drm_i915_gem_vm_bind)
>+#define DRM_IOCTL_I915_GEM_VM_UNBIND           DRM_IOWR(DRM_COMMAND_BASE + DRM_I915_GEM_VM_UNBIND, struct drm_i915_gem_vm_bind)
>+#define DRM_IOCTL_I915_GEM_WAIT_USER_FENCE     DRM_IOWR(DRM_COMMAND_BASE + DRM_I915_GEM_WAIT_USER_FENCE, struct drm_i915_gem_wait_user_fence)
>+
>+/**
>+ * struct drm_i915_gem_vm_bind - VA to object mapping to bind.
>+ *
>+ * This structure is passed to VM_BIND ioctl and specifies the mapping of GPU
>+ * virtual address (VA) range to the section of an object that should be bound
>+ * in the device page table of the specified address space (VM).
>+ * The VA range specified must be unique (ie., not currently bound) and can
>+ * be mapped to whole object or a section of the object (partial binding).
>+ * Multiple VA mappings can be created to the same section of the object
>+ * (aliasing).
>+ */
>+struct drm_i915_gem_vm_bind {
>+       /** @vm_id: VM (address space) id to bind */
>+       __u32 vm_id;
>+
>+       /** @handle: Object handle */
>+       __u32 handle;
>+
>+       /** @start: Virtual Address start to bind */
>+       __u64 start;
>+
>+       /** @offset: Offset in object to bind */
>+       __u64 offset;
>+
>+       /** @length: Length of mapping to bind */
>+       __u64 length;
>+
>+       /**
>+        * @flags: Supported flags are,
>+        *
>+        * I915_GEM_VM_BIND_READONLY:
>+        * Mapping is read-only.
>+        *
>+        * I915_GEM_VM_BIND_CAPTURE:
>+        * Capture this mapping in the dump upon GPU error.
>+        */
>+       __u64 flags;
>+#define I915_GEM_VM_BIND_READONLY    (1 << 0)
>+#define I915_GEM_VM_BIND_CAPTURE     (1 << 1)
>+
>+       /** @extensions: 0-terminated chain of extensions for this mapping. */
>+       __u64 extensions;
>+};
>+
>+/**
>+ * struct drm_i915_gem_vm_unbind - VA to object mapping to unbind.
>+ *
>+ * This structure is passed to VM_UNBIND ioctl and specifies the GPU virtual
>+ * address (VA) range that should be unbound from the device page table of the
>+ * specified address space (VM). The specified VA range must match one of the
>+ * mappings created with the VM_BIND ioctl. TLB is flushed upon unbind
>+ * completion.
>+ */
>+struct drm_i915_gem_vm_unbind {
>+       /** @vm_id: VM (address space) id to bind */
>+       __u32 vm_id;
>+
>+       /** @rsvd: Reserved for future use; must be zero. */
>+       __u32 rsvd;
>+
>+       /** @start: Virtual Address start to unbind */
>+       __u64 start;
>+
>+       /** @length: Length of mapping to unbind */
>+       __u64 length;
>+
>+       /** @flags: reserved for future usage, currently MBZ */
>+       __u64 flags;
>+
>+       /** @extensions: 0-terminated chain of extensions for this mapping. */
>+       __u64 extensions;
>+};
>+
>+/**
>+ * struct drm_i915_vm_bind_fence - An input or output fence for the vm_bind
>+ * or the vm_unbind work.
>+ *
>+ * The vm_bind or vm_unbind aync worker will wait for input fence to signal
>+ * before starting the binding or unbinding.
>+ *
>+ * The vm_bind or vm_unbind async worker will signal the returned output fence
>+ * after the completion of binding or unbinding.
>+ */
>+struct drm_i915_vm_bind_fence {
>+       /** @handle: User's handle for a drm_syncobj to wait on or signal. */
>+       __u32 handle;
>+
>+       /**
>+        * @flags: Supported flags are,
>+        *
>+        * I915_VM_BIND_FENCE_WAIT:
>+        * Wait for the input fence before binding/unbinding
>+        *
>+        * I915_VM_BIND_FENCE_SIGNAL:
>+        * Return bind/unbind completion fence as output
>+        */
>+       __u32 flags;
>+#define I915_VM_BIND_FENCE_WAIT            (1<<0)
>+#define I915_VM_BIND_FENCE_SIGNAL          (1<<1)
>+#define __I915_VM_BIND_FENCE_UNKNOWN_FLAGS (-(I915_VM_BIND_FENCE_SIGNAL << 1))
>+};
>+
>+/**
>+ * struct drm_i915_vm_bind_ext_timeline_fences - Timeline fences for vm_bind
>+ * and vm_unbind.
>+ *
>+ * This structure describes an array of timeline drm_syncobj and associated
>+ * points for timeline variants of drm_syncobj. These timeline 'drm_syncobj's
>+ * can be input or output fences (See struct drm_i915_vm_bind_fence).
>+ */
>+struct drm_i915_vm_bind_ext_timeline_fences {
>+#define I915_VM_BIND_EXT_timeline_FENCES       0
>+       /** @base: Extension link. See struct i915_user_extension. */
>+       struct i915_user_extension base;
>+
>+       /**
>+        * @fence_count: Number of elements in the @handles_ptr & @value_ptr
>+        * arrays.
>+        */
>+       __u64 fence_count;
>+
>+       /**
>+        * @handles_ptr: Pointer to an array of struct drm_i915_vm_bind_fence
>+        * of length @fence_count.
>+        */
>+       __u64 handles_ptr;
>+
>+       /**
>+        * @values_ptr: Pointer to an array of u64 values of length
>+        * @fence_count.
>+        * Values must be 0 for a binary drm_syncobj. A Value of 0 for a
>+        * timeline drm_syncobj is invalid as it turns a drm_syncobj into a
>+        * binary one.
>+        */
>+       __u64 values_ptr;
>+};
>+
>+/**
>+ * struct drm_i915_vm_bind_user_fence - An input or output user fence for the
>+ * vm_bind or the vm_unbind work.
>+ *
>+ * The vm_bind or vm_unbind aync worker will wait for the input fence (value at
>+ * @addr to become equal to @val) before starting the binding or unbinding.
>+ *
>+ * The vm_bind or vm_unbind async worker will signal the output fence after
>+ * the completion of binding or unbinding by writing @val to memory location at
>+ * @addr
>+ */
>+struct drm_i915_vm_bind_user_fence {
>+       /** @addr: User/Memory fence qword aligned process virtual address */
>+       __u64 addr;
>+
>+       /** @val: User/Memory fence value to be written after bind completion */
>+       __u64 val;
>+
>+       /**
>+        * @flags: Supported flags are,
>+        *
>+        * I915_VM_BIND_USER_FENCE_WAIT:
>+        * Wait for the input fence before binding/unbinding
>+        *
>+        * I915_VM_BIND_USER_FENCE_SIGNAL:
>+        * Return bind/unbind completion fence as output
>+        */
>+       __u32 flags;
>+#define I915_VM_BIND_USER_FENCE_WAIT            (1<<0)
>+#define I915_VM_BIND_USER_FENCE_SIGNAL          (1<<1)
>+#define __I915_VM_BIND_USER_FENCE_UNKNOWN_FLAGS \
>+       (-(I915_VM_BIND_USER_FENCE_SIGNAL << 1))
>+};
>+
>+/**
>+ * struct drm_i915_vm_bind_ext_user_fence - User/memory fences for vm_bind
>+ * and vm_unbind.
>+ *
>+ * These user fences can be input or output fences
>+ * (See struct drm_i915_vm_bind_user_fence).
>+ */
>+struct drm_i915_vm_bind_ext_user_fence {
>+#define I915_VM_BIND_EXT_USER_FENCES   1
>+       /** @base: Extension link. See struct i915_user_extension. */
>+       struct i915_user_extension base;
>+
>+       /** @fence_count: Number of elements in the @user_fence_ptr array. */
>+       __u64 fence_count;
>+
>+       /**
>+        * @user_fence_ptr: Pointer to an array of
>+        * struct drm_i915_vm_bind_user_fence of length @fence_count.
>+        */
>+       __u64 user_fence_ptr;
>+};
>+
>+/**
>+ * struct drm_i915_gem_execbuffer_ext_batch_addresses - Array of batch buffer
>+ * gpu virtual addresses.
>+ *
>+ * In the execbuff ioctl (See struct drm_i915_gem_execbuffer2), this extension
>+ * must always be appended in the VM_BIND mode and it will be an error to
>+ * append this extension in older non-VM_BIND mode.
>+ */
>+struct drm_i915_gem_execbuffer_ext_batch_addresses {
>+#define DRM_I915_GEM_EXECBUFFER_EXT_BATCH_ADDRESSES    1
>+       /** @base: Extension link. See struct i915_user_extension. */
>+       struct i915_user_extension base;
>+
>+       /** @count: Number of addresses in the addr array. */
>+       __u32 count;
>+
>+       /** @addr: An array of batch gpu virtual addresses. */
>+       __u64 addr[0];
>+};
>+
>+/**
>+ * struct drm_i915_gem_execbuffer_ext_user_fence - First level batch completion
>+ * signaling extension.
>+ *
>+ * This extension allows user to attach a user fence (@addr, @value pair) to an
>+ * execbuf to be signaled by the command streamer after the completion of first
>+ * level batch, by writing the @value at specified @addr and triggering an
>+ * interrupt.
>+ * User can either poll for this user fence to signal or can also wait on it
>+ * with i915_gem_wait_user_fence ioctl.
>+ * This is very much usefaul for long running contexts where waiting on dma-fence
>+ * by user (like i915_gem_wait ioctl) is not supported.
>+ */
>+struct drm_i915_gem_execbuffer_ext_user_fence {
>+#define DRM_I915_GEM_EXECBUFFER_EXT_USER_FENCE         2
>+       /** @base: Extension link. See struct i915_user_extension. */
>+       struct i915_user_extension base;
>+
>+       /**
>+        * @addr: User/Memory fence qword aligned GPU virtual address.
>+        *
>+        * Address has to be a valid GPU virtual address at the time of
>+        * first level batch completion.
>+        */
>+       __u64 addr;
>+
>+       /**
>+        * @value: User/Memory fence Value to be written to above address
>+        * after first level batch completes.
>+        */
>+       __u64 value;
>+
>+       /** @rsvd: Reserved for future extensions, MBZ */
>+       __u64 rsvd;
>+};
>+
>+/**
>+ * struct drm_i915_gem_create_ext_vm_private - Extension to make the object
>+ * private to the specified VM.
>+ *
>+ * See struct drm_i915_gem_create_ext.
>+ */
>+struct drm_i915_gem_create_ext_vm_private {
>+#define I915_GEM_CREATE_EXT_VM_PRIVATE         2
>+       /** @base: Extension link. See struct i915_user_extension. */
>+       struct i915_user_extension base;
>+
>+       /** @vm_id: Id of the VM to which the object is private */
>+       __u32 vm_id;
>+};
>+
>+/**
>+ * struct drm_i915_gem_wait_user_fence - Wait on user/memory fence.
>+ *
>+ * User/Memory fence can be woken up either by:
>+ *
>+ * 1. GPU context indicated by @ctx_id, or,
>+ * 2. Kerrnel driver async worker upon I915_UFENCE_WAIT_SOFT.
>+ *    @ctx_id is ignored when this flag is set.
>+ *
>+ * Wakeup condition is,
>+ * ``((*addr & mask) op (value & mask))``
>+ *
>+ * See :ref:`Documentation/driver-api/dma-buf.rst <indefinite_dma_fences>`
>+ */
>+struct drm_i915_gem_wait_user_fence {
>+       /** @extensions: Zero-terminated chain of extensions. */
>+       __u64 extensions;
>+
>+       /** @addr: User/Memory fence address */
>+       __u64 addr;
>+
>+       /** @ctx_id: Id of the Context which will signal the fence. */
>+       __u32 ctx_id;
>+
>+       /** @op: Wakeup condition operator */
>+       __u16 op;
>+#define I915_UFENCE_WAIT_EQ      0
>+#define I915_UFENCE_WAIT_NEQ     1
>+#define I915_UFENCE_WAIT_GT      2
>+#define I915_UFENCE_WAIT_GTE     3
>+#define I915_UFENCE_WAIT_LT      4
>+#define I915_UFENCE_WAIT_LTE     5
>+#define I915_UFENCE_WAIT_BEFORE  6
>+#define I915_UFENCE_WAIT_AFTER   7
>+
>+       /**
>+        * @flags: Supported flags are,
>+        *
>+        * I915_UFENCE_WAIT_SOFT:
>+        *
>+        * To be woken up by i915 driver async worker (not by GPU).
>+        *
>+        * I915_UFENCE_WAIT_ABSTIME:
>+        *
>+        * Wait timeout specified as absolute time.
>+        */
>+       __u16 flags;
>+#define I915_UFENCE_WAIT_SOFT    0x1
>+#define I915_UFENCE_WAIT_ABSTIME 0x2
>+
>+       /** @value: Wakeup value */
>+       __u64 value;
>+
>+       /** @mask: Wakeup mask */
>+       __u64 mask;
>+#define I915_UFENCE_WAIT_U8     0xffu
>+#define I915_UFENCE_WAIT_U16    0xffffu
>+#define I915_UFENCE_WAIT_U32    0xfffffffful
>+#define I915_UFENCE_WAIT_U64    0xffffffffffffffffull
>+
>+       /**
>+        * @timeout: Wait timeout in nanoseconds.
>+        *
>+        * If I915_UFENCE_WAIT_ABSTIME flag is set, then time timeout is the
>+        * absolute time in nsec.
>+        */
>+       __s64 timeout;
>+};
>

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [Intel-gfx] [RFC v3 1/3] drm/doc/rfc: VM_BIND feature design document
  2022-05-19 22:52     ` [Intel-gfx] " Zanoni, Paulo R
  (?)
  (?)
@ 2022-05-24 10:08     ` Lionel Landwerlin
  -1 siblings, 0 replies; 121+ messages in thread
From: Lionel Landwerlin @ 2022-05-24 10:08 UTC (permalink / raw)
  To: Zanoni, Paulo R, dri-devel, Vetter, Daniel, Vishwanathapura,
	Niranjana, intel-gfx
  Cc: Hellstrom, Thomas, christian.koenig, Wilson, Chris P

On 20/05/2022 01:52, Zanoni, Paulo R wrote:
> On Tue, 2022-05-17 at 11:32 -0700, Niranjana Vishwanathapura wrote:
>> VM_BIND design document with description of intended use cases.
>>
>> v2: Add more documentation and format as per review comments
>>      from Daniel.
>>
>> Signed-off-by: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>
>> ---
>>
>> diff --git a/Documentation/gpu/rfc/i915_vm_bind.rst b/Documentation/gpu/rfc/i915_vm_bind.rst
>> new file mode 100644
>> index 000000000000..f1be560d313c
>> --- /dev/null
>> +++ b/Documentation/gpu/rfc/i915_vm_bind.rst
>> @@ -0,0 +1,304 @@
>> +==========================================
>> +I915 VM_BIND feature design and use cases
>> +==========================================
>> +
>> +VM_BIND feature
>> +================
>> +DRM_I915_GEM_VM_BIND/UNBIND ioctls allows UMD to bind/unbind GEM buffer
>> +objects (BOs) or sections of a BOs at specified GPU virtual addresses on a
>> +specified address space (VM). These mappings (also referred to as persistent
>> +mappings) will be persistent across multiple GPU submissions (execbuff calls)
>> +issued by the UMD, without user having to provide a list of all required
>> +mappings during each submission (as required by older execbuff mode).
>> +
>> +VM_BIND/UNBIND ioctls will support 'in' and 'out' fences to allow userpace
>> +to specify how the binding/unbinding should sync with other operations
>> +like the GPU job submission. These fences will be timeline 'drm_syncobj's
>> +for non-Compute contexts (See struct drm_i915_vm_bind_ext_timeline_fences).
>> +For Compute contexts, they will be user/memory fences (See struct
>> +drm_i915_vm_bind_ext_user_fence).
>> +
>> +VM_BIND feature is advertised to user via I915_PARAM_HAS_VM_BIND.
>> +User has to opt-in for VM_BIND mode of binding for an address space (VM)
>> +during VM creation time via I915_VM_CREATE_FLAGS_USE_VM_BIND extension.
>> +
>> +VM_BIND/UNBIND ioctl will immediately start binding/unbinding the mapping in an
>> +async worker. The binding and unbinding will work like a special GPU engine.
>> +The binding and unbinding operations are serialized and will wait on specified
>> +input fences before the operation and will signal the output fences upon the
>> +completion of the operation. Due to serialization, completion of an operation
>> +will also indicate that all previous operations are also complete.
>> +
>> +VM_BIND features include:
>> +
>> +* Multiple Virtual Address (VA) mappings can map to the same physical pages
>> +  of an object (aliasing).
>> +* VA mapping can map to a partial section of the BO (partial binding).
>> +* Support capture of persistent mappings in the dump upon GPU error.
>> +* TLB is flushed upon unbind completion. Batching of TLB flushes in some
>> +  use cases will be helpful.
>> +* Asynchronous vm_bind and vm_unbind support with 'in' and 'out' fences.
>> +* Support for userptr gem objects (no special uapi is required for this).
>> +
>> +Execbuff ioctl in VM_BIND mode
>> +-------------------------------
>> +The execbuff ioctl handling in VM_BIND mode differs significantly from the
>> +older method. A VM in VM_BIND mode will not support older execbuff mode of
>> +binding. In VM_BIND mode, execbuff ioctl will not accept any execlist. Hence,
>> +no support for implicit sync. It is expected that the below work will be able
>> +to support requirements of object dependency setting in all use cases:
>> +
>> +"dma-buf: Add an API for exporting sync files"
>> +(https://lwn.net/Articles/859290/)
> I would really like to have more details here. The link provided points
> to new ioctls and we're not very familiar with those yet, so I think
> you should really clarify the interaction between the new additions
> here. Having some sample code would be really nice too.
>
> For Mesa at least (and I believe for the other drivers too) we always
> have a few exported buffers in every execbuf call, and we rely on the
> implicit synchronization provided by execbuf to make sure everything
> works. The execbuf ioctl also has some code to flush caches during
> implicit synchronization AFAIR, so I would guess we rely on it too and
> whatever else the Kernel does. Is that covered by the new ioctls?
>
> In addition, as far as I remember, one of the big improvements of
> vm_bind was that it would help reduce ioctl latency and cpu overhead.
> But if making execbuf faster comes at the cost of requiring additional
> ioctls calls for implicit synchronization, which is required on ever
> execbuf call, then I wonder if we'll even get any faster at all.
> Comparing old execbuf vs plain new execbuf without the new required
> ioctls won't make sense.
> But maybe I'm wrong and we won't need to call these new ioctls around
> every single execbuf ioctl we submit? Again, more clarification and
> some code examples here would be really nice. This is a big change on
> an important part of the API, we should clarify the new expected usage.


Hey Paulo,


I think in the case of X11/Wayland, we'll be doing 1 or 2 extra ioctls 
per frame which seems pretty reasonable.

Essentially we need to set the dependencies on the buffer we´re going to 
tell the display engine (gnome-shell/kde/bare-display-hw) to use.


In the Vulkan case, we're trading building execbuffer lists of 
potentially thousands of buffers for every single submission versus 1 or 
2 ioctls for a single item when doing vkQueuePresent() (which happens 
less often than we do execbuffer ioctls).

That seems like a good trade off and doesn't look like a lot more work 
than explicit fencing where we would have to send associated fences.


Here is the Mesa MR associated with this : 
https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/4037


-Lionel


>
>> +
>> +This also means, we need an execbuff extension to pass in the batch
>> +buffer addresses (See struct drm_i915_gem_execbuffer_ext_batch_addresses).
>> +
>> +If at all execlist support in execbuff ioctl is deemed necessary for
>> +implicit sync in certain use cases, then support can be added later.
> IMHO we really need to sort this and check all the assumptions before
> we commit to any interface. Again, implicit synchronization is
> something we rely on during *every* execbuf ioctl for most workloads.
>
>
>> +In VM_BIND mode, VA allocation is completely managed by the user instead of
>> +the i915 driver. Hence all VA assignment, eviction are not applicable in
>> +VM_BIND mode. Also, for determining object activeness, VM_BIND mode will not
>> +be using the i915_vma active reference tracking. It will instead use dma-resv
>> +object for that (See `VM_BIND dma_resv usage`_).
>> +
>> +So, a lot of existing code in the execbuff path like relocations, VA evictions,
>> +vma lookup table, implicit sync, vma active reference tracking etc., are not
>> +applicable in VM_BIND mode. Hence, the execbuff path needs to be cleaned up
>> +by clearly separating out the functionalities where the VM_BIND mode differs
>> +from older method and they should be moved to separate files.
> I seem to recall some conversations where we were told a bunch of
> ioctls would stop working or make no sense to call when using vm_bind.
> Can we please get a complete list of those? Bonus points if the Kernel
> starts telling us we just called something that makes no sense.
>
>> +
>> +VM_PRIVATE objects
>> +-------------------
>> +By default, BOs can be mapped on multiple VMs and can also be dma-buf
>> +exported. Hence these BOs are referred to as Shared BOs.
>> +During each execbuff submission, the request fence must be added to the
>> +dma-resv fence list of all shared BOs mapped on the VM.
>> +
>> +VM_BIND feature introduces an optimization where user can create BO which
>> +is private to a specified VM via I915_GEM_CREATE_EXT_VM_PRIVATE flag during
>> +BO creation. Unlike Shared BOs, these VM private BOs can only be mapped on
>> +the VM they are private to and can't be dma-buf exported.
>> +All private BOs of a VM share the dma-resv object. Hence during each execbuff
>> +submission, they need only one dma-resv fence list updated. Thus, the fast
>> +path (where required mappings are already bound) submission latency is O(1)
>> +w.r.t the number of VM private BOs.
> I know we already discussed this, but just to document it publicly: the
> ideal case for user space would be that every BO is created as private
> but then we'd have an ioctl to convert it to non-private (without the
> need to have a non-private->private interface).
>
> An explanation on why we can't have an ioctl to mark as exported a
> buffer that was previously vm_private would be really appreciated.
>
> Thanks,
> Paulo
>
>
>> +
>> +VM_BIND locking hirarchy
>> +-------------------------
>> +The locking design here supports the older (execlist based) execbuff mode, the
>> +newer VM_BIND mode, the VM_BIND mode with GPU page faults and possible future
>> +system allocator support (See `Shared Virtual Memory (SVM) support`_).
>> +The older execbuff mode and the newer VM_BIND mode without page faults manages
>> +residency of backing storage using dma_fence. The VM_BIND mode with page faults
>> +and the system allocator support do not use any dma_fence at all.
>> +
>> +VM_BIND locking order is as below.
>> +
>> +1) Lock-A: A vm_bind mutex will protect vm_bind lists. This lock is taken in
>> +   vm_bind/vm_unbind ioctl calls, in the execbuff path and while releasing the
>> +   mapping.
>> +
>> +   In future, when GPU page faults are supported, we can potentially use a
>> +   rwsem instead, so that multiple page fault handlers can take the read side
>> +   lock to lookup the mapping and hence can run in parallel.
>> +   The older execbuff mode of binding do not need this lock.
>> +
>> +2) Lock-B: The object's dma-resv lock will protect i915_vma state and needs to
>> +   be held while binding/unbinding a vma in the async worker and while updating
>> +   dma-resv fence list of an object. Note that private BOs of a VM will all
>> +   share a dma-resv object.
>> +
>> +   The future system allocator support will use the HMM prescribed locking
>> +   instead.
>> +
>> +3) Lock-C: Spinlock/s to protect some of the VM's lists like the list of
>> +   invalidated vmas (due to eviction and userptr invalidation) etc.
>> +
>> +When GPU page faults are supported, the execbuff path do not take any of these
>> +locks. There we will simply smash the new batch buffer address into the ring and
>> +then tell the scheduler run that. The lock taking only happens from the page
>> +fault handler, where we take lock-A in read mode, whichever lock-B we need to
>> +find the backing storage (dma_resv lock for gem objects, and hmm/core mm for
>> +system allocator) and some additional locks (lock-D) for taking care of page
>> +table races. Page fault mode should not need to ever manipulate the vm lists,
>> +so won't ever need lock-C.
>> +
>> +VM_BIND LRU handling
>> +---------------------
>> +We need to ensure VM_BIND mapped objects are properly LRU tagged to avoid
>> +performance degradation. We will also need support for bulk LRU movement of
>> +VM_BIND objects to avoid additional latencies in execbuff path.
>> +
>> +The page table pages are similar to VM_BIND mapped objects (See
>> +`Evictable page table allocations`_) and are maintained per VM and needs to
>> +be pinned in memory when VM is made active (ie., upon an execbuff call with
>> +that VM). So, bulk LRU movement of page table pages is also needed.
>> +
>> +The i915 shrinker LRU has stopped being an LRU. So, it should also be moved
>> +over to the ttm LRU in some fashion to make sure we once again have a reasonable
>> +and consistent memory aging and reclaim architecture.
>> +
>> +VM_BIND dma_resv usage
>> +-----------------------
>> +Fences needs to be added to all VM_BIND mapped objects. During each execbuff
>> +submission, they are added with DMA_RESV_USAGE_BOOKKEEP usage to prevent
>> +over sync (See enum dma_resv_usage). One can override it with either
>> +DMA_RESV_USAGE_READ or DMA_RESV_USAGE_WRITE usage during object dependency
>> +setting (either through explicit or implicit mechanism).
>> +
>> +When vm_bind is called for a non-private object while the VM is already
>> +active, the fences need to be copied from VM's shared dma-resv object
>> +(common to all private objects of the VM) to this non-private object.
>> +If this results in performance degradation, then some optimization will
>> +be needed here. This is not a problem for VM's private objects as they use
>> +shared dma-resv object which is always updated on each execbuff submission.
>> +
>> +Also, in VM_BIND mode, use dma-resv apis for determining object activeness
>> +(See dma_resv_test_signaled() and dma_resv_wait_timeout()) and do not use the
>> +older i915_vma active reference tracking which is deprecated. This should be
>> +easier to get it working with the current TTM backend. We can remove the
>> +i915_vma active reference tracking fully while supporting TTM backend for igfx.
>> +
>> +Evictable page table allocations
>> +---------------------------------
>> +Make pagetable allocations evictable and manage them similar to VM_BIND
>> +mapped objects. Page table pages are similar to persistent mappings of a
>> +VM (difference here are that the page table pages will not have an i915_vma
>> +structure and after swapping pages back in, parent page link needs to be
>> +updated).
>> +
>> +Mesa use case
>> +--------------
>> +VM_BIND can potentially reduce the CPU overhead in Mesa (both Vulkan and Iris),
>> +hence improving performance of CPU-bound applications. It also allows us to
>> +implement Vulkan's Sparse Resources. With increasing GPU hardware performance,
>> +reducing CPU overhead becomes more impactful.
>> +
>> +
>> +VM_BIND Compute support
>> +========================
>> +
>> +User/Memory Fence
>> +------------------
>> +The idea is to take a user specified virtual address and install an interrupt
>> +handler to wake up the current task when the memory location passes the user
>> +supplied filter. User/Memory fence is a <address, value> pair. To signal the
>> +user fence, specified value will be written at the specified virtual address
>> +and wakeup the waiting process. User can wait on a user fence with the
>> +gem_wait_user_fence ioctl.
>> +
>> +It also allows the user to emit their own MI_FLUSH/PIPE_CONTROL notify
>> +interrupt within their batches after updating the value to have sub-batch
>> +precision on the wakeup. Each batch can signal a user fence to indicate
>> +the completion of next level batch. The completion of very first level batch
>> +needs to be signaled by the command streamer. The user must provide the
>> +user/memory fence for this via the DRM_I915_GEM_EXECBUFFER_EXT_USER_FENCE
>> +extension of execbuff ioctl, so that KMD can setup the command streamer to
>> +signal it.
>> +
>> +User/Memory fence can also be supplied to the kernel driver to signal/wake up
>> +the user process after completion of an asynchronous operation.
>> +
>> +When VM_BIND ioctl was provided with a user/memory fence via the
>> +I915_VM_BIND_EXT_USER_FENCE extension, it will be signaled upon the completion
>> +of binding of that mapping. All async binds/unbinds are serialized, hence
>> +signaling of user/memory fence also indicate the completion of all previous
>> +binds/unbinds.
>> +
>> +This feature will be derived from the below original work:
>> +https://patchwork.freedesktop.org/patch/349417/
>> +
>> +Long running Compute contexts
>> +------------------------------
>> +Usage of dma-fence expects that they complete in reasonable amount of time.
>> +Compute on the other hand can be long running. Hence it is appropriate for
>> +compute to use user/memory fence and dma-fence usage will be limited to
>> +in-kernel consumption only. This requires an execbuff uapi extension to pass
>> +in user fence (See struct drm_i915_vm_bind_ext_user_fence). Compute must opt-in
>> +for this mechanism with I915_CONTEXT_CREATE_FLAGS_LONG_RUNNING flag during
>> +context creation. The dma-fence based user interfaces like gem_wait ioctl and
>> +execbuff out fence are not allowed on long running contexts. Implicit sync is
>> +not valid as well and is anyway not supported in VM_BIND mode.
>> +
>> +Where GPU page faults are not available, kernel driver upon buffer invalidation
>> +will initiate a suspend (preemption) of long running context with a dma-fence
>> +attached to it. And upon completion of that suspend fence, finish the
>> +invalidation, revalidate the BO and then resume the compute context. This is
>> +done by having a per-context preempt fence (also called suspend fence) proxying
>> +as i915_request fence. This suspend fence is enabled when someone tries to wait
>> +on it, which then triggers the context preemption.
>> +
>> +As this support for context suspension using a preempt fence and the resume work
>> +for the compute mode contexts can get tricky to get it right, it is better to
>> +add this support in drm scheduler so that multiple drivers can make use of it.
>> +That means, it will have a dependency on i915 drm scheduler conversion with GuC
>> +scheduler backend. This should be fine, as the plan is to support compute mode
>> +contexts only with GuC scheduler backend (at least initially). This is much
>> +easier to support with VM_BIND mode compared to the current heavier execbuff
>> +path resource attachment.
>> +
>> +Low Latency Submission
>> +-----------------------
>> +Allows compute UMD to directly submit GPU jobs instead of through execbuff
>> +ioctl. This is made possible by VM_BIND is not being synchronized against
>> +execbuff. VM_BIND allows bind/unbind of mappings required for the directly
>> +submitted jobs.
>> +
>> +Other VM_BIND use cases
>> +========================
>> +
>> +Debugger
>> +---------
>> +With debug event interface user space process (debugger) is able to keep track
>> +of and act upon resources created by another process (debugged) and attached
>> +to GPU via vm_bind interface.
>> +
>> +GPU page faults
>> +----------------
>> +GPU page faults when supported (in future), will only be supported in the
>> +VM_BIND mode. While both the older execbuff mode and the newer VM_BIND mode of
>> +binding will require using dma-fence to ensure residency, the GPU page faults
>> +mode when supported, will not use any dma-fence as residency is purely managed
>> +by installing and removing/invalidating page table entries.
>> +
>> +Page level hints settings
>> +--------------------------
>> +VM_BIND allows any hints setting per mapping instead of per BO.
>> +Possible hints include read-only mapping, placement and atomicity.
>> +Sub-BO level placement hint will be even more relevant with
>> +upcoming GPU on-demand page fault support.
>> +
>> +Page level Cache/CLOS settings
>> +-------------------------------
>> +VM_BIND allows cache/CLOS settings per mapping instead of per BO.
>> +
>> +Shared Virtual Memory (SVM) support
>> +------------------------------------
>> +VM_BIND interface can be used to map system memory directly (without gem BO
>> +abstraction) using the HMM interface. SVM is only supported with GPU page
>> +faults enabled.
>> +
>> +
>> +Broder i915 cleanups
>> +=====================
>> +Supporting this whole new vm_bind mode of binding which comes with its own
>> +use cases to support and the locking requirements requires proper integration
>> +with the existing i915 driver. This calls for some broader i915 driver
>> +cleanups/simplifications for maintainability of the driver going forward.
>> +Here are few things identified and are being looked into.
>> +
>> +- Remove vma lookup cache (eb->gem_context->handles_vma). VM_BIND feature
>> +  do not use it and complexity it brings in is probably more than the
>> +  performance advantage we get in legacy execbuff case.
>> +- Remove vma->open_count counting
>> +- Remove i915_vma active reference tracking. VM_BIND feature will not be using
>> +  it. Instead use underlying BO's dma-resv fence list to determine if a i915_vma
>> +  is active or not.
>> +
>> +
>> +VM_BIND UAPI
>> +=============
>> +
>> +.. kernel-doc:: Documentation/gpu/rfc/i915_vm_bind.h
>> diff --git a/Documentation/gpu/rfc/index.rst b/Documentation/gpu/rfc/index.rst
>> index 91e93a705230..7d10c36b268d 100644
>> --- a/Documentation/gpu/rfc/index.rst
>> +++ b/Documentation/gpu/rfc/index.rst
>> @@ -23,3 +23,7 @@ host such documentation:
>>   .. toctree::
>>   
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>       i915_scheduler.rst
>> +
>> +.. toctree::
>> +
>> +    i915_vm_bind.rst



^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [Intel-gfx] [RFC v3 3/3] drm/doc/rfc: VM_BIND uapi definition
  2022-05-23 19:19     ` Niranjana Vishwanathapura
@ 2022-06-01  9:02       ` Dave Airlie
  2022-06-01  9:27           ` Daniel Vetter
  0 siblings, 1 reply; 121+ messages in thread
From: Dave Airlie @ 2022-06-01  9:02 UTC (permalink / raw)
  To: Niranjana Vishwanathapura
  Cc: Zanoni, Paulo R, intel-gfx, dri-devel, Hellstrom, Thomas, Wilson,
	Chris P, Vetter, Daniel, christian.koenig

On Tue, 24 May 2022 at 05:20, Niranjana Vishwanathapura
<niranjana.vishwanathapura@intel.com> wrote:
>
> On Thu, May 19, 2022 at 04:07:30PM -0700, Zanoni, Paulo R wrote:
> >On Tue, 2022-05-17 at 11:32 -0700, Niranjana Vishwanathapura wrote:
> >> VM_BIND and related uapi definitions
> >>
> >> v2: Ensure proper kernel-doc formatting with cross references.
> >>     Also add new uapi and documentation as per review comments
> >>     from Daniel.
> >>
> >> Signed-off-by: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>
> >> ---
> >>  Documentation/gpu/rfc/i915_vm_bind.h | 399 +++++++++++++++++++++++++++
> >>  1 file changed, 399 insertions(+)
> >>  create mode 100644 Documentation/gpu/rfc/i915_vm_bind.h
> >>
> >> diff --git a/Documentation/gpu/rfc/i915_vm_bind.h b/Documentation/gpu/rfc/i915_vm_bind.h
> >> new file mode 100644
> >> index 000000000000..589c0a009107
> >> --- /dev/null
> >> +++ b/Documentation/gpu/rfc/i915_vm_bind.h
> >> @@ -0,0 +1,399 @@
> >> +/* SPDX-License-Identifier: MIT */
> >> +/*
> >> + * Copyright © 2022 Intel Corporation
> >> + */
> >> +
> >> +/**
> >> + * DOC: I915_PARAM_HAS_VM_BIND
> >> + *
> >> + * VM_BIND feature availability.
> >> + * See typedef drm_i915_getparam_t param.
> >> + */
> >> +#define I915_PARAM_HAS_VM_BIND               57
> >> +
> >> +/**
> >> + * DOC: I915_VM_CREATE_FLAGS_USE_VM_BIND
> >> + *
> >> + * Flag to opt-in for VM_BIND mode of binding during VM creation.
> >> + * See struct drm_i915_gem_vm_control flags.
> >> + *
> >> + * A VM in VM_BIND mode will not support the older execbuff mode of binding.
> >> + * In VM_BIND mode, execbuff ioctl will not accept any execlist (ie., the
> >> + * &drm_i915_gem_execbuffer2.buffer_count must be 0).
> >> + * Also, &drm_i915_gem_execbuffer2.batch_start_offset and
> >> + * &drm_i915_gem_execbuffer2.batch_len must be 0.
> >> + * DRM_I915_GEM_EXECBUFFER_EXT_BATCH_ADDRESSES extension must be provided
> >> + * to pass in the batch buffer addresses.
> >> + *
> >> + * Additionally, I915_EXEC_NO_RELOC, I915_EXEC_HANDLE_LUT and
> >> + * I915_EXEC_BATCH_FIRST of &drm_i915_gem_execbuffer2.flags must be 0
> >> + * (not used) in VM_BIND mode. I915_EXEC_USE_EXTENSIONS flag must always be
> >> + * set (See struct drm_i915_gem_execbuffer_ext_batch_addresses).
> >> + * The buffers_ptr, buffer_count, batch_start_offset and batch_len fields
> >> + * of struct drm_i915_gem_execbuffer2 are also not used and must be 0.
> >> + */
> >
> >From that description, it seems we have:
> >
> >struct drm_i915_gem_execbuffer2 {
> >        __u64 buffers_ptr;              -> must be 0 (new)
> >        __u32 buffer_count;             -> must be 0 (new)
> >        __u32 batch_start_offset;       -> must be 0 (new)
> >        __u32 batch_len;                -> must be 0 (new)
> >        __u32 DR1;                      -> must be 0 (old)
> >        __u32 DR4;                      -> must be 0 (old)
> >        __u32 num_cliprects; (fences)   -> must be 0 since using extensions
> >        __u64 cliprects_ptr; (fences, extensions) -> contains an actual pointer!
> >        __u64 flags;                    -> some flags must be 0 (new)
> >        __u64 rsvd1; (context info)     -> repurposed field (old)
> >        __u64 rsvd2;                    -> unused
> >};
> >
> >Based on that, why can't we just get drm_i915_gem_execbuffer3 instead
> >of adding even more complexity to an already abused interface? While
> >the Vulkan-like extension thing is really nice, I don't think what
> >we're doing here is extending the ioctl usage, we're completely
> >changing how the base struct should be interpreted based on how the VM
> >was created (which is an entirely different ioctl).
> >
> >From Rusty Russel's API Design grading, drm_i915_gem_execbuffer2 is
> >already at -6 without these changes. I think after vm_bind we'll need
> >to create a -11 entry just to deal with this ioctl.
> >
>
> The only change here is removing the execlist support for VM_BIND
> mode (other than natual extensions).
> Adding a new execbuffer3 was considered, but I think we need to be careful
> with that as that goes beyond the VM_BIND support, including any future
> requirements (as we don't want an execbuffer4 after VM_BIND).

Why not? it's not like adding extensions here is really that different
than adding new ioctls.

I definitely think this deserves an execbuffer3 without even
considering future requirements. Just  to burn down the old
requirements and pointless fields.

Make execbuffer3 be vm bind only, no relocs, no legacy bits, leave the
older sw on execbuf2 for ever.

Dave.

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [Intel-gfx] [RFC v3 3/3] drm/doc/rfc: VM_BIND uapi definition
  2022-06-01  9:02       ` Dave Airlie
@ 2022-06-01  9:27           ` Daniel Vetter
  0 siblings, 0 replies; 121+ messages in thread
From: Daniel Vetter @ 2022-06-01  9:27 UTC (permalink / raw)
  To: Dave Airlie
  Cc: Zanoni, Paulo R, intel-gfx, dri-devel, Hellstrom, Thomas, Wilson,
	Chris P, Vetter, Daniel, Niranjana Vishwanathapura,
	christian.koenig

On Wed, 1 Jun 2022 at 11:03, Dave Airlie <airlied@gmail.com> wrote:
>
> On Tue, 24 May 2022 at 05:20, Niranjana Vishwanathapura
> <niranjana.vishwanathapura@intel.com> wrote:
> >
> > On Thu, May 19, 2022 at 04:07:30PM -0700, Zanoni, Paulo R wrote:
> > >On Tue, 2022-05-17 at 11:32 -0700, Niranjana Vishwanathapura wrote:
> > >> VM_BIND and related uapi definitions
> > >>
> > >> v2: Ensure proper kernel-doc formatting with cross references.
> > >>     Also add new uapi and documentation as per review comments
> > >>     from Daniel.
> > >>
> > >> Signed-off-by: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>
> > >> ---
> > >>  Documentation/gpu/rfc/i915_vm_bind.h | 399 +++++++++++++++++++++++++++
> > >>  1 file changed, 399 insertions(+)
> > >>  create mode 100644 Documentation/gpu/rfc/i915_vm_bind.h
> > >>
> > >> diff --git a/Documentation/gpu/rfc/i915_vm_bind.h b/Documentation/gpu/rfc/i915_vm_bind.h
> > >> new file mode 100644
> > >> index 000000000000..589c0a009107
> > >> --- /dev/null
> > >> +++ b/Documentation/gpu/rfc/i915_vm_bind.h
> > >> @@ -0,0 +1,399 @@
> > >> +/* SPDX-License-Identifier: MIT */
> > >> +/*
> > >> + * Copyright © 2022 Intel Corporation
> > >> + */
> > >> +
> > >> +/**
> > >> + * DOC: I915_PARAM_HAS_VM_BIND
> > >> + *
> > >> + * VM_BIND feature availability.
> > >> + * See typedef drm_i915_getparam_t param.
> > >> + */
> > >> +#define I915_PARAM_HAS_VM_BIND               57
> > >> +
> > >> +/**
> > >> + * DOC: I915_VM_CREATE_FLAGS_USE_VM_BIND
> > >> + *
> > >> + * Flag to opt-in for VM_BIND mode of binding during VM creation.
> > >> + * See struct drm_i915_gem_vm_control flags.
> > >> + *
> > >> + * A VM in VM_BIND mode will not support the older execbuff mode of binding.
> > >> + * In VM_BIND mode, execbuff ioctl will not accept any execlist (ie., the
> > >> + * &drm_i915_gem_execbuffer2.buffer_count must be 0).
> > >> + * Also, &drm_i915_gem_execbuffer2.batch_start_offset and
> > >> + * &drm_i915_gem_execbuffer2.batch_len must be 0.
> > >> + * DRM_I915_GEM_EXECBUFFER_EXT_BATCH_ADDRESSES extension must be provided
> > >> + * to pass in the batch buffer addresses.
> > >> + *
> > >> + * Additionally, I915_EXEC_NO_RELOC, I915_EXEC_HANDLE_LUT and
> > >> + * I915_EXEC_BATCH_FIRST of &drm_i915_gem_execbuffer2.flags must be 0
> > >> + * (not used) in VM_BIND mode. I915_EXEC_USE_EXTENSIONS flag must always be
> > >> + * set (See struct drm_i915_gem_execbuffer_ext_batch_addresses).
> > >> + * The buffers_ptr, buffer_count, batch_start_offset and batch_len fields
> > >> + * of struct drm_i915_gem_execbuffer2 are also not used and must be 0.
> > >> + */
> > >
> > >From that description, it seems we have:
> > >
> > >struct drm_i915_gem_execbuffer2 {
> > >        __u64 buffers_ptr;              -> must be 0 (new)
> > >        __u32 buffer_count;             -> must be 0 (new)
> > >        __u32 batch_start_offset;       -> must be 0 (new)
> > >        __u32 batch_len;                -> must be 0 (new)
> > >        __u32 DR1;                      -> must be 0 (old)
> > >        __u32 DR4;                      -> must be 0 (old)
> > >        __u32 num_cliprects; (fences)   -> must be 0 since using extensions
> > >        __u64 cliprects_ptr; (fences, extensions) -> contains an actual pointer!
> > >        __u64 flags;                    -> some flags must be 0 (new)
> > >        __u64 rsvd1; (context info)     -> repurposed field (old)
> > >        __u64 rsvd2;                    -> unused
> > >};
> > >
> > >Based on that, why can't we just get drm_i915_gem_execbuffer3 instead
> > >of adding even more complexity to an already abused interface? While
> > >the Vulkan-like extension thing is really nice, I don't think what
> > >we're doing here is extending the ioctl usage, we're completely
> > >changing how the base struct should be interpreted based on how the VM
> > >was created (which is an entirely different ioctl).
> > >
> > >From Rusty Russel's API Design grading, drm_i915_gem_execbuffer2 is
> > >already at -6 without these changes. I think after vm_bind we'll need
> > >to create a -11 entry just to deal with this ioctl.
> > >
> >
> > The only change here is removing the execlist support for VM_BIND
> > mode (other than natual extensions).
> > Adding a new execbuffer3 was considered, but I think we need to be careful
> > with that as that goes beyond the VM_BIND support, including any future
> > requirements (as we don't want an execbuffer4 after VM_BIND).
>
> Why not? it's not like adding extensions here is really that different
> than adding new ioctls.
>
> I definitely think this deserves an execbuffer3 without even
> considering future requirements. Just  to burn down the old
> requirements and pointless fields.
>
> Make execbuffer3 be vm bind only, no relocs, no legacy bits, leave the
> older sw on execbuf2 for ever.

I guess another point in favour of execbuf3 would be that it's less
midlayer. If we share the entry point then there's quite a few vfuncs
needed to cleanly split out the vm_bind paths from the legacy
reloc/softping paths.

If we invert this and do execbuf3, then there's the existing ioctl
vfunc, and then we share code (where it even makes sense, probably
request setup/submit need to be shared, anything else is probably
cleaner to just copypaste) with the usual helper approach.

Also that would guarantee that really none of the old concepts like
i915_active on the vma or vma open counts and all that stuff leaks
into the new vm_bind execbuf.

Finally I also think that copypasting would make backporting easier,
or at least more flexible, since it should make it easier to have the
upstream vm_bind co-exist with all the other things we have. Without
huge amounts of conflicts (or at least much less) that pushing a pile
of vfuncs into the existing code would cause.

So maybe we should do this?
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [Intel-gfx] [RFC v3 3/3] drm/doc/rfc: VM_BIND uapi definition
@ 2022-06-01  9:27           ` Daniel Vetter
  0 siblings, 0 replies; 121+ messages in thread
From: Daniel Vetter @ 2022-06-01  9:27 UTC (permalink / raw)
  To: Dave Airlie
  Cc: Zanoni, Paulo R, intel-gfx, dri-devel, Hellstrom, Thomas, Wilson,
	Chris P, Vetter, Daniel, christian.koenig

On Wed, 1 Jun 2022 at 11:03, Dave Airlie <airlied@gmail.com> wrote:
>
> On Tue, 24 May 2022 at 05:20, Niranjana Vishwanathapura
> <niranjana.vishwanathapura@intel.com> wrote:
> >
> > On Thu, May 19, 2022 at 04:07:30PM -0700, Zanoni, Paulo R wrote:
> > >On Tue, 2022-05-17 at 11:32 -0700, Niranjana Vishwanathapura wrote:
> > >> VM_BIND and related uapi definitions
> > >>
> > >> v2: Ensure proper kernel-doc formatting with cross references.
> > >>     Also add new uapi and documentation as per review comments
> > >>     from Daniel.
> > >>
> > >> Signed-off-by: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>
> > >> ---
> > >>  Documentation/gpu/rfc/i915_vm_bind.h | 399 +++++++++++++++++++++++++++
> > >>  1 file changed, 399 insertions(+)
> > >>  create mode 100644 Documentation/gpu/rfc/i915_vm_bind.h
> > >>
> > >> diff --git a/Documentation/gpu/rfc/i915_vm_bind.h b/Documentation/gpu/rfc/i915_vm_bind.h
> > >> new file mode 100644
> > >> index 000000000000..589c0a009107
> > >> --- /dev/null
> > >> +++ b/Documentation/gpu/rfc/i915_vm_bind.h
> > >> @@ -0,0 +1,399 @@
> > >> +/* SPDX-License-Identifier: MIT */
> > >> +/*
> > >> + * Copyright © 2022 Intel Corporation
> > >> + */
> > >> +
> > >> +/**
> > >> + * DOC: I915_PARAM_HAS_VM_BIND
> > >> + *
> > >> + * VM_BIND feature availability.
> > >> + * See typedef drm_i915_getparam_t param.
> > >> + */
> > >> +#define I915_PARAM_HAS_VM_BIND               57
> > >> +
> > >> +/**
> > >> + * DOC: I915_VM_CREATE_FLAGS_USE_VM_BIND
> > >> + *
> > >> + * Flag to opt-in for VM_BIND mode of binding during VM creation.
> > >> + * See struct drm_i915_gem_vm_control flags.
> > >> + *
> > >> + * A VM in VM_BIND mode will not support the older execbuff mode of binding.
> > >> + * In VM_BIND mode, execbuff ioctl will not accept any execlist (ie., the
> > >> + * &drm_i915_gem_execbuffer2.buffer_count must be 0).
> > >> + * Also, &drm_i915_gem_execbuffer2.batch_start_offset and
> > >> + * &drm_i915_gem_execbuffer2.batch_len must be 0.
> > >> + * DRM_I915_GEM_EXECBUFFER_EXT_BATCH_ADDRESSES extension must be provided
> > >> + * to pass in the batch buffer addresses.
> > >> + *
> > >> + * Additionally, I915_EXEC_NO_RELOC, I915_EXEC_HANDLE_LUT and
> > >> + * I915_EXEC_BATCH_FIRST of &drm_i915_gem_execbuffer2.flags must be 0
> > >> + * (not used) in VM_BIND mode. I915_EXEC_USE_EXTENSIONS flag must always be
> > >> + * set (See struct drm_i915_gem_execbuffer_ext_batch_addresses).
> > >> + * The buffers_ptr, buffer_count, batch_start_offset and batch_len fields
> > >> + * of struct drm_i915_gem_execbuffer2 are also not used and must be 0.
> > >> + */
> > >
> > >From that description, it seems we have:
> > >
> > >struct drm_i915_gem_execbuffer2 {
> > >        __u64 buffers_ptr;              -> must be 0 (new)
> > >        __u32 buffer_count;             -> must be 0 (new)
> > >        __u32 batch_start_offset;       -> must be 0 (new)
> > >        __u32 batch_len;                -> must be 0 (new)
> > >        __u32 DR1;                      -> must be 0 (old)
> > >        __u32 DR4;                      -> must be 0 (old)
> > >        __u32 num_cliprects; (fences)   -> must be 0 since using extensions
> > >        __u64 cliprects_ptr; (fences, extensions) -> contains an actual pointer!
> > >        __u64 flags;                    -> some flags must be 0 (new)
> > >        __u64 rsvd1; (context info)     -> repurposed field (old)
> > >        __u64 rsvd2;                    -> unused
> > >};
> > >
> > >Based on that, why can't we just get drm_i915_gem_execbuffer3 instead
> > >of adding even more complexity to an already abused interface? While
> > >the Vulkan-like extension thing is really nice, I don't think what
> > >we're doing here is extending the ioctl usage, we're completely
> > >changing how the base struct should be interpreted based on how the VM
> > >was created (which is an entirely different ioctl).
> > >
> > >From Rusty Russel's API Design grading, drm_i915_gem_execbuffer2 is
> > >already at -6 without these changes. I think after vm_bind we'll need
> > >to create a -11 entry just to deal with this ioctl.
> > >
> >
> > The only change here is removing the execlist support for VM_BIND
> > mode (other than natual extensions).
> > Adding a new execbuffer3 was considered, but I think we need to be careful
> > with that as that goes beyond the VM_BIND support, including any future
> > requirements (as we don't want an execbuffer4 after VM_BIND).
>
> Why not? it's not like adding extensions here is really that different
> than adding new ioctls.
>
> I definitely think this deserves an execbuffer3 without even
> considering future requirements. Just  to burn down the old
> requirements and pointless fields.
>
> Make execbuffer3 be vm bind only, no relocs, no legacy bits, leave the
> older sw on execbuf2 for ever.

I guess another point in favour of execbuf3 would be that it's less
midlayer. If we share the entry point then there's quite a few vfuncs
needed to cleanly split out the vm_bind paths from the legacy
reloc/softping paths.

If we invert this and do execbuf3, then there's the existing ioctl
vfunc, and then we share code (where it even makes sense, probably
request setup/submit need to be shared, anything else is probably
cleaner to just copypaste) with the usual helper approach.

Also that would guarantee that really none of the old concepts like
i915_active on the vma or vma open counts and all that stuff leaks
into the new vm_bind execbuf.

Finally I also think that copypasting would make backporting easier,
or at least more flexible, since it should make it easier to have the
upstream vm_bind co-exist with all the other things we have. Without
huge amounts of conflicts (or at least much less) that pushing a pile
of vfuncs into the existing code would cause.

So maybe we should do this?
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [Intel-gfx] [RFC v3 1/3] drm/doc/rfc: VM_BIND feature design document
  2022-05-17 18:32   ` Niranjana Vishwanathapura
  (?)
  (?)
@ 2022-06-01 14:25   ` Lionel Landwerlin
  2022-06-01 20:28       ` Matthew Brost
  2022-06-01 21:18       ` Matthew Brost
  -1 siblings, 2 replies; 121+ messages in thread
From: Lionel Landwerlin @ 2022-06-01 14:25 UTC (permalink / raw)
  To: Niranjana Vishwanathapura, intel-gfx, dri-devel, daniel.vetter
  Cc: thomas.hellstrom, christian.koenig, chris.p.wilson

On 17/05/2022 21:32, Niranjana Vishwanathapura wrote:
> +VM_BIND/UNBIND ioctl will immediately start binding/unbinding the mapping in an
> +async worker. The binding and unbinding will work like a special GPU engine.
> +The binding and unbinding operations are serialized and will wait on specified
> +input fences before the operation and will signal the output fences upon the
> +completion of the operation. Due to serialization, completion of an operation
> +will also indicate that all previous operations are also complete.

I guess we should avoid saying "will immediately start 
binding/unbinding" if there are fences involved.

And the fact that it's happening in an async worker seem to imply it's 
not immediate.


I have a question on the behavior of the bind operation when no input 
fence is provided. Let say I do :

VM_BIND (out_fence=fence1)

VM_BIND (out_fence=fence2)

VM_BIND (out_fence=fence3)


In what order are the fences going to be signaled?

In the order of VM_BIND ioctls? Or out of order?

Because you wrote "serialized I assume it's : in order


One thing I didn't realize is that because we only get one "VM_BIND" 
engine, there is a disconnect from the Vulkan specification.

In Vulkan VM_BIND operations are serialized but per engine.

So you could have something like this :

VM_BIND (engine=rcs0, in_fence=fence1, out_fence=fence2)

VM_BIND (engine=ccs0, in_fence=fence3, out_fence=fence4)


fence1 is not signaled

fence3 is signaled

So the second VM_BIND will proceed before the first VM_BIND.


I guess we can deal with that scenario in userspace by doing the wait 
ourselves in one thread per engines.

But then it makes the VM_BIND input fences useless.


Daniel : what do you think? Should be rework this or just deal with wait 
fences in userspace?


Sorry I noticed this late.


-Lionel



^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [Intel-gfx] [RFC v3 1/3] drm/doc/rfc: VM_BIND feature design document
  2022-06-01 14:25   ` Lionel Landwerlin
@ 2022-06-01 20:28       ` Matthew Brost
  2022-06-01 21:18       ` Matthew Brost
  1 sibling, 0 replies; 121+ messages in thread
From: Matthew Brost @ 2022-06-01 20:28 UTC (permalink / raw)
  To: Lionel Landwerlin
  Cc: intel-gfx, dri-devel, thomas.hellstrom, chris.p.wilson,
	daniel.vetter, Niranjana Vishwanathapura, christian.koenig

On Wed, Jun 01, 2022 at 05:25:49PM +0300, Lionel Landwerlin wrote:
> On 17/05/2022 21:32, Niranjana Vishwanathapura wrote:
> > +VM_BIND/UNBIND ioctl will immediately start binding/unbinding the mapping in an
> > +async worker. The binding and unbinding will work like a special GPU engine.
> > +The binding and unbinding operations are serialized and will wait on specified
> > +input fences before the operation and will signal the output fences upon the
> > +completion of the operation. Due to serialization, completion of an operation
> > +will also indicate that all previous operations are also complete.
> 
> I guess we should avoid saying "will immediately start binding/unbinding" if
> there are fences involved.
> 
> And the fact that it's happening in an async worker seem to imply it's not
> immediate.
> 
> 
> I have a question on the behavior of the bind operation when no input fence
> is provided. Let say I do :
> 
> VM_BIND (out_fence=fence1)
> 
> VM_BIND (out_fence=fence2)
> 
> VM_BIND (out_fence=fence3)
> 
> 
> In what order are the fences going to be signaled?
> 
> In the order of VM_BIND ioctls? Or out of order?
> 
> Because you wrote "serialized I assume it's : in order
> 
> 
> One thing I didn't realize is that because we only get one "VM_BIND" engine,
> there is a disconnect from the Vulkan specification.
> 
> In Vulkan VM_BIND operations are serialized but per engine.
> 
> So you could have something like this :
> 
> VM_BIND (engine=rcs0, in_fence=fence1, out_fence=fence2)
> 
> VM_BIND (engine=ccs0, in_fence=fence3, out_fence=fence4)
> 
> 
> fence1 is not signaled
> 
> fence3 is signaled
> 
> So the second VM_BIND will proceed before the first VM_BIND.
> 
> 
> I guess we can deal with that scenario in userspace by doing the wait
> ourselves in one thread per engines.
> 
> But then it makes the VM_BIND input fences useless.
> 
> 
> Daniel : what do you think? Should be rework this or just deal with wait
> fences in userspace?
> 

My opinion is rework this but make the ordering via an engine param optional.

e.g. A VM can be configured so all binds are ordered within the VM

e.g. A VM can be configured so all binds accept an engine argument (in
the case of the i915 likely this is a gem context handle) and binds
ordered with respect to that engine.

This gives UMDs options as the later likely consumes more KMD resources
so if a different UMD can live with binds being ordered within the VM
they can use a mode consuming less resources.

Matt

> 
> Sorry I noticed this late.
> 
> 
> -Lionel
> 
> 

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [Intel-gfx] [RFC v3 1/3] drm/doc/rfc: VM_BIND feature design document
@ 2022-06-01 20:28       ` Matthew Brost
  0 siblings, 0 replies; 121+ messages in thread
From: Matthew Brost @ 2022-06-01 20:28 UTC (permalink / raw)
  To: Lionel Landwerlin
  Cc: intel-gfx, dri-devel, thomas.hellstrom, chris.p.wilson,
	daniel.vetter, christian.koenig

On Wed, Jun 01, 2022 at 05:25:49PM +0300, Lionel Landwerlin wrote:
> On 17/05/2022 21:32, Niranjana Vishwanathapura wrote:
> > +VM_BIND/UNBIND ioctl will immediately start binding/unbinding the mapping in an
> > +async worker. The binding and unbinding will work like a special GPU engine.
> > +The binding and unbinding operations are serialized and will wait on specified
> > +input fences before the operation and will signal the output fences upon the
> > +completion of the operation. Due to serialization, completion of an operation
> > +will also indicate that all previous operations are also complete.
> 
> I guess we should avoid saying "will immediately start binding/unbinding" if
> there are fences involved.
> 
> And the fact that it's happening in an async worker seem to imply it's not
> immediate.
> 
> 
> I have a question on the behavior of the bind operation when no input fence
> is provided. Let say I do :
> 
> VM_BIND (out_fence=fence1)
> 
> VM_BIND (out_fence=fence2)
> 
> VM_BIND (out_fence=fence3)
> 
> 
> In what order are the fences going to be signaled?
> 
> In the order of VM_BIND ioctls? Or out of order?
> 
> Because you wrote "serialized I assume it's : in order
> 
> 
> One thing I didn't realize is that because we only get one "VM_BIND" engine,
> there is a disconnect from the Vulkan specification.
> 
> In Vulkan VM_BIND operations are serialized but per engine.
> 
> So you could have something like this :
> 
> VM_BIND (engine=rcs0, in_fence=fence1, out_fence=fence2)
> 
> VM_BIND (engine=ccs0, in_fence=fence3, out_fence=fence4)
> 
> 
> fence1 is not signaled
> 
> fence3 is signaled
> 
> So the second VM_BIND will proceed before the first VM_BIND.
> 
> 
> I guess we can deal with that scenario in userspace by doing the wait
> ourselves in one thread per engines.
> 
> But then it makes the VM_BIND input fences useless.
> 
> 
> Daniel : what do you think? Should be rework this or just deal with wait
> fences in userspace?
> 

My opinion is rework this but make the ordering via an engine param optional.

e.g. A VM can be configured so all binds are ordered within the VM

e.g. A VM can be configured so all binds accept an engine argument (in
the case of the i915 likely this is a gem context handle) and binds
ordered with respect to that engine.

This gives UMDs options as the later likely consumes more KMD resources
so if a different UMD can live with binds being ordered within the VM
they can use a mode consuming less resources.

Matt

> 
> Sorry I noticed this late.
> 
> 
> -Lionel
> 
> 

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [Intel-gfx] [RFC v3 1/3] drm/doc/rfc: VM_BIND feature design document
  2022-06-01 14:25   ` Lionel Landwerlin
@ 2022-06-01 21:18       ` Matthew Brost
  2022-06-01 21:18       ` Matthew Brost
  1 sibling, 0 replies; 121+ messages in thread
From: Matthew Brost @ 2022-06-01 21:18 UTC (permalink / raw)
  To: Lionel Landwerlin
  Cc: intel-gfx, dri-devel, thomas.hellstrom, chris.p.wilson,
	daniel.vetter, Niranjana Vishwanathapura, christian.koenig

On Wed, Jun 01, 2022 at 05:25:49PM +0300, Lionel Landwerlin wrote:
> On 17/05/2022 21:32, Niranjana Vishwanathapura wrote:
> > +VM_BIND/UNBIND ioctl will immediately start binding/unbinding the mapping in an
> > +async worker. The binding and unbinding will work like a special GPU engine.
> > +The binding and unbinding operations are serialized and will wait on specified
> > +input fences before the operation and will signal the output fences upon the
> > +completion of the operation. Due to serialization, completion of an operation
> > +will also indicate that all previous operations are also complete.
> 
> I guess we should avoid saying "will immediately start binding/unbinding" if
> there are fences involved.
> 
> And the fact that it's happening in an async worker seem to imply it's not
> immediate.
> 
> 
> I have a question on the behavior of the bind operation when no input fence
> is provided. Let say I do :
> 
> VM_BIND (out_fence=fence1)
> 
> VM_BIND (out_fence=fence2)
> 
> VM_BIND (out_fence=fence3)
> 
> 
> In what order are the fences going to be signaled?
> 
> In the order of VM_BIND ioctls? Or out of order?
> 
> Because you wrote "serialized I assume it's : in order
> 
> 
> One thing I didn't realize is that because we only get one "VM_BIND" engine,
> there is a disconnect from the Vulkan specification.
> 
> In Vulkan VM_BIND operations are serialized but per engine.
> 
> So you could have something like this :
> 
> VM_BIND (engine=rcs0, in_fence=fence1, out_fence=fence2)
> 
> VM_BIND (engine=ccs0, in_fence=fence3, out_fence=fence4)
> 

Question - let's say this done after the above operations:

EXEC (engine=ccs0, in_fence=NULL, out_fence=NULL)

Is the exec ordered with respected to bind (i.e. would fence3 & 4 be
signaled before the exec starts)?

Matt

> 
> fence1 is not signaled
> 
> fence3 is signaled
> 
> So the second VM_BIND will proceed before the first VM_BIND.
> 
> 
> I guess we can deal with that scenario in userspace by doing the wait
> ourselves in one thread per engines.
> 
> But then it makes the VM_BIND input fences useless.
> 
> 
> Daniel : what do you think? Should be rework this or just deal with wait
> fences in userspace?
> 
> 
> Sorry I noticed this late.
> 
> 
> -Lionel
> 
> 

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [Intel-gfx] [RFC v3 1/3] drm/doc/rfc: VM_BIND feature design document
@ 2022-06-01 21:18       ` Matthew Brost
  0 siblings, 0 replies; 121+ messages in thread
From: Matthew Brost @ 2022-06-01 21:18 UTC (permalink / raw)
  To: Lionel Landwerlin
  Cc: intel-gfx, dri-devel, thomas.hellstrom, chris.p.wilson,
	daniel.vetter, christian.koenig

On Wed, Jun 01, 2022 at 05:25:49PM +0300, Lionel Landwerlin wrote:
> On 17/05/2022 21:32, Niranjana Vishwanathapura wrote:
> > +VM_BIND/UNBIND ioctl will immediately start binding/unbinding the mapping in an
> > +async worker. The binding and unbinding will work like a special GPU engine.
> > +The binding and unbinding operations are serialized and will wait on specified
> > +input fences before the operation and will signal the output fences upon the
> > +completion of the operation. Due to serialization, completion of an operation
> > +will also indicate that all previous operations are also complete.
> 
> I guess we should avoid saying "will immediately start binding/unbinding" if
> there are fences involved.
> 
> And the fact that it's happening in an async worker seem to imply it's not
> immediate.
> 
> 
> I have a question on the behavior of the bind operation when no input fence
> is provided. Let say I do :
> 
> VM_BIND (out_fence=fence1)
> 
> VM_BIND (out_fence=fence2)
> 
> VM_BIND (out_fence=fence3)
> 
> 
> In what order are the fences going to be signaled?
> 
> In the order of VM_BIND ioctls? Or out of order?
> 
> Because you wrote "serialized I assume it's : in order
> 
> 
> One thing I didn't realize is that because we only get one "VM_BIND" engine,
> there is a disconnect from the Vulkan specification.
> 
> In Vulkan VM_BIND operations are serialized but per engine.
> 
> So you could have something like this :
> 
> VM_BIND (engine=rcs0, in_fence=fence1, out_fence=fence2)
> 
> VM_BIND (engine=ccs0, in_fence=fence3, out_fence=fence4)
> 

Question - let's say this done after the above operations:

EXEC (engine=ccs0, in_fence=NULL, out_fence=NULL)

Is the exec ordered with respected to bind (i.e. would fence3 & 4 be
signaled before the exec starts)?

Matt

> 
> fence1 is not signaled
> 
> fence3 is signaled
> 
> So the second VM_BIND will proceed before the first VM_BIND.
> 
> 
> I guess we can deal with that scenario in userspace by doing the wait
> ourselves in one thread per engines.
> 
> But then it makes the VM_BIND input fences useless.
> 
> 
> Daniel : what do you think? Should be rework this or just deal with wait
> fences in userspace?
> 
> 
> Sorry I noticed this late.
> 
> 
> -Lionel
> 
> 

^ permalink raw reply	[flat|nested] 121+ messages in thread

* RE: [RFC v3 1/3] drm/doc/rfc: VM_BIND feature design document
  2022-05-17 18:32   ` Niranjana Vishwanathapura
@ 2022-06-02  2:13     ` Zeng, Oak
  -1 siblings, 0 replies; 121+ messages in thread
From: Zeng, Oak @ 2022-06-02  2:13 UTC (permalink / raw)
  To: Vishwanathapura, Niranjana, intel-gfx, dri-devel, Vetter,  Daniel
  Cc: Brost, Matthew, Hellstrom, Thomas, Wilson, Chris P, jason,
	christian.koenig



Regards,
Oak

> -----Original Message-----
> From: dri-devel <dri-devel-bounces@lists.freedesktop.org> On Behalf Of
> Niranjana Vishwanathapura
> Sent: May 17, 2022 2:32 PM
> To: intel-gfx@lists.freedesktop.org; dri-devel@lists.freedesktop.org; Vetter,
> Daniel <daniel.vetter@intel.com>
> Cc: Brost, Matthew <matthew.brost@intel.com>; Hellstrom, Thomas
> <thomas.hellstrom@intel.com>; jason@jlekstrand.net; Wilson, Chris P
> <chris.p.wilson@intel.com>; christian.koenig@amd.com
> Subject: [RFC v3 1/3] drm/doc/rfc: VM_BIND feature design document
> 
> VM_BIND design document with description of intended use cases.
> 
> v2: Add more documentation and format as per review comments
>     from Daniel.
> 
> Signed-off-by: Niranjana Vishwanathapura
> <niranjana.vishwanathapura@intel.com>
> ---
>  Documentation/driver-api/dma-buf.rst   |   2 +
>  Documentation/gpu/rfc/i915_vm_bind.rst | 304
> +++++++++++++++++++++++++
>  Documentation/gpu/rfc/index.rst        |   4 +
>  3 files changed, 310 insertions(+)
>  create mode 100644 Documentation/gpu/rfc/i915_vm_bind.rst
> 
> diff --git a/Documentation/driver-api/dma-buf.rst b/Documentation/driver-
> api/dma-buf.rst
> index 36a76cbe9095..64cb924ec5bb 100644
> --- a/Documentation/driver-api/dma-buf.rst
> +++ b/Documentation/driver-api/dma-buf.rst
> @@ -200,6 +200,8 @@ DMA Fence uABI/Sync File
>  .. kernel-doc:: include/linux/sync_file.h
>     :internal:
> 
> +.. _indefinite_dma_fences:
> +
>  Indefinite DMA Fences
>  ~~~~~~~~~~~~~~~~~~~~~
> 
> diff --git a/Documentation/gpu/rfc/i915_vm_bind.rst
> b/Documentation/gpu/rfc/i915_vm_bind.rst
> new file mode 100644
> index 000000000000..f1be560d313c
> --- /dev/null
> +++ b/Documentation/gpu/rfc/i915_vm_bind.rst
> @@ -0,0 +1,304 @@
> +==========================================
> +I915 VM_BIND feature design and use cases
> +==========================================
> +
> +VM_BIND feature
> +================
> +DRM_I915_GEM_VM_BIND/UNBIND ioctls allows UMD to bind/unbind GEM
> buffer
> +objects (BOs) or sections of a BOs at specified GPU virtual addresses on a
> +specified address space (VM). These mappings (also referred to as persistent
> +mappings) will be persistent across multiple GPU submissions (execbuff calls)
> +issued by the UMD, without user having to provide a list of all required
> +mappings during each submission (as required by older execbuff mode).
> +
> +VM_BIND/UNBIND ioctls will support 'in' and 'out' fences to allow userpace
> +to specify how the binding/unbinding should sync with other operations
> +like the GPU job submission. These fences will be timeline 'drm_syncobj's
> +for non-Compute contexts (See struct
> drm_i915_vm_bind_ext_timeline_fences).
> +For Compute contexts, they will be user/memory fences (See struct
> +drm_i915_vm_bind_ext_user_fence).
> +
> +VM_BIND feature is advertised to user via I915_PARAM_HAS_VM_BIND.
> +User has to opt-in for VM_BIND mode of binding for an address space (VM)
> +during VM creation time via I915_VM_CREATE_FLAGS_USE_VM_BIND
> extension.
> +
> +VM_BIND/UNBIND ioctl will immediately start binding/unbinding the mapping in
> an
> +async worker. The binding and unbinding will work like a special GPU engine.
> +The binding and unbinding operations are serialized and will wait on specified
> +input fences before the operation and will signal the output fences upon the
> +completion of the operation. Due to serialization, completion of an operation
> +will also indicate that all previous operations are also complete.

Hi,

Is user required to wait for the out fence be signaled before submit a gpu job using the vm_bind address?
Or is user required to order the gpu job to make gpu job run after vm_bind out fence signaled?

I think there could be different behavior on a non-faultable platform and a faultable platform, such as on a non-faultable
Platform, gpu job is required to be order after vm_bind out fence signaling; and on a faultable platform, there is no such
Restriction since vm bind can be finished in the fault handler?

Should we document such thing?

Regards,
Oak 


> +
> +VM_BIND features include:
> +
> +* Multiple Virtual Address (VA) mappings can map to the same physical pages
> +  of an object (aliasing).
> +* VA mapping can map to a partial section of the BO (partial binding).
> +* Support capture of persistent mappings in the dump upon GPU error.
> +* TLB is flushed upon unbind completion. Batching of TLB flushes in some
> +  use cases will be helpful.
> +* Asynchronous vm_bind and vm_unbind support with 'in' and 'out' fences.
> +* Support for userptr gem objects (no special uapi is required for this).
> +
> +Execbuff ioctl in VM_BIND mode
> +-------------------------------
> +The execbuff ioctl handling in VM_BIND mode differs significantly from the
> +older method. A VM in VM_BIND mode will not support older execbuff mode of
> +binding. In VM_BIND mode, execbuff ioctl will not accept any execlist. Hence,
> +no support for implicit sync. It is expected that the below work will be able
> +to support requirements of object dependency setting in all use cases:
> +
> +"dma-buf: Add an API for exporting sync files"
> +(https://lwn.net/Articles/859290/)
> +
> +This also means, we need an execbuff extension to pass in the batch
> +buffer addresses (See struct
> drm_i915_gem_execbuffer_ext_batch_addresses).
> +
> +If at all execlist support in execbuff ioctl is deemed necessary for
> +implicit sync in certain use cases, then support can be added later.
> +
> +In VM_BIND mode, VA allocation is completely managed by the user instead of
> +the i915 driver. Hence all VA assignment, eviction are not applicable in
> +VM_BIND mode. Also, for determining object activeness, VM_BIND mode will
> not
> +be using the i915_vma active reference tracking. It will instead use dma-resv
> +object for that (See `VM_BIND dma_resv usage`_).
> +
> +So, a lot of existing code in the execbuff path like relocations, VA evictions,
> +vma lookup table, implicit sync, vma active reference tracking etc., are not
> +applicable in VM_BIND mode. Hence, the execbuff path needs to be cleaned up
> +by clearly separating out the functionalities where the VM_BIND mode differs
> +from older method and they should be moved to separate files.
> +
> +VM_PRIVATE objects
> +-------------------
> +By default, BOs can be mapped on multiple VMs and can also be dma-buf
> +exported. Hence these BOs are referred to as Shared BOs.
> +During each execbuff submission, the request fence must be added to the
> +dma-resv fence list of all shared BOs mapped on the VM.
> +
> +VM_BIND feature introduces an optimization where user can create BO which
> +is private to a specified VM via I915_GEM_CREATE_EXT_VM_PRIVATE flag
> during
> +BO creation. Unlike Shared BOs, these VM private BOs can only be mapped on
> +the VM they are private to and can't be dma-buf exported.
> +All private BOs of a VM share the dma-resv object. Hence during each execbuff
> +submission, they need only one dma-resv fence list updated. Thus, the fast
> +path (where required mappings are already bound) submission latency is O(1)
> +w.r.t the number of VM private BOs.
> +
> +VM_BIND locking hirarchy
> +-------------------------
> +The locking design here supports the older (execlist based) execbuff mode, the
> +newer VM_BIND mode, the VM_BIND mode with GPU page faults and possible
> future
> +system allocator support (See `Shared Virtual Memory (SVM) support`_).
> +The older execbuff mode and the newer VM_BIND mode without page faults
> manages
> +residency of backing storage using dma_fence. The VM_BIND mode with page
> faults
> +and the system allocator support do not use any dma_fence at all.
> +
> +VM_BIND locking order is as below.
> +
> +1) Lock-A: A vm_bind mutex will protect vm_bind lists. This lock is taken in
> +   vm_bind/vm_unbind ioctl calls, in the execbuff path and while releasing the
> +   mapping.
> +
> +   In future, when GPU page faults are supported, we can potentially use a
> +   rwsem instead, so that multiple page fault handlers can take the read side
> +   lock to lookup the mapping and hence can run in parallel.
> +   The older execbuff mode of binding do not need this lock.
> +
> +2) Lock-B: The object's dma-resv lock will protect i915_vma state and needs to
> +   be held while binding/unbinding a vma in the async worker and while updating
> +   dma-resv fence list of an object. Note that private BOs of a VM will all
> +   share a dma-resv object.
> +
> +   The future system allocator support will use the HMM prescribed locking
> +   instead.
> +
> +3) Lock-C: Spinlock/s to protect some of the VM's lists like the list of
> +   invalidated vmas (due to eviction and userptr invalidation) etc.
> +
> +When GPU page faults are supported, the execbuff path do not take any of
> these
> +locks. There we will simply smash the new batch buffer address into the ring
> and
> +then tell the scheduler run that. The lock taking only happens from the page
> +fault handler, where we take lock-A in read mode, whichever lock-B we need to
> +find the backing storage (dma_resv lock for gem objects, and hmm/core mm for
> +system allocator) and some additional locks (lock-D) for taking care of page
> +table races. Page fault mode should not need to ever manipulate the vm lists,
> +so won't ever need lock-C.
> +
> +VM_BIND LRU handling
> +---------------------
> +We need to ensure VM_BIND mapped objects are properly LRU tagged to avoid
> +performance degradation. We will also need support for bulk LRU movement of
> +VM_BIND objects to avoid additional latencies in execbuff path.
> +
> +The page table pages are similar to VM_BIND mapped objects (See
> +`Evictable page table allocations`_) and are maintained per VM and needs to
> +be pinned in memory when VM is made active (ie., upon an execbuff call with
> +that VM). So, bulk LRU movement of page table pages is also needed.
> +
> +The i915 shrinker LRU has stopped being an LRU. So, it should also be moved
> +over to the ttm LRU in some fashion to make sure we once again have a
> reasonable
> +and consistent memory aging and reclaim architecture.
> +
> +VM_BIND dma_resv usage
> +-----------------------
> +Fences needs to be added to all VM_BIND mapped objects. During each
> execbuff
> +submission, they are added with DMA_RESV_USAGE_BOOKKEEP usage to
> prevent
> +over sync (See enum dma_resv_usage). One can override it with either
> +DMA_RESV_USAGE_READ or DMA_RESV_USAGE_WRITE usage during object
> dependency
> +setting (either through explicit or implicit mechanism).
> +
> +When vm_bind is called for a non-private object while the VM is already
> +active, the fences need to be copied from VM's shared dma-resv object
> +(common to all private objects of the VM) to this non-private object.
> +If this results in performance degradation, then some optimization will
> +be needed here. This is not a problem for VM's private objects as they use
> +shared dma-resv object which is always updated on each execbuff submission.
> +
> +Also, in VM_BIND mode, use dma-resv apis for determining object activeness
> +(See dma_resv_test_signaled() and dma_resv_wait_timeout()) and do not use
> the
> +older i915_vma active reference tracking which is deprecated. This should be
> +easier to get it working with the current TTM backend. We can remove the
> +i915_vma active reference tracking fully while supporting TTM backend for igfx.
> +
> +Evictable page table allocations
> +---------------------------------
> +Make pagetable allocations evictable and manage them similar to VM_BIND
> +mapped objects. Page table pages are similar to persistent mappings of a
> +VM (difference here are that the page table pages will not have an i915_vma
> +structure and after swapping pages back in, parent page link needs to be
> +updated).
> +
> +Mesa use case
> +--------------
> +VM_BIND can potentially reduce the CPU overhead in Mesa (both Vulkan and
> Iris),
> +hence improving performance of CPU-bound applications. It also allows us to
> +implement Vulkan's Sparse Resources. With increasing GPU hardware
> performance,
> +reducing CPU overhead becomes more impactful.
> +
> +
> +VM_BIND Compute support
> +========================
> +
> +User/Memory Fence
> +------------------
> +The idea is to take a user specified virtual address and install an interrupt
> +handler to wake up the current task when the memory location passes the user
> +supplied filter. User/Memory fence is a <address, value> pair. To signal the
> +user fence, specified value will be written at the specified virtual address
> +and wakeup the waiting process. User can wait on a user fence with the
> +gem_wait_user_fence ioctl.
> +
> +It also allows the user to emit their own MI_FLUSH/PIPE_CONTROL notify
> +interrupt within their batches after updating the value to have sub-batch
> +precision on the wakeup. Each batch can signal a user fence to indicate
> +the completion of next level batch. The completion of very first level batch
> +needs to be signaled by the command streamer. The user must provide the
> +user/memory fence for this via the
> DRM_I915_GEM_EXECBUFFER_EXT_USER_FENCE
> +extension of execbuff ioctl, so that KMD can setup the command streamer to
> +signal it.
> +
> +User/Memory fence can also be supplied to the kernel driver to signal/wake up
> +the user process after completion of an asynchronous operation.
> +
> +When VM_BIND ioctl was provided with a user/memory fence via the
> +I915_VM_BIND_EXT_USER_FENCE extension, it will be signaled upon the
> completion
> +of binding of that mapping. All async binds/unbinds are serialized, hence
> +signaling of user/memory fence also indicate the completion of all previous
> +binds/unbinds.
> +
> +This feature will be derived from the below original work:
> +https://patchwork.freedesktop.org/patch/349417/
> +
> +Long running Compute contexts
> +------------------------------
> +Usage of dma-fence expects that they complete in reasonable amount of time.
> +Compute on the other hand can be long running. Hence it is appropriate for
> +compute to use user/memory fence and dma-fence usage will be limited to
> +in-kernel consumption only. This requires an execbuff uapi extension to pass
> +in user fence (See struct drm_i915_vm_bind_ext_user_fence). Compute must
> opt-in
> +for this mechanism with I915_CONTEXT_CREATE_FLAGS_LONG_RUNNING flag
> during
> +context creation. The dma-fence based user interfaces like gem_wait ioctl and
> +execbuff out fence are not allowed on long running contexts. Implicit sync is
> +not valid as well and is anyway not supported in VM_BIND mode.
> +
> +Where GPU page faults are not available, kernel driver upon buffer invalidation
> +will initiate a suspend (preemption) of long running context with a dma-fence
> +attached to it. And upon completion of that suspend fence, finish the
> +invalidation, revalidate the BO and then resume the compute context. This is
> +done by having a per-context preempt fence (also called suspend fence)
> proxying
> +as i915_request fence. This suspend fence is enabled when someone tries to
> wait
> +on it, which then triggers the context preemption.
> +
> +As this support for context suspension using a preempt fence and the resume
> work
> +for the compute mode contexts can get tricky to get it right, it is better to
> +add this support in drm scheduler so that multiple drivers can make use of it.
> +That means, it will have a dependency on i915 drm scheduler conversion with
> GuC
> +scheduler backend. This should be fine, as the plan is to support compute mode
> +contexts only with GuC scheduler backend (at least initially). This is much
> +easier to support with VM_BIND mode compared to the current heavier
> execbuff
> +path resource attachment.
> +
> +Low Latency Submission
> +-----------------------
> +Allows compute UMD to directly submit GPU jobs instead of through execbuff
> +ioctl. This is made possible by VM_BIND is not being synchronized against
> +execbuff. VM_BIND allows bind/unbind of mappings required for the directly
> +submitted jobs.
> +
> +Other VM_BIND use cases
> +========================
> +
> +Debugger
> +---------
> +With debug event interface user space process (debugger) is able to keep track
> +of and act upon resources created by another process (debugged) and attached
> +to GPU via vm_bind interface.
> +
> +GPU page faults
> +----------------
> +GPU page faults when supported (in future), will only be supported in the
> +VM_BIND mode. While both the older execbuff mode and the newer VM_BIND
> mode of
> +binding will require using dma-fence to ensure residency, the GPU page faults
> +mode when supported, will not use any dma-fence as residency is purely
> managed
> +by installing and removing/invalidating page table entries.
> +
> +Page level hints settings
> +--------------------------
> +VM_BIND allows any hints setting per mapping instead of per BO.
> +Possible hints include read-only mapping, placement and atomicity.
> +Sub-BO level placement hint will be even more relevant with
> +upcoming GPU on-demand page fault support.
> +
> +Page level Cache/CLOS settings
> +-------------------------------
> +VM_BIND allows cache/CLOS settings per mapping instead of per BO.
> +
> +Shared Virtual Memory (SVM) support
> +------------------------------------
> +VM_BIND interface can be used to map system memory directly (without gem
> BO
> +abstraction) using the HMM interface. SVM is only supported with GPU page
> +faults enabled.
> +
> +
> +Broder i915 cleanups
> +=====================
> +Supporting this whole new vm_bind mode of binding which comes with its own
> +use cases to support and the locking requirements requires proper integration
> +with the existing i915 driver. This calls for some broader i915 driver
> +cleanups/simplifications for maintainability of the driver going forward.
> +Here are few things identified and are being looked into.
> +
> +- Remove vma lookup cache (eb->gem_context->handles_vma). VM_BIND
> feature
> +  do not use it and complexity it brings in is probably more than the
> +  performance advantage we get in legacy execbuff case.
> +- Remove vma->open_count counting
> +- Remove i915_vma active reference tracking. VM_BIND feature will not be
> using
> +  it. Instead use underlying BO's dma-resv fence list to determine if a i915_vma
> +  is active or not.
> +
> +
> +VM_BIND UAPI
> +=============
> +
> +.. kernel-doc:: Documentation/gpu/rfc/i915_vm_bind.h
> diff --git a/Documentation/gpu/rfc/index.rst b/Documentation/gpu/rfc/index.rst
> index 91e93a705230..7d10c36b268d 100644
> --- a/Documentation/gpu/rfc/index.rst
> +++ b/Documentation/gpu/rfc/index.rst
> @@ -23,3 +23,7 @@ host such documentation:
>  .. toctree::
> 
>      i915_scheduler.rst
> +
> +.. toctree::
> +
> +    i915_vm_bind.rst
> --
> 2.21.0.rc0.32.g243a4c7e27


^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [Intel-gfx] [RFC v3 1/3] drm/doc/rfc: VM_BIND feature design document
@ 2022-06-02  2:13     ` Zeng, Oak
  0 siblings, 0 replies; 121+ messages in thread
From: Zeng, Oak @ 2022-06-02  2:13 UTC (permalink / raw)
  To: Vishwanathapura, Niranjana, intel-gfx, dri-devel, Vetter,  Daniel
  Cc: Hellstrom, Thomas, Wilson, Chris P, christian.koenig



Regards,
Oak

> -----Original Message-----
> From: dri-devel <dri-devel-bounces@lists.freedesktop.org> On Behalf Of
> Niranjana Vishwanathapura
> Sent: May 17, 2022 2:32 PM
> To: intel-gfx@lists.freedesktop.org; dri-devel@lists.freedesktop.org; Vetter,
> Daniel <daniel.vetter@intel.com>
> Cc: Brost, Matthew <matthew.brost@intel.com>; Hellstrom, Thomas
> <thomas.hellstrom@intel.com>; jason@jlekstrand.net; Wilson, Chris P
> <chris.p.wilson@intel.com>; christian.koenig@amd.com
> Subject: [RFC v3 1/3] drm/doc/rfc: VM_BIND feature design document
> 
> VM_BIND design document with description of intended use cases.
> 
> v2: Add more documentation and format as per review comments
>     from Daniel.
> 
> Signed-off-by: Niranjana Vishwanathapura
> <niranjana.vishwanathapura@intel.com>
> ---
>  Documentation/driver-api/dma-buf.rst   |   2 +
>  Documentation/gpu/rfc/i915_vm_bind.rst | 304
> +++++++++++++++++++++++++
>  Documentation/gpu/rfc/index.rst        |   4 +
>  3 files changed, 310 insertions(+)
>  create mode 100644 Documentation/gpu/rfc/i915_vm_bind.rst
> 
> diff --git a/Documentation/driver-api/dma-buf.rst b/Documentation/driver-
> api/dma-buf.rst
> index 36a76cbe9095..64cb924ec5bb 100644
> --- a/Documentation/driver-api/dma-buf.rst
> +++ b/Documentation/driver-api/dma-buf.rst
> @@ -200,6 +200,8 @@ DMA Fence uABI/Sync File
>  .. kernel-doc:: include/linux/sync_file.h
>     :internal:
> 
> +.. _indefinite_dma_fences:
> +
>  Indefinite DMA Fences
>  ~~~~~~~~~~~~~~~~~~~~~
> 
> diff --git a/Documentation/gpu/rfc/i915_vm_bind.rst
> b/Documentation/gpu/rfc/i915_vm_bind.rst
> new file mode 100644
> index 000000000000..f1be560d313c
> --- /dev/null
> +++ b/Documentation/gpu/rfc/i915_vm_bind.rst
> @@ -0,0 +1,304 @@
> +==========================================
> +I915 VM_BIND feature design and use cases
> +==========================================
> +
> +VM_BIND feature
> +================
> +DRM_I915_GEM_VM_BIND/UNBIND ioctls allows UMD to bind/unbind GEM
> buffer
> +objects (BOs) or sections of a BOs at specified GPU virtual addresses on a
> +specified address space (VM). These mappings (also referred to as persistent
> +mappings) will be persistent across multiple GPU submissions (execbuff calls)
> +issued by the UMD, without user having to provide a list of all required
> +mappings during each submission (as required by older execbuff mode).
> +
> +VM_BIND/UNBIND ioctls will support 'in' and 'out' fences to allow userpace
> +to specify how the binding/unbinding should sync with other operations
> +like the GPU job submission. These fences will be timeline 'drm_syncobj's
> +for non-Compute contexts (See struct
> drm_i915_vm_bind_ext_timeline_fences).
> +For Compute contexts, they will be user/memory fences (See struct
> +drm_i915_vm_bind_ext_user_fence).
> +
> +VM_BIND feature is advertised to user via I915_PARAM_HAS_VM_BIND.
> +User has to opt-in for VM_BIND mode of binding for an address space (VM)
> +during VM creation time via I915_VM_CREATE_FLAGS_USE_VM_BIND
> extension.
> +
> +VM_BIND/UNBIND ioctl will immediately start binding/unbinding the mapping in
> an
> +async worker. The binding and unbinding will work like a special GPU engine.
> +The binding and unbinding operations are serialized and will wait on specified
> +input fences before the operation and will signal the output fences upon the
> +completion of the operation. Due to serialization, completion of an operation
> +will also indicate that all previous operations are also complete.

Hi,

Is user required to wait for the out fence be signaled before submit a gpu job using the vm_bind address?
Or is user required to order the gpu job to make gpu job run after vm_bind out fence signaled?

I think there could be different behavior on a non-faultable platform and a faultable platform, such as on a non-faultable
Platform, gpu job is required to be order after vm_bind out fence signaling; and on a faultable platform, there is no such
Restriction since vm bind can be finished in the fault handler?

Should we document such thing?

Regards,
Oak 


> +
> +VM_BIND features include:
> +
> +* Multiple Virtual Address (VA) mappings can map to the same physical pages
> +  of an object (aliasing).
> +* VA mapping can map to a partial section of the BO (partial binding).
> +* Support capture of persistent mappings in the dump upon GPU error.
> +* TLB is flushed upon unbind completion. Batching of TLB flushes in some
> +  use cases will be helpful.
> +* Asynchronous vm_bind and vm_unbind support with 'in' and 'out' fences.
> +* Support for userptr gem objects (no special uapi is required for this).
> +
> +Execbuff ioctl in VM_BIND mode
> +-------------------------------
> +The execbuff ioctl handling in VM_BIND mode differs significantly from the
> +older method. A VM in VM_BIND mode will not support older execbuff mode of
> +binding. In VM_BIND mode, execbuff ioctl will not accept any execlist. Hence,
> +no support for implicit sync. It is expected that the below work will be able
> +to support requirements of object dependency setting in all use cases:
> +
> +"dma-buf: Add an API for exporting sync files"
> +(https://lwn.net/Articles/859290/)
> +
> +This also means, we need an execbuff extension to pass in the batch
> +buffer addresses (See struct
> drm_i915_gem_execbuffer_ext_batch_addresses).
> +
> +If at all execlist support in execbuff ioctl is deemed necessary for
> +implicit sync in certain use cases, then support can be added later.
> +
> +In VM_BIND mode, VA allocation is completely managed by the user instead of
> +the i915 driver. Hence all VA assignment, eviction are not applicable in
> +VM_BIND mode. Also, for determining object activeness, VM_BIND mode will
> not
> +be using the i915_vma active reference tracking. It will instead use dma-resv
> +object for that (See `VM_BIND dma_resv usage`_).
> +
> +So, a lot of existing code in the execbuff path like relocations, VA evictions,
> +vma lookup table, implicit sync, vma active reference tracking etc., are not
> +applicable in VM_BIND mode. Hence, the execbuff path needs to be cleaned up
> +by clearly separating out the functionalities where the VM_BIND mode differs
> +from older method and they should be moved to separate files.
> +
> +VM_PRIVATE objects
> +-------------------
> +By default, BOs can be mapped on multiple VMs and can also be dma-buf
> +exported. Hence these BOs are referred to as Shared BOs.
> +During each execbuff submission, the request fence must be added to the
> +dma-resv fence list of all shared BOs mapped on the VM.
> +
> +VM_BIND feature introduces an optimization where user can create BO which
> +is private to a specified VM via I915_GEM_CREATE_EXT_VM_PRIVATE flag
> during
> +BO creation. Unlike Shared BOs, these VM private BOs can only be mapped on
> +the VM they are private to and can't be dma-buf exported.
> +All private BOs of a VM share the dma-resv object. Hence during each execbuff
> +submission, they need only one dma-resv fence list updated. Thus, the fast
> +path (where required mappings are already bound) submission latency is O(1)
> +w.r.t the number of VM private BOs.
> +
> +VM_BIND locking hirarchy
> +-------------------------
> +The locking design here supports the older (execlist based) execbuff mode, the
> +newer VM_BIND mode, the VM_BIND mode with GPU page faults and possible
> future
> +system allocator support (See `Shared Virtual Memory (SVM) support`_).
> +The older execbuff mode and the newer VM_BIND mode without page faults
> manages
> +residency of backing storage using dma_fence. The VM_BIND mode with page
> faults
> +and the system allocator support do not use any dma_fence at all.
> +
> +VM_BIND locking order is as below.
> +
> +1) Lock-A: A vm_bind mutex will protect vm_bind lists. This lock is taken in
> +   vm_bind/vm_unbind ioctl calls, in the execbuff path and while releasing the
> +   mapping.
> +
> +   In future, when GPU page faults are supported, we can potentially use a
> +   rwsem instead, so that multiple page fault handlers can take the read side
> +   lock to lookup the mapping and hence can run in parallel.
> +   The older execbuff mode of binding do not need this lock.
> +
> +2) Lock-B: The object's dma-resv lock will protect i915_vma state and needs to
> +   be held while binding/unbinding a vma in the async worker and while updating
> +   dma-resv fence list of an object. Note that private BOs of a VM will all
> +   share a dma-resv object.
> +
> +   The future system allocator support will use the HMM prescribed locking
> +   instead.
> +
> +3) Lock-C: Spinlock/s to protect some of the VM's lists like the list of
> +   invalidated vmas (due to eviction and userptr invalidation) etc.
> +
> +When GPU page faults are supported, the execbuff path do not take any of
> these
> +locks. There we will simply smash the new batch buffer address into the ring
> and
> +then tell the scheduler run that. The lock taking only happens from the page
> +fault handler, where we take lock-A in read mode, whichever lock-B we need to
> +find the backing storage (dma_resv lock for gem objects, and hmm/core mm for
> +system allocator) and some additional locks (lock-D) for taking care of page
> +table races. Page fault mode should not need to ever manipulate the vm lists,
> +so won't ever need lock-C.
> +
> +VM_BIND LRU handling
> +---------------------
> +We need to ensure VM_BIND mapped objects are properly LRU tagged to avoid
> +performance degradation. We will also need support for bulk LRU movement of
> +VM_BIND objects to avoid additional latencies in execbuff path.
> +
> +The page table pages are similar to VM_BIND mapped objects (See
> +`Evictable page table allocations`_) and are maintained per VM and needs to
> +be pinned in memory when VM is made active (ie., upon an execbuff call with
> +that VM). So, bulk LRU movement of page table pages is also needed.
> +
> +The i915 shrinker LRU has stopped being an LRU. So, it should also be moved
> +over to the ttm LRU in some fashion to make sure we once again have a
> reasonable
> +and consistent memory aging and reclaim architecture.
> +
> +VM_BIND dma_resv usage
> +-----------------------
> +Fences needs to be added to all VM_BIND mapped objects. During each
> execbuff
> +submission, they are added with DMA_RESV_USAGE_BOOKKEEP usage to
> prevent
> +over sync (See enum dma_resv_usage). One can override it with either
> +DMA_RESV_USAGE_READ or DMA_RESV_USAGE_WRITE usage during object
> dependency
> +setting (either through explicit or implicit mechanism).
> +
> +When vm_bind is called for a non-private object while the VM is already
> +active, the fences need to be copied from VM's shared dma-resv object
> +(common to all private objects of the VM) to this non-private object.
> +If this results in performance degradation, then some optimization will
> +be needed here. This is not a problem for VM's private objects as they use
> +shared dma-resv object which is always updated on each execbuff submission.
> +
> +Also, in VM_BIND mode, use dma-resv apis for determining object activeness
> +(See dma_resv_test_signaled() and dma_resv_wait_timeout()) and do not use
> the
> +older i915_vma active reference tracking which is deprecated. This should be
> +easier to get it working with the current TTM backend. We can remove the
> +i915_vma active reference tracking fully while supporting TTM backend for igfx.
> +
> +Evictable page table allocations
> +---------------------------------
> +Make pagetable allocations evictable and manage them similar to VM_BIND
> +mapped objects. Page table pages are similar to persistent mappings of a
> +VM (difference here are that the page table pages will not have an i915_vma
> +structure and after swapping pages back in, parent page link needs to be
> +updated).
> +
> +Mesa use case
> +--------------
> +VM_BIND can potentially reduce the CPU overhead in Mesa (both Vulkan and
> Iris),
> +hence improving performance of CPU-bound applications. It also allows us to
> +implement Vulkan's Sparse Resources. With increasing GPU hardware
> performance,
> +reducing CPU overhead becomes more impactful.
> +
> +
> +VM_BIND Compute support
> +========================
> +
> +User/Memory Fence
> +------------------
> +The idea is to take a user specified virtual address and install an interrupt
> +handler to wake up the current task when the memory location passes the user
> +supplied filter. User/Memory fence is a <address, value> pair. To signal the
> +user fence, specified value will be written at the specified virtual address
> +and wakeup the waiting process. User can wait on a user fence with the
> +gem_wait_user_fence ioctl.
> +
> +It also allows the user to emit their own MI_FLUSH/PIPE_CONTROL notify
> +interrupt within their batches after updating the value to have sub-batch
> +precision on the wakeup. Each batch can signal a user fence to indicate
> +the completion of next level batch. The completion of very first level batch
> +needs to be signaled by the command streamer. The user must provide the
> +user/memory fence for this via the
> DRM_I915_GEM_EXECBUFFER_EXT_USER_FENCE
> +extension of execbuff ioctl, so that KMD can setup the command streamer to
> +signal it.
> +
> +User/Memory fence can also be supplied to the kernel driver to signal/wake up
> +the user process after completion of an asynchronous operation.
> +
> +When VM_BIND ioctl was provided with a user/memory fence via the
> +I915_VM_BIND_EXT_USER_FENCE extension, it will be signaled upon the
> completion
> +of binding of that mapping. All async binds/unbinds are serialized, hence
> +signaling of user/memory fence also indicate the completion of all previous
> +binds/unbinds.
> +
> +This feature will be derived from the below original work:
> +https://patchwork.freedesktop.org/patch/349417/
> +
> +Long running Compute contexts
> +------------------------------
> +Usage of dma-fence expects that they complete in reasonable amount of time.
> +Compute on the other hand can be long running. Hence it is appropriate for
> +compute to use user/memory fence and dma-fence usage will be limited to
> +in-kernel consumption only. This requires an execbuff uapi extension to pass
> +in user fence (See struct drm_i915_vm_bind_ext_user_fence). Compute must
> opt-in
> +for this mechanism with I915_CONTEXT_CREATE_FLAGS_LONG_RUNNING flag
> during
> +context creation. The dma-fence based user interfaces like gem_wait ioctl and
> +execbuff out fence are not allowed on long running contexts. Implicit sync is
> +not valid as well and is anyway not supported in VM_BIND mode.
> +
> +Where GPU page faults are not available, kernel driver upon buffer invalidation
> +will initiate a suspend (preemption) of long running context with a dma-fence
> +attached to it. And upon completion of that suspend fence, finish the
> +invalidation, revalidate the BO and then resume the compute context. This is
> +done by having a per-context preempt fence (also called suspend fence)
> proxying
> +as i915_request fence. This suspend fence is enabled when someone tries to
> wait
> +on it, which then triggers the context preemption.
> +
> +As this support for context suspension using a preempt fence and the resume
> work
> +for the compute mode contexts can get tricky to get it right, it is better to
> +add this support in drm scheduler so that multiple drivers can make use of it.
> +That means, it will have a dependency on i915 drm scheduler conversion with
> GuC
> +scheduler backend. This should be fine, as the plan is to support compute mode
> +contexts only with GuC scheduler backend (at least initially). This is much
> +easier to support with VM_BIND mode compared to the current heavier
> execbuff
> +path resource attachment.
> +
> +Low Latency Submission
> +-----------------------
> +Allows compute UMD to directly submit GPU jobs instead of through execbuff
> +ioctl. This is made possible by VM_BIND is not being synchronized against
> +execbuff. VM_BIND allows bind/unbind of mappings required for the directly
> +submitted jobs.
> +
> +Other VM_BIND use cases
> +========================
> +
> +Debugger
> +---------
> +With debug event interface user space process (debugger) is able to keep track
> +of and act upon resources created by another process (debugged) and attached
> +to GPU via vm_bind interface.
> +
> +GPU page faults
> +----------------
> +GPU page faults when supported (in future), will only be supported in the
> +VM_BIND mode. While both the older execbuff mode and the newer VM_BIND
> mode of
> +binding will require using dma-fence to ensure residency, the GPU page faults
> +mode when supported, will not use any dma-fence as residency is purely
> managed
> +by installing and removing/invalidating page table entries.
> +
> +Page level hints settings
> +--------------------------
> +VM_BIND allows any hints setting per mapping instead of per BO.
> +Possible hints include read-only mapping, placement and atomicity.
> +Sub-BO level placement hint will be even more relevant with
> +upcoming GPU on-demand page fault support.
> +
> +Page level Cache/CLOS settings
> +-------------------------------
> +VM_BIND allows cache/CLOS settings per mapping instead of per BO.
> +
> +Shared Virtual Memory (SVM) support
> +------------------------------------
> +VM_BIND interface can be used to map system memory directly (without gem
> BO
> +abstraction) using the HMM interface. SVM is only supported with GPU page
> +faults enabled.
> +
> +
> +Broder i915 cleanups
> +=====================
> +Supporting this whole new vm_bind mode of binding which comes with its own
> +use cases to support and the locking requirements requires proper integration
> +with the existing i915 driver. This calls for some broader i915 driver
> +cleanups/simplifications for maintainability of the driver going forward.
> +Here are few things identified and are being looked into.
> +
> +- Remove vma lookup cache (eb->gem_context->handles_vma). VM_BIND
> feature
> +  do not use it and complexity it brings in is probably more than the
> +  performance advantage we get in legacy execbuff case.
> +- Remove vma->open_count counting
> +- Remove i915_vma active reference tracking. VM_BIND feature will not be
> using
> +  it. Instead use underlying BO's dma-resv fence list to determine if a i915_vma
> +  is active or not.
> +
> +
> +VM_BIND UAPI
> +=============
> +
> +.. kernel-doc:: Documentation/gpu/rfc/i915_vm_bind.h
> diff --git a/Documentation/gpu/rfc/index.rst b/Documentation/gpu/rfc/index.rst
> index 91e93a705230..7d10c36b268d 100644
> --- a/Documentation/gpu/rfc/index.rst
> +++ b/Documentation/gpu/rfc/index.rst
> @@ -23,3 +23,7 @@ host such documentation:
>  .. toctree::
> 
>      i915_scheduler.rst
> +
> +.. toctree::
> +
> +    i915_vm_bind.rst
> --
> 2.21.0.rc0.32.g243a4c7e27


^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [Intel-gfx] [RFC v3 3/3] drm/doc/rfc: VM_BIND uapi definition
  2022-06-01  9:27           ` Daniel Vetter
@ 2022-06-02  5:08             ` Niranjana Vishwanathapura
  -1 siblings, 0 replies; 121+ messages in thread
From: Niranjana Vishwanathapura @ 2022-06-02  5:08 UTC (permalink / raw)
  To: Daniel Vetter
  Cc: Zanoni, Paulo R, intel-gfx, dri-devel, Hellstrom, Thomas, Wilson,
	Chris P, Vetter, Daniel, christian.koenig

On Wed, Jun 01, 2022 at 11:27:17AM +0200, Daniel Vetter wrote:
>On Wed, 1 Jun 2022 at 11:03, Dave Airlie <airlied@gmail.com> wrote:
>>
>> On Tue, 24 May 2022 at 05:20, Niranjana Vishwanathapura
>> <niranjana.vishwanathapura@intel.com> wrote:
>> >
>> > On Thu, May 19, 2022 at 04:07:30PM -0700, Zanoni, Paulo R wrote:
>> > >On Tue, 2022-05-17 at 11:32 -0700, Niranjana Vishwanathapura wrote:
>> > >> VM_BIND and related uapi definitions
>> > >>
>> > >> v2: Ensure proper kernel-doc formatting with cross references.
>> > >>     Also add new uapi and documentation as per review comments
>> > >>     from Daniel.
>> > >>
>> > >> Signed-off-by: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>
>> > >> ---
>> > >>  Documentation/gpu/rfc/i915_vm_bind.h | 399 +++++++++++++++++++++++++++
>> > >>  1 file changed, 399 insertions(+)
>> > >>  create mode 100644 Documentation/gpu/rfc/i915_vm_bind.h
>> > >>
>> > >> diff --git a/Documentation/gpu/rfc/i915_vm_bind.h b/Documentation/gpu/rfc/i915_vm_bind.h
>> > >> new file mode 100644
>> > >> index 000000000000..589c0a009107
>> > >> --- /dev/null
>> > >> +++ b/Documentation/gpu/rfc/i915_vm_bind.h
>> > >> @@ -0,0 +1,399 @@
>> > >> +/* SPDX-License-Identifier: MIT */
>> > >> +/*
>> > >> + * Copyright © 2022 Intel Corporation
>> > >> + */
>> > >> +
>> > >> +/**
>> > >> + * DOC: I915_PARAM_HAS_VM_BIND
>> > >> + *
>> > >> + * VM_BIND feature availability.
>> > >> + * See typedef drm_i915_getparam_t param.
>> > >> + */
>> > >> +#define I915_PARAM_HAS_VM_BIND               57
>> > >> +
>> > >> +/**
>> > >> + * DOC: I915_VM_CREATE_FLAGS_USE_VM_BIND
>> > >> + *
>> > >> + * Flag to opt-in for VM_BIND mode of binding during VM creation.
>> > >> + * See struct drm_i915_gem_vm_control flags.
>> > >> + *
>> > >> + * A VM in VM_BIND mode will not support the older execbuff mode of binding.
>> > >> + * In VM_BIND mode, execbuff ioctl will not accept any execlist (ie., the
>> > >> + * &drm_i915_gem_execbuffer2.buffer_count must be 0).
>> > >> + * Also, &drm_i915_gem_execbuffer2.batch_start_offset and
>> > >> + * &drm_i915_gem_execbuffer2.batch_len must be 0.
>> > >> + * DRM_I915_GEM_EXECBUFFER_EXT_BATCH_ADDRESSES extension must be provided
>> > >> + * to pass in the batch buffer addresses.
>> > >> + *
>> > >> + * Additionally, I915_EXEC_NO_RELOC, I915_EXEC_HANDLE_LUT and
>> > >> + * I915_EXEC_BATCH_FIRST of &drm_i915_gem_execbuffer2.flags must be 0
>> > >> + * (not used) in VM_BIND mode. I915_EXEC_USE_EXTENSIONS flag must always be
>> > >> + * set (See struct drm_i915_gem_execbuffer_ext_batch_addresses).
>> > >> + * The buffers_ptr, buffer_count, batch_start_offset and batch_len fields
>> > >> + * of struct drm_i915_gem_execbuffer2 are also not used and must be 0.
>> > >> + */
>> > >
>> > >From that description, it seems we have:
>> > >
>> > >struct drm_i915_gem_execbuffer2 {
>> > >        __u64 buffers_ptr;              -> must be 0 (new)
>> > >        __u32 buffer_count;             -> must be 0 (new)
>> > >        __u32 batch_start_offset;       -> must be 0 (new)
>> > >        __u32 batch_len;                -> must be 0 (new)
>> > >        __u32 DR1;                      -> must be 0 (old)
>> > >        __u32 DR4;                      -> must be 0 (old)
>> > >        __u32 num_cliprects; (fences)   -> must be 0 since using extensions
>> > >        __u64 cliprects_ptr; (fences, extensions) -> contains an actual pointer!
>> > >        __u64 flags;                    -> some flags must be 0 (new)
>> > >        __u64 rsvd1; (context info)     -> repurposed field (old)
>> > >        __u64 rsvd2;                    -> unused
>> > >};
>> > >
>> > >Based on that, why can't we just get drm_i915_gem_execbuffer3 instead
>> > >of adding even more complexity to an already abused interface? While
>> > >the Vulkan-like extension thing is really nice, I don't think what
>> > >we're doing here is extending the ioctl usage, we're completely
>> > >changing how the base struct should be interpreted based on how the VM
>> > >was created (which is an entirely different ioctl).
>> > >
>> > >From Rusty Russel's API Design grading, drm_i915_gem_execbuffer2 is
>> > >already at -6 without these changes. I think after vm_bind we'll need
>> > >to create a -11 entry just to deal with this ioctl.
>> > >
>> >
>> > The only change here is removing the execlist support for VM_BIND
>> > mode (other than natual extensions).
>> > Adding a new execbuffer3 was considered, but I think we need to be careful
>> > with that as that goes beyond the VM_BIND support, including any future
>> > requirements (as we don't want an execbuffer4 after VM_BIND).
>>
>> Why not? it's not like adding extensions here is really that different
>> than adding new ioctls.
>>
>> I definitely think this deserves an execbuffer3 without even
>> considering future requirements. Just  to burn down the old
>> requirements and pointless fields.
>>
>> Make execbuffer3 be vm bind only, no relocs, no legacy bits, leave the
>> older sw on execbuf2 for ever.
>
>I guess another point in favour of execbuf3 would be that it's less
>midlayer. If we share the entry point then there's quite a few vfuncs
>needed to cleanly split out the vm_bind paths from the legacy
>reloc/softping paths.
>
>If we invert this and do execbuf3, then there's the existing ioctl
>vfunc, and then we share code (where it even makes sense, probably
>request setup/submit need to be shared, anything else is probably
>cleaner to just copypaste) with the usual helper approach.
>
>Also that would guarantee that really none of the old concepts like
>i915_active on the vma or vma open counts and all that stuff leaks
>into the new vm_bind execbuf.
>
>Finally I also think that copypasting would make backporting easier,
>or at least more flexible, since it should make it easier to have the
>upstream vm_bind co-exist with all the other things we have. Without
>huge amounts of conflicts (or at least much less) that pushing a pile
>of vfuncs into the existing code would cause.
>
>So maybe we should do this?

Thanks Dave, Daniel.
There are a few things that will be common between execbuf2 and
execbuf3, like request setup/submit (as you said), fence handling 
(timeline fences, fence array, composite fences), engine selection,
etc. Also, many of the 'flags' will be there in execbuf3 also (but
bit position will differ).
But I guess these should be fine as the suggestion here is to
copy-paste the execbuff code and having a shared code where possible.
Besides, we can stop supporting some older feature in execbuff3
(like fence array in favor of newer timeline fences), which will
further reduce common code.

Ok, I will update this series by adding execbuf3 and send out soon.

Niranjana

>-Daniel
>-- 
>Daniel Vetter
>Software Engineer, Intel Corporation
>http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [Intel-gfx] [RFC v3 3/3] drm/doc/rfc: VM_BIND uapi definition
@ 2022-06-02  5:08             ` Niranjana Vishwanathapura
  0 siblings, 0 replies; 121+ messages in thread
From: Niranjana Vishwanathapura @ 2022-06-02  5:08 UTC (permalink / raw)
  To: Daniel Vetter
  Cc: Zanoni, Paulo R, intel-gfx, dri-devel, Hellstrom, Thomas, Wilson,
	Chris P, Vetter, Daniel, Dave Airlie, christian.koenig

On Wed, Jun 01, 2022 at 11:27:17AM +0200, Daniel Vetter wrote:
>On Wed, 1 Jun 2022 at 11:03, Dave Airlie <airlied@gmail.com> wrote:
>>
>> On Tue, 24 May 2022 at 05:20, Niranjana Vishwanathapura
>> <niranjana.vishwanathapura@intel.com> wrote:
>> >
>> > On Thu, May 19, 2022 at 04:07:30PM -0700, Zanoni, Paulo R wrote:
>> > >On Tue, 2022-05-17 at 11:32 -0700, Niranjana Vishwanathapura wrote:
>> > >> VM_BIND and related uapi definitions
>> > >>
>> > >> v2: Ensure proper kernel-doc formatting with cross references.
>> > >>     Also add new uapi and documentation as per review comments
>> > >>     from Daniel.
>> > >>
>> > >> Signed-off-by: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>
>> > >> ---
>> > >>  Documentation/gpu/rfc/i915_vm_bind.h | 399 +++++++++++++++++++++++++++
>> > >>  1 file changed, 399 insertions(+)
>> > >>  create mode 100644 Documentation/gpu/rfc/i915_vm_bind.h
>> > >>
>> > >> diff --git a/Documentation/gpu/rfc/i915_vm_bind.h b/Documentation/gpu/rfc/i915_vm_bind.h
>> > >> new file mode 100644
>> > >> index 000000000000..589c0a009107
>> > >> --- /dev/null
>> > >> +++ b/Documentation/gpu/rfc/i915_vm_bind.h
>> > >> @@ -0,0 +1,399 @@
>> > >> +/* SPDX-License-Identifier: MIT */
>> > >> +/*
>> > >> + * Copyright © 2022 Intel Corporation
>> > >> + */
>> > >> +
>> > >> +/**
>> > >> + * DOC: I915_PARAM_HAS_VM_BIND
>> > >> + *
>> > >> + * VM_BIND feature availability.
>> > >> + * See typedef drm_i915_getparam_t param.
>> > >> + */
>> > >> +#define I915_PARAM_HAS_VM_BIND               57
>> > >> +
>> > >> +/**
>> > >> + * DOC: I915_VM_CREATE_FLAGS_USE_VM_BIND
>> > >> + *
>> > >> + * Flag to opt-in for VM_BIND mode of binding during VM creation.
>> > >> + * See struct drm_i915_gem_vm_control flags.
>> > >> + *
>> > >> + * A VM in VM_BIND mode will not support the older execbuff mode of binding.
>> > >> + * In VM_BIND mode, execbuff ioctl will not accept any execlist (ie., the
>> > >> + * &drm_i915_gem_execbuffer2.buffer_count must be 0).
>> > >> + * Also, &drm_i915_gem_execbuffer2.batch_start_offset and
>> > >> + * &drm_i915_gem_execbuffer2.batch_len must be 0.
>> > >> + * DRM_I915_GEM_EXECBUFFER_EXT_BATCH_ADDRESSES extension must be provided
>> > >> + * to pass in the batch buffer addresses.
>> > >> + *
>> > >> + * Additionally, I915_EXEC_NO_RELOC, I915_EXEC_HANDLE_LUT and
>> > >> + * I915_EXEC_BATCH_FIRST of &drm_i915_gem_execbuffer2.flags must be 0
>> > >> + * (not used) in VM_BIND mode. I915_EXEC_USE_EXTENSIONS flag must always be
>> > >> + * set (See struct drm_i915_gem_execbuffer_ext_batch_addresses).
>> > >> + * The buffers_ptr, buffer_count, batch_start_offset and batch_len fields
>> > >> + * of struct drm_i915_gem_execbuffer2 are also not used and must be 0.
>> > >> + */
>> > >
>> > >From that description, it seems we have:
>> > >
>> > >struct drm_i915_gem_execbuffer2 {
>> > >        __u64 buffers_ptr;              -> must be 0 (new)
>> > >        __u32 buffer_count;             -> must be 0 (new)
>> > >        __u32 batch_start_offset;       -> must be 0 (new)
>> > >        __u32 batch_len;                -> must be 0 (new)
>> > >        __u32 DR1;                      -> must be 0 (old)
>> > >        __u32 DR4;                      -> must be 0 (old)
>> > >        __u32 num_cliprects; (fences)   -> must be 0 since using extensions
>> > >        __u64 cliprects_ptr; (fences, extensions) -> contains an actual pointer!
>> > >        __u64 flags;                    -> some flags must be 0 (new)
>> > >        __u64 rsvd1; (context info)     -> repurposed field (old)
>> > >        __u64 rsvd2;                    -> unused
>> > >};
>> > >
>> > >Based on that, why can't we just get drm_i915_gem_execbuffer3 instead
>> > >of adding even more complexity to an already abused interface? While
>> > >the Vulkan-like extension thing is really nice, I don't think what
>> > >we're doing here is extending the ioctl usage, we're completely
>> > >changing how the base struct should be interpreted based on how the VM
>> > >was created (which is an entirely different ioctl).
>> > >
>> > >From Rusty Russel's API Design grading, drm_i915_gem_execbuffer2 is
>> > >already at -6 without these changes. I think after vm_bind we'll need
>> > >to create a -11 entry just to deal with this ioctl.
>> > >
>> >
>> > The only change here is removing the execlist support for VM_BIND
>> > mode (other than natual extensions).
>> > Adding a new execbuffer3 was considered, but I think we need to be careful
>> > with that as that goes beyond the VM_BIND support, including any future
>> > requirements (as we don't want an execbuffer4 after VM_BIND).
>>
>> Why not? it's not like adding extensions here is really that different
>> than adding new ioctls.
>>
>> I definitely think this deserves an execbuffer3 without even
>> considering future requirements. Just  to burn down the old
>> requirements and pointless fields.
>>
>> Make execbuffer3 be vm bind only, no relocs, no legacy bits, leave the
>> older sw on execbuf2 for ever.
>
>I guess another point in favour of execbuf3 would be that it's less
>midlayer. If we share the entry point then there's quite a few vfuncs
>needed to cleanly split out the vm_bind paths from the legacy
>reloc/softping paths.
>
>If we invert this and do execbuf3, then there's the existing ioctl
>vfunc, and then we share code (where it even makes sense, probably
>request setup/submit need to be shared, anything else is probably
>cleaner to just copypaste) with the usual helper approach.
>
>Also that would guarantee that really none of the old concepts like
>i915_active on the vma or vma open counts and all that stuff leaks
>into the new vm_bind execbuf.
>
>Finally I also think that copypasting would make backporting easier,
>or at least more flexible, since it should make it easier to have the
>upstream vm_bind co-exist with all the other things we have. Without
>huge amounts of conflicts (or at least much less) that pushing a pile
>of vfuncs into the existing code would cause.
>
>So maybe we should do this?

Thanks Dave, Daniel.
There are a few things that will be common between execbuf2 and
execbuf3, like request setup/submit (as you said), fence handling 
(timeline fences, fence array, composite fences), engine selection,
etc. Also, many of the 'flags' will be there in execbuf3 also (but
bit position will differ).
But I guess these should be fine as the suggestion here is to
copy-paste the execbuff code and having a shared code where possible.
Besides, we can stop supporting some older feature in execbuff3
(like fence array in favor of newer timeline fences), which will
further reduce common code.

Ok, I will update this series by adding execbuf3 and send out soon.

Niranjana

>-Daniel
>-- 
>Daniel Vetter
>Software Engineer, Intel Corporation
>http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [Intel-gfx] [RFC v3 1/3] drm/doc/rfc: VM_BIND feature design document
  2022-06-01 21:18       ` Matthew Brost
@ 2022-06-02  5:42         ` Lionel Landwerlin
  -1 siblings, 0 replies; 121+ messages in thread
From: Lionel Landwerlin @ 2022-06-02  5:42 UTC (permalink / raw)
  To: Matthew Brost
  Cc: intel-gfx, dri-devel, thomas.hellstrom, chris.p.wilson,
	daniel.vetter, Niranjana Vishwanathapura, christian.koenig

On 02/06/2022 00:18, Matthew Brost wrote:
> On Wed, Jun 01, 2022 at 05:25:49PM +0300, Lionel Landwerlin wrote:
>> On 17/05/2022 21:32, Niranjana Vishwanathapura wrote:
>>> +VM_BIND/UNBIND ioctl will immediately start binding/unbinding the mapping in an
>>> +async worker. The binding and unbinding will work like a special GPU engine.
>>> +The binding and unbinding operations are serialized and will wait on specified
>>> +input fences before the operation and will signal the output fences upon the
>>> +completion of the operation. Due to serialization, completion of an operation
>>> +will also indicate that all previous operations are also complete.
>> I guess we should avoid saying "will immediately start binding/unbinding" if
>> there are fences involved.
>>
>> And the fact that it's happening in an async worker seem to imply it's not
>> immediate.
>>
>>
>> I have a question on the behavior of the bind operation when no input fence
>> is provided. Let say I do :
>>
>> VM_BIND (out_fence=fence1)
>>
>> VM_BIND (out_fence=fence2)
>>
>> VM_BIND (out_fence=fence3)
>>
>>
>> In what order are the fences going to be signaled?
>>
>> In the order of VM_BIND ioctls? Or out of order?
>>
>> Because you wrote "serialized I assume it's : in order
>>
>>
>> One thing I didn't realize is that because we only get one "VM_BIND" engine,
>> there is a disconnect from the Vulkan specification.
>>
>> In Vulkan VM_BIND operations are serialized but per engine.
>>
>> So you could have something like this :
>>
>> VM_BIND (engine=rcs0, in_fence=fence1, out_fence=fence2)
>>
>> VM_BIND (engine=ccs0, in_fence=fence3, out_fence=fence4)
>>
> Question - let's say this done after the above operations:
>
> EXEC (engine=ccs0, in_fence=NULL, out_fence=NULL)
>
> Is the exec ordered with respected to bind (i.e. would fence3 & 4 be
> signaled before the exec starts)?
>
> Matt


Hi Matt,

 From the vulkan point of view, everything is serialized within an 
engine (we map that to a VkQueue).

So with :

EXEC (engine=ccs0, in_fence=NULL, out_fence=NULL)
VM_BIND (engine=ccs0, in_fence=fence3, out_fence=fence4)

EXEC completes first then VM_BIND executes.


To be even clearer :

EXEC (engine=ccs0, in_fence=fence2, out_fence=NULL)
VM_BIND (engine=ccs0, in_fence=fence3, out_fence=fence4)


EXEC will wait until fence2 is signaled.
Once fence2 is signaled, EXEC proceeds, finishes and only after it is done, VM_BIND executes.

It would kind of like having the VM_BIND operation be another batch executed from the ringbuffer buffer.

-Lionel


>
>> fence1 is not signaled
>>
>> fence3 is signaled
>>
>> So the second VM_BIND will proceed before the first VM_BIND.
>>
>>
>> I guess we can deal with that scenario in userspace by doing the wait
>> ourselves in one thread per engines.
>>
>> But then it makes the VM_BIND input fences useless.
>>
>>
>> Daniel : what do you think? Should be rework this or just deal with wait
>> fences in userspace?
>>
>>
>> Sorry I noticed this late.
>>
>>
>> -Lionel
>>
>>


^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [Intel-gfx] [RFC v3 1/3] drm/doc/rfc: VM_BIND feature design document
@ 2022-06-02  5:42         ` Lionel Landwerlin
  0 siblings, 0 replies; 121+ messages in thread
From: Lionel Landwerlin @ 2022-06-02  5:42 UTC (permalink / raw)
  To: Matthew Brost
  Cc: intel-gfx, dri-devel, thomas.hellstrom, chris.p.wilson,
	daniel.vetter, christian.koenig

On 02/06/2022 00:18, Matthew Brost wrote:
> On Wed, Jun 01, 2022 at 05:25:49PM +0300, Lionel Landwerlin wrote:
>> On 17/05/2022 21:32, Niranjana Vishwanathapura wrote:
>>> +VM_BIND/UNBIND ioctl will immediately start binding/unbinding the mapping in an
>>> +async worker. The binding and unbinding will work like a special GPU engine.
>>> +The binding and unbinding operations are serialized and will wait on specified
>>> +input fences before the operation and will signal the output fences upon the
>>> +completion of the operation. Due to serialization, completion of an operation
>>> +will also indicate that all previous operations are also complete.
>> I guess we should avoid saying "will immediately start binding/unbinding" if
>> there are fences involved.
>>
>> And the fact that it's happening in an async worker seem to imply it's not
>> immediate.
>>
>>
>> I have a question on the behavior of the bind operation when no input fence
>> is provided. Let say I do :
>>
>> VM_BIND (out_fence=fence1)
>>
>> VM_BIND (out_fence=fence2)
>>
>> VM_BIND (out_fence=fence3)
>>
>>
>> In what order are the fences going to be signaled?
>>
>> In the order of VM_BIND ioctls? Or out of order?
>>
>> Because you wrote "serialized I assume it's : in order
>>
>>
>> One thing I didn't realize is that because we only get one "VM_BIND" engine,
>> there is a disconnect from the Vulkan specification.
>>
>> In Vulkan VM_BIND operations are serialized but per engine.
>>
>> So you could have something like this :
>>
>> VM_BIND (engine=rcs0, in_fence=fence1, out_fence=fence2)
>>
>> VM_BIND (engine=ccs0, in_fence=fence3, out_fence=fence4)
>>
> Question - let's say this done after the above operations:
>
> EXEC (engine=ccs0, in_fence=NULL, out_fence=NULL)
>
> Is the exec ordered with respected to bind (i.e. would fence3 & 4 be
> signaled before the exec starts)?
>
> Matt


Hi Matt,

 From the vulkan point of view, everything is serialized within an 
engine (we map that to a VkQueue).

So with :

EXEC (engine=ccs0, in_fence=NULL, out_fence=NULL)
VM_BIND (engine=ccs0, in_fence=fence3, out_fence=fence4)

EXEC completes first then VM_BIND executes.


To be even clearer :

EXEC (engine=ccs0, in_fence=fence2, out_fence=NULL)
VM_BIND (engine=ccs0, in_fence=fence3, out_fence=fence4)


EXEC will wait until fence2 is signaled.
Once fence2 is signaled, EXEC proceeds, finishes and only after it is done, VM_BIND executes.

It would kind of like having the VM_BIND operation be another batch executed from the ringbuffer buffer.

-Lionel


>
>> fence1 is not signaled
>>
>> fence3 is signaled
>>
>> So the second VM_BIND will proceed before the first VM_BIND.
>>
>>
>> I guess we can deal with that scenario in userspace by doing the wait
>> ourselves in one thread per engines.
>>
>> But then it makes the VM_BIND input fences useless.
>>
>>
>> Daniel : what do you think? Should be rework this or just deal with wait
>> fences in userspace?
>>
>>
>> Sorry I noticed this late.
>>
>>
>> -Lionel
>>
>>


^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [Intel-gfx] [RFC v3 1/3] drm/doc/rfc: VM_BIND feature design document
  2022-06-02  5:42         ` Lionel Landwerlin
@ 2022-06-02 16:22           ` Matthew Brost
  -1 siblings, 0 replies; 121+ messages in thread
From: Matthew Brost @ 2022-06-02 16:22 UTC (permalink / raw)
  To: Lionel Landwerlin
  Cc: intel-gfx, dri-devel, thomas.hellstrom, chris.p.wilson,
	daniel.vetter, Niranjana Vishwanathapura, christian.koenig

On Thu, Jun 02, 2022 at 08:42:13AM +0300, Lionel Landwerlin wrote:
> On 02/06/2022 00:18, Matthew Brost wrote:
> > On Wed, Jun 01, 2022 at 05:25:49PM +0300, Lionel Landwerlin wrote:
> > > On 17/05/2022 21:32, Niranjana Vishwanathapura wrote:
> > > > +VM_BIND/UNBIND ioctl will immediately start binding/unbinding the mapping in an
> > > > +async worker. The binding and unbinding will work like a special GPU engine.
> > > > +The binding and unbinding operations are serialized and will wait on specified
> > > > +input fences before the operation and will signal the output fences upon the
> > > > +completion of the operation. Due to serialization, completion of an operation
> > > > +will also indicate that all previous operations are also complete.
> > > I guess we should avoid saying "will immediately start binding/unbinding" if
> > > there are fences involved.
> > > 
> > > And the fact that it's happening in an async worker seem to imply it's not
> > > immediate.
> > > 
> > > 
> > > I have a question on the behavior of the bind operation when no input fence
> > > is provided. Let say I do :
> > > 
> > > VM_BIND (out_fence=fence1)
> > > 
> > > VM_BIND (out_fence=fence2)
> > > 
> > > VM_BIND (out_fence=fence3)
> > > 
> > > 
> > > In what order are the fences going to be signaled?
> > > 
> > > In the order of VM_BIND ioctls? Or out of order?
> > > 
> > > Because you wrote "serialized I assume it's : in order
> > > 
> > > 
> > > One thing I didn't realize is that because we only get one "VM_BIND" engine,
> > > there is a disconnect from the Vulkan specification.
> > > 
> > > In Vulkan VM_BIND operations are serialized but per engine.
> > > 
> > > So you could have something like this :
> > > 
> > > VM_BIND (engine=rcs0, in_fence=fence1, out_fence=fence2)
> > > 
> > > VM_BIND (engine=ccs0, in_fence=fence3, out_fence=fence4)
> > > 
> > Question - let's say this done after the above operations:
> > 
> > EXEC (engine=ccs0, in_fence=NULL, out_fence=NULL)
> > 
> > Is the exec ordered with respected to bind (i.e. would fence3 & 4 be
> > signaled before the exec starts)?
> > 
> > Matt
> 
> 
> Hi Matt,
> 
> From the vulkan point of view, everything is serialized within an engine (we
> map that to a VkQueue).
> 
> So with :
> 
> EXEC (engine=ccs0, in_fence=NULL, out_fence=NULL)
> VM_BIND (engine=ccs0, in_fence=fence3, out_fence=fence4)
> 
> EXEC completes first then VM_BIND executes.
> 
> 
> To be even clearer :
> 
> EXEC (engine=ccs0, in_fence=fence2, out_fence=NULL)
> VM_BIND (engine=ccs0, in_fence=fence3, out_fence=fence4)
> 
> 
> EXEC will wait until fence2 is signaled.
> Once fence2 is signaled, EXEC proceeds, finishes and only after it is done, VM_BIND executes.
> 
> It would kind of like having the VM_BIND operation be another batch executed from the ringbuffer buffer.
> 

Yea this makes sense. I think of VM_BINDs as more or less just another
version of an EXEC and this fits with that.

In practice I don't think we can share a ring but we should be able to
present an engine (again likely a gem context in i915) to the user that
orders VM_BINDs / EXECs if that is what Vulkan expects, at least I think.

Hopefully Niranjana + Daniel agree.

Matt

> -Lionel
> 
> 
> > 
> > > fence1 is not signaled
> > > 
> > > fence3 is signaled
> > > 
> > > So the second VM_BIND will proceed before the first VM_BIND.
> > > 
> > > 
> > > I guess we can deal with that scenario in userspace by doing the wait
> > > ourselves in one thread per engines.
> > > 
> > > But then it makes the VM_BIND input fences useless.
> > > 
> > > 
> > > Daniel : what do you think? Should be rework this or just deal with wait
> > > fences in userspace?
> > > 
> > > 
> > > Sorry I noticed this late.
> > > 
> > > 
> > > -Lionel
> > > 
> > > 
> 

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [Intel-gfx] [RFC v3 1/3] drm/doc/rfc: VM_BIND feature design document
@ 2022-06-02 16:22           ` Matthew Brost
  0 siblings, 0 replies; 121+ messages in thread
From: Matthew Brost @ 2022-06-02 16:22 UTC (permalink / raw)
  To: Lionel Landwerlin
  Cc: intel-gfx, dri-devel, thomas.hellstrom, chris.p.wilson,
	daniel.vetter, christian.koenig

On Thu, Jun 02, 2022 at 08:42:13AM +0300, Lionel Landwerlin wrote:
> On 02/06/2022 00:18, Matthew Brost wrote:
> > On Wed, Jun 01, 2022 at 05:25:49PM +0300, Lionel Landwerlin wrote:
> > > On 17/05/2022 21:32, Niranjana Vishwanathapura wrote:
> > > > +VM_BIND/UNBIND ioctl will immediately start binding/unbinding the mapping in an
> > > > +async worker. The binding and unbinding will work like a special GPU engine.
> > > > +The binding and unbinding operations are serialized and will wait on specified
> > > > +input fences before the operation and will signal the output fences upon the
> > > > +completion of the operation. Due to serialization, completion of an operation
> > > > +will also indicate that all previous operations are also complete.
> > > I guess we should avoid saying "will immediately start binding/unbinding" if
> > > there are fences involved.
> > > 
> > > And the fact that it's happening in an async worker seem to imply it's not
> > > immediate.
> > > 
> > > 
> > > I have a question on the behavior of the bind operation when no input fence
> > > is provided. Let say I do :
> > > 
> > > VM_BIND (out_fence=fence1)
> > > 
> > > VM_BIND (out_fence=fence2)
> > > 
> > > VM_BIND (out_fence=fence3)
> > > 
> > > 
> > > In what order are the fences going to be signaled?
> > > 
> > > In the order of VM_BIND ioctls? Or out of order?
> > > 
> > > Because you wrote "serialized I assume it's : in order
> > > 
> > > 
> > > One thing I didn't realize is that because we only get one "VM_BIND" engine,
> > > there is a disconnect from the Vulkan specification.
> > > 
> > > In Vulkan VM_BIND operations are serialized but per engine.
> > > 
> > > So you could have something like this :
> > > 
> > > VM_BIND (engine=rcs0, in_fence=fence1, out_fence=fence2)
> > > 
> > > VM_BIND (engine=ccs0, in_fence=fence3, out_fence=fence4)
> > > 
> > Question - let's say this done after the above operations:
> > 
> > EXEC (engine=ccs0, in_fence=NULL, out_fence=NULL)
> > 
> > Is the exec ordered with respected to bind (i.e. would fence3 & 4 be
> > signaled before the exec starts)?
> > 
> > Matt
> 
> 
> Hi Matt,
> 
> From the vulkan point of view, everything is serialized within an engine (we
> map that to a VkQueue).
> 
> So with :
> 
> EXEC (engine=ccs0, in_fence=NULL, out_fence=NULL)
> VM_BIND (engine=ccs0, in_fence=fence3, out_fence=fence4)
> 
> EXEC completes first then VM_BIND executes.
> 
> 
> To be even clearer :
> 
> EXEC (engine=ccs0, in_fence=fence2, out_fence=NULL)
> VM_BIND (engine=ccs0, in_fence=fence3, out_fence=fence4)
> 
> 
> EXEC will wait until fence2 is signaled.
> Once fence2 is signaled, EXEC proceeds, finishes and only after it is done, VM_BIND executes.
> 
> It would kind of like having the VM_BIND operation be another batch executed from the ringbuffer buffer.
> 

Yea this makes sense. I think of VM_BINDs as more or less just another
version of an EXEC and this fits with that.

In practice I don't think we can share a ring but we should be able to
present an engine (again likely a gem context in i915) to the user that
orders VM_BINDs / EXECs if that is what Vulkan expects, at least I think.

Hopefully Niranjana + Daniel agree.

Matt

> -Lionel
> 
> 
> > 
> > > fence1 is not signaled
> > > 
> > > fence3 is signaled
> > > 
> > > So the second VM_BIND will proceed before the first VM_BIND.
> > > 
> > > 
> > > I guess we can deal with that scenario in userspace by doing the wait
> > > ourselves in one thread per engines.
> > > 
> > > But then it makes the VM_BIND input fences useless.
> > > 
> > > 
> > > Daniel : what do you think? Should be rework this or just deal with wait
> > > fences in userspace?
> > > 
> > > 
> > > Sorry I noticed this late.
> > > 
> > > 
> > > -Lionel
> > > 
> > > 
> 

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [Intel-gfx] [RFC v3 1/3] drm/doc/rfc: VM_BIND feature design document
  2022-06-01 20:28       ` Matthew Brost
@ 2022-06-02 20:11         ` Niranjana Vishwanathapura
  -1 siblings, 0 replies; 121+ messages in thread
From: Niranjana Vishwanathapura @ 2022-06-02 20:11 UTC (permalink / raw)
  To: Matthew Brost
  Cc: chris.p.wilson, intel-gfx, dri-devel, thomas.hellstrom,
	Lionel Landwerlin, daniel.vetter, christian.koenig

On Wed, Jun 01, 2022 at 01:28:36PM -0700, Matthew Brost wrote:
>On Wed, Jun 01, 2022 at 05:25:49PM +0300, Lionel Landwerlin wrote:
>> On 17/05/2022 21:32, Niranjana Vishwanathapura wrote:
>> > +VM_BIND/UNBIND ioctl will immediately start binding/unbinding the mapping in an
>> > +async worker. The binding and unbinding will work like a special GPU engine.
>> > +The binding and unbinding operations are serialized and will wait on specified
>> > +input fences before the operation and will signal the output fences upon the
>> > +completion of the operation. Due to serialization, completion of an operation
>> > +will also indicate that all previous operations are also complete.
>>
>> I guess we should avoid saying "will immediately start binding/unbinding" if
>> there are fences involved.
>>
>> And the fact that it's happening in an async worker seem to imply it's not
>> immediate.
>>

Ok, will fix.
This was added because in earlier design binding was deferred until next execbuff.
But now it is non-deferred (immediate in that sense). But yah, this is confusing
and will fix it.

>>
>> I have a question on the behavior of the bind operation when no input fence
>> is provided. Let say I do :
>>
>> VM_BIND (out_fence=fence1)
>>
>> VM_BIND (out_fence=fence2)
>>
>> VM_BIND (out_fence=fence3)
>>
>>
>> In what order are the fences going to be signaled?
>>
>> In the order of VM_BIND ioctls? Or out of order?
>>
>> Because you wrote "serialized I assume it's : in order
>>

Yes, in the order of VM_BIND/UNBIND ioctls. Note that bind and unbind will use
the same queue and hence are ordered.

>>
>> One thing I didn't realize is that because we only get one "VM_BIND" engine,
>> there is a disconnect from the Vulkan specification.
>>
>> In Vulkan VM_BIND operations are serialized but per engine.
>>
>> So you could have something like this :
>>
>> VM_BIND (engine=rcs0, in_fence=fence1, out_fence=fence2)
>>
>> VM_BIND (engine=ccs0, in_fence=fence3, out_fence=fence4)
>>
>>
>> fence1 is not signaled
>>
>> fence3 is signaled
>>
>> So the second VM_BIND will proceed before the first VM_BIND.
>>
>>
>> I guess we can deal with that scenario in userspace by doing the wait
>> ourselves in one thread per engines.
>>
>> But then it makes the VM_BIND input fences useless.
>>
>>
>> Daniel : what do you think? Should be rework this or just deal with wait
>> fences in userspace?
>>
>
>My opinion is rework this but make the ordering via an engine param optional.
>
>e.g. A VM can be configured so all binds are ordered within the VM
>
>e.g. A VM can be configured so all binds accept an engine argument (in
>the case of the i915 likely this is a gem context handle) and binds
>ordered with respect to that engine.
>
>This gives UMDs options as the later likely consumes more KMD resources
>so if a different UMD can live with binds being ordered within the VM
>they can use a mode consuming less resources.
>

I think we need to be careful here if we are looking for some out of
(submission) order completion of vm_bind/unbind.
In-order completion means, in a batch of binds and unbinds to be
completed in-order, user only needs to specify in-fence for the
first bind/unbind call and the our-fence for the last bind/unbind
call. Also, the VA released by an unbind call can be re-used by
any subsequent bind call in that in-order batch.

These things will break if binding/unbinding were to be allowed to
go out of order (of submission) and user need to be extra careful
not to run into pre-mature triggereing of out-fence and bind failing
as VA is still in use etc.

Also, VM_BIND binds the provided mapping on the specified address space
(VM). So, the uapi is not engine/context specific.

We can however add a 'queue' to the uapi which can be one from the
pre-defined queues,
I915_VM_BIND_QUEUE_0
I915_VM_BIND_QUEUE_1
...
I915_VM_BIND_QUEUE_(N-1)

KMD will spawn an async work queue for each queue which will only
bind the mappings on that queue in the order of submission.
User can assign the queue to per engine or anything like that.

But again here, user need to be careful and not deadlock these
queues with circular dependency of fences.

I prefer adding this later an as extension based on whether it
is really helping with the implementation.

Daniel, any thoughts?

Niranjana

>Matt
>
>>
>> Sorry I noticed this late.
>>
>>
>> -Lionel
>>
>>

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [Intel-gfx] [RFC v3 1/3] drm/doc/rfc: VM_BIND feature design document
@ 2022-06-02 20:11         ` Niranjana Vishwanathapura
  0 siblings, 0 replies; 121+ messages in thread
From: Niranjana Vishwanathapura @ 2022-06-02 20:11 UTC (permalink / raw)
  To: Matthew Brost
  Cc: chris.p.wilson, intel-gfx, dri-devel, thomas.hellstrom,
	daniel.vetter, christian.koenig

On Wed, Jun 01, 2022 at 01:28:36PM -0700, Matthew Brost wrote:
>On Wed, Jun 01, 2022 at 05:25:49PM +0300, Lionel Landwerlin wrote:
>> On 17/05/2022 21:32, Niranjana Vishwanathapura wrote:
>> > +VM_BIND/UNBIND ioctl will immediately start binding/unbinding the mapping in an
>> > +async worker. The binding and unbinding will work like a special GPU engine.
>> > +The binding and unbinding operations are serialized and will wait on specified
>> > +input fences before the operation and will signal the output fences upon the
>> > +completion of the operation. Due to serialization, completion of an operation
>> > +will also indicate that all previous operations are also complete.
>>
>> I guess we should avoid saying "will immediately start binding/unbinding" if
>> there are fences involved.
>>
>> And the fact that it's happening in an async worker seem to imply it's not
>> immediate.
>>

Ok, will fix.
This was added because in earlier design binding was deferred until next execbuff.
But now it is non-deferred (immediate in that sense). But yah, this is confusing
and will fix it.

>>
>> I have a question on the behavior of the bind operation when no input fence
>> is provided. Let say I do :
>>
>> VM_BIND (out_fence=fence1)
>>
>> VM_BIND (out_fence=fence2)
>>
>> VM_BIND (out_fence=fence3)
>>
>>
>> In what order are the fences going to be signaled?
>>
>> In the order of VM_BIND ioctls? Or out of order?
>>
>> Because you wrote "serialized I assume it's : in order
>>

Yes, in the order of VM_BIND/UNBIND ioctls. Note that bind and unbind will use
the same queue and hence are ordered.

>>
>> One thing I didn't realize is that because we only get one "VM_BIND" engine,
>> there is a disconnect from the Vulkan specification.
>>
>> In Vulkan VM_BIND operations are serialized but per engine.
>>
>> So you could have something like this :
>>
>> VM_BIND (engine=rcs0, in_fence=fence1, out_fence=fence2)
>>
>> VM_BIND (engine=ccs0, in_fence=fence3, out_fence=fence4)
>>
>>
>> fence1 is not signaled
>>
>> fence3 is signaled
>>
>> So the second VM_BIND will proceed before the first VM_BIND.
>>
>>
>> I guess we can deal with that scenario in userspace by doing the wait
>> ourselves in one thread per engines.
>>
>> But then it makes the VM_BIND input fences useless.
>>
>>
>> Daniel : what do you think? Should be rework this or just deal with wait
>> fences in userspace?
>>
>
>My opinion is rework this but make the ordering via an engine param optional.
>
>e.g. A VM can be configured so all binds are ordered within the VM
>
>e.g. A VM can be configured so all binds accept an engine argument (in
>the case of the i915 likely this is a gem context handle) and binds
>ordered with respect to that engine.
>
>This gives UMDs options as the later likely consumes more KMD resources
>so if a different UMD can live with binds being ordered within the VM
>they can use a mode consuming less resources.
>

I think we need to be careful here if we are looking for some out of
(submission) order completion of vm_bind/unbind.
In-order completion means, in a batch of binds and unbinds to be
completed in-order, user only needs to specify in-fence for the
first bind/unbind call and the our-fence for the last bind/unbind
call. Also, the VA released by an unbind call can be re-used by
any subsequent bind call in that in-order batch.

These things will break if binding/unbinding were to be allowed to
go out of order (of submission) and user need to be extra careful
not to run into pre-mature triggereing of out-fence and bind failing
as VA is still in use etc.

Also, VM_BIND binds the provided mapping on the specified address space
(VM). So, the uapi is not engine/context specific.

We can however add a 'queue' to the uapi which can be one from the
pre-defined queues,
I915_VM_BIND_QUEUE_0
I915_VM_BIND_QUEUE_1
...
I915_VM_BIND_QUEUE_(N-1)

KMD will spawn an async work queue for each queue which will only
bind the mappings on that queue in the order of submission.
User can assign the queue to per engine or anything like that.

But again here, user need to be careful and not deadlock these
queues with circular dependency of fences.

I prefer adding this later an as extension based on whether it
is really helping with the implementation.

Daniel, any thoughts?

Niranjana

>Matt
>
>>
>> Sorry I noticed this late.
>>
>>
>> -Lionel
>>
>>

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [Intel-gfx] [RFC v3 1/3] drm/doc/rfc: VM_BIND feature design document
  2022-06-02  5:42         ` Lionel Landwerlin
@ 2022-06-02 20:16           ` Bas Nieuwenhuizen
  -1 siblings, 0 replies; 121+ messages in thread
From: Bas Nieuwenhuizen @ 2022-06-02 20:16 UTC (permalink / raw)
  To: Lionel Landwerlin
  Cc: Matthew Brost, Intel Graphics Development, chris.p.wilson,
	Thomas Hellström, ML dri-devel, Daniel Vetter,
	Niranjana Vishwanathapura, Koenig, Christian

On Thu, Jun 2, 2022 at 7:42 AM Lionel Landwerlin
<lionel.g.landwerlin@intel.com> wrote:
>
> On 02/06/2022 00:18, Matthew Brost wrote:
> > On Wed, Jun 01, 2022 at 05:25:49PM +0300, Lionel Landwerlin wrote:
> >> On 17/05/2022 21:32, Niranjana Vishwanathapura wrote:
> >>> +VM_BIND/UNBIND ioctl will immediately start binding/unbinding the mapping in an
> >>> +async worker. The binding and unbinding will work like a special GPU engine.
> >>> +The binding and unbinding operations are serialized and will wait on specified
> >>> +input fences before the operation and will signal the output fences upon the
> >>> +completion of the operation. Due to serialization, completion of an operation
> >>> +will also indicate that all previous operations are also complete.
> >> I guess we should avoid saying "will immediately start binding/unbinding" if
> >> there are fences involved.
> >>
> >> And the fact that it's happening in an async worker seem to imply it's not
> >> immediate.
> >>
> >>
> >> I have a question on the behavior of the bind operation when no input fence
> >> is provided. Let say I do :
> >>
> >> VM_BIND (out_fence=fence1)
> >>
> >> VM_BIND (out_fence=fence2)
> >>
> >> VM_BIND (out_fence=fence3)
> >>
> >>
> >> In what order are the fences going to be signaled?
> >>
> >> In the order of VM_BIND ioctls? Or out of order?
> >>
> >> Because you wrote "serialized I assume it's : in order
> >>
> >>
> >> One thing I didn't realize is that because we only get one "VM_BIND" engine,
> >> there is a disconnect from the Vulkan specification.

Note that in Vulkan not every queue has to support sparse binding, so
one could consider a dedicated sparse binding only queue family.

> >>
> >> In Vulkan VM_BIND operations are serialized but per engine.
> >>
> >> So you could have something like this :
> >>
> >> VM_BIND (engine=rcs0, in_fence=fence1, out_fence=fence2)
> >>
> >> VM_BIND (engine=ccs0, in_fence=fence3, out_fence=fence4)
> >>
> > Question - let's say this done after the above operations:
> >
> > EXEC (engine=ccs0, in_fence=NULL, out_fence=NULL)
> >
> > Is the exec ordered with respected to bind (i.e. would fence3 & 4 be
> > signaled before the exec starts)?
> >
> > Matt
>
>
> Hi Matt,
>
>  From the vulkan point of view, everything is serialized within an
> engine (we map that to a VkQueue).
>
> So with :
>
> EXEC (engine=ccs0, in_fence=NULL, out_fence=NULL)
> VM_BIND (engine=ccs0, in_fence=fence3, out_fence=fence4)
>
> EXEC completes first then VM_BIND executes.
>
>
> To be even clearer :
>
> EXEC (engine=ccs0, in_fence=fence2, out_fence=NULL)
> VM_BIND (engine=ccs0, in_fence=fence3, out_fence=fence4)
>
>
> EXEC will wait until fence2 is signaled.
> Once fence2 is signaled, EXEC proceeds, finishes and only after it is done, VM_BIND executes.
>
> It would kind of like having the VM_BIND operation be another batch executed from the ringbuffer buffer.
>
> -Lionel
>
>
> >
> >> fence1 is not signaled
> >>
> >> fence3 is signaled
> >>
> >> So the second VM_BIND will proceed before the first VM_BIND.
> >>
> >>
> >> I guess we can deal with that scenario in userspace by doing the wait
> >> ourselves in one thread per engines.
> >>
> >> But then it makes the VM_BIND input fences useless.

I posed the same question on my series for AMD
(https://patchwork.freedesktop.org/series/104578/), albeit for
slightly different reasons.: if one creates a new VkMemory object, you
generally want that mapped ASAP, as you can't track (in a
VK_KHR_descriptor_indexing world) whether the next submit is going to
use this VkMemory object and hence have to assume the worst (i.e. wait
till the map/bind is complete before executing the next submission).
If all binds/unbinds (or maps/unmaps) happen in-order that means an
operation with input fences could delay stuff we want ASAP.

Of course waiting in userspace does have disadvantages:

1) more overhead between fence signalling and the operation,
potentially causing slightly bigger GPU bubbles.
2) You can't get an out fence early. Within the driver we can mostly
work around this but sync_fd exports, WSI and such will be messy.
3) moving the queue to a thread might make things slightly less ideal
due to scheduling delays.

Removing the in-order working in the kernel generally seems like
madness to me as it is very hard to keep track of the state of the
virtual address space (to e.g. track umapping stuff before freeing
memory or moving memory around)

the one game I tried (FH5 over vkd3d-proton) does sparse mapping as follows:

separate queue:
1) 0 cmdbuffer submit with 0 input semaphores and 1 output semaphore
2) sparse bind with input semaphore from 1 and 1 output semaphore
3) 0 cmdbuffer submit with input semaphore from 2 and 1 output fence
4) wait on that fence on the CPU

which works very well if we just wait for the sparse bind input
semaphore in userspace, but I'm still working on seeing if this is the
common usecase or an outlier.



> >>
> >>
> >> Daniel : what do you think? Should be rework this or just deal with wait
> >> fences in userspace?
> >>
> >>
> >> Sorry I noticed this late.
> >>
> >>
> >> -Lionel
> >>
> >>
>

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [Intel-gfx] [RFC v3 1/3] drm/doc/rfc: VM_BIND feature design document
@ 2022-06-02 20:16           ` Bas Nieuwenhuizen
  0 siblings, 0 replies; 121+ messages in thread
From: Bas Nieuwenhuizen @ 2022-06-02 20:16 UTC (permalink / raw)
  To: Lionel Landwerlin
  Cc: Intel Graphics Development, chris.p.wilson,
	Thomas Hellström, ML dri-devel, Daniel Vetter, Koenig,
	Christian

On Thu, Jun 2, 2022 at 7:42 AM Lionel Landwerlin
<lionel.g.landwerlin@intel.com> wrote:
>
> On 02/06/2022 00:18, Matthew Brost wrote:
> > On Wed, Jun 01, 2022 at 05:25:49PM +0300, Lionel Landwerlin wrote:
> >> On 17/05/2022 21:32, Niranjana Vishwanathapura wrote:
> >>> +VM_BIND/UNBIND ioctl will immediately start binding/unbinding the mapping in an
> >>> +async worker. The binding and unbinding will work like a special GPU engine.
> >>> +The binding and unbinding operations are serialized and will wait on specified
> >>> +input fences before the operation and will signal the output fences upon the
> >>> +completion of the operation. Due to serialization, completion of an operation
> >>> +will also indicate that all previous operations are also complete.
> >> I guess we should avoid saying "will immediately start binding/unbinding" if
> >> there are fences involved.
> >>
> >> And the fact that it's happening in an async worker seem to imply it's not
> >> immediate.
> >>
> >>
> >> I have a question on the behavior of the bind operation when no input fence
> >> is provided. Let say I do :
> >>
> >> VM_BIND (out_fence=fence1)
> >>
> >> VM_BIND (out_fence=fence2)
> >>
> >> VM_BIND (out_fence=fence3)
> >>
> >>
> >> In what order are the fences going to be signaled?
> >>
> >> In the order of VM_BIND ioctls? Or out of order?
> >>
> >> Because you wrote "serialized I assume it's : in order
> >>
> >>
> >> One thing I didn't realize is that because we only get one "VM_BIND" engine,
> >> there is a disconnect from the Vulkan specification.

Note that in Vulkan not every queue has to support sparse binding, so
one could consider a dedicated sparse binding only queue family.

> >>
> >> In Vulkan VM_BIND operations are serialized but per engine.
> >>
> >> So you could have something like this :
> >>
> >> VM_BIND (engine=rcs0, in_fence=fence1, out_fence=fence2)
> >>
> >> VM_BIND (engine=ccs0, in_fence=fence3, out_fence=fence4)
> >>
> > Question - let's say this done after the above operations:
> >
> > EXEC (engine=ccs0, in_fence=NULL, out_fence=NULL)
> >
> > Is the exec ordered with respected to bind (i.e. would fence3 & 4 be
> > signaled before the exec starts)?
> >
> > Matt
>
>
> Hi Matt,
>
>  From the vulkan point of view, everything is serialized within an
> engine (we map that to a VkQueue).
>
> So with :
>
> EXEC (engine=ccs0, in_fence=NULL, out_fence=NULL)
> VM_BIND (engine=ccs0, in_fence=fence3, out_fence=fence4)
>
> EXEC completes first then VM_BIND executes.
>
>
> To be even clearer :
>
> EXEC (engine=ccs0, in_fence=fence2, out_fence=NULL)
> VM_BIND (engine=ccs0, in_fence=fence3, out_fence=fence4)
>
>
> EXEC will wait until fence2 is signaled.
> Once fence2 is signaled, EXEC proceeds, finishes and only after it is done, VM_BIND executes.
>
> It would kind of like having the VM_BIND operation be another batch executed from the ringbuffer buffer.
>
> -Lionel
>
>
> >
> >> fence1 is not signaled
> >>
> >> fence3 is signaled
> >>
> >> So the second VM_BIND will proceed before the first VM_BIND.
> >>
> >>
> >> I guess we can deal with that scenario in userspace by doing the wait
> >> ourselves in one thread per engines.
> >>
> >> But then it makes the VM_BIND input fences useless.

I posed the same question on my series for AMD
(https://patchwork.freedesktop.org/series/104578/), albeit for
slightly different reasons.: if one creates a new VkMemory object, you
generally want that mapped ASAP, as you can't track (in a
VK_KHR_descriptor_indexing world) whether the next submit is going to
use this VkMemory object and hence have to assume the worst (i.e. wait
till the map/bind is complete before executing the next submission).
If all binds/unbinds (or maps/unmaps) happen in-order that means an
operation with input fences could delay stuff we want ASAP.

Of course waiting in userspace does have disadvantages:

1) more overhead between fence signalling and the operation,
potentially causing slightly bigger GPU bubbles.
2) You can't get an out fence early. Within the driver we can mostly
work around this but sync_fd exports, WSI and such will be messy.
3) moving the queue to a thread might make things slightly less ideal
due to scheduling delays.

Removing the in-order working in the kernel generally seems like
madness to me as it is very hard to keep track of the state of the
virtual address space (to e.g. track umapping stuff before freeing
memory or moving memory around)

the one game I tried (FH5 over vkd3d-proton) does sparse mapping as follows:

separate queue:
1) 0 cmdbuffer submit with 0 input semaphores and 1 output semaphore
2) sparse bind with input semaphore from 1 and 1 output semaphore
3) 0 cmdbuffer submit with input semaphore from 2 and 1 output fence
4) wait on that fence on the CPU

which works very well if we just wait for the sparse bind input
semaphore in userspace, but I'm still working on seeing if this is the
common usecase or an outlier.



> >>
> >>
> >> Daniel : what do you think? Should be rework this or just deal with wait
> >> fences in userspace?
> >>
> >>
> >> Sorry I noticed this late.
> >>
> >>
> >> -Lionel
> >>
> >>
>

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [Intel-gfx] [RFC v3 1/3] drm/doc/rfc: VM_BIND feature design document
  2022-06-02 16:22           ` Matthew Brost
@ 2022-06-02 20:24             ` Niranjana Vishwanathapura
  -1 siblings, 0 replies; 121+ messages in thread
From: Niranjana Vishwanathapura @ 2022-06-02 20:24 UTC (permalink / raw)
  To: Matthew Brost
  Cc: chris.p.wilson, intel-gfx, dri-devel, thomas.hellstrom,
	Lionel Landwerlin, daniel.vetter, christian.koenig

On Thu, Jun 02, 2022 at 09:22:46AM -0700, Matthew Brost wrote:
>On Thu, Jun 02, 2022 at 08:42:13AM +0300, Lionel Landwerlin wrote:
>> On 02/06/2022 00:18, Matthew Brost wrote:
>> > On Wed, Jun 01, 2022 at 05:25:49PM +0300, Lionel Landwerlin wrote:
>> > > On 17/05/2022 21:32, Niranjana Vishwanathapura wrote:
>> > > > +VM_BIND/UNBIND ioctl will immediately start binding/unbinding the mapping in an
>> > > > +async worker. The binding and unbinding will work like a special GPU engine.
>> > > > +The binding and unbinding operations are serialized and will wait on specified
>> > > > +input fences before the operation and will signal the output fences upon the
>> > > > +completion of the operation. Due to serialization, completion of an operation
>> > > > +will also indicate that all previous operations are also complete.
>> > > I guess we should avoid saying "will immediately start binding/unbinding" if
>> > > there are fences involved.
>> > >
>> > > And the fact that it's happening in an async worker seem to imply it's not
>> > > immediate.
>> > >
>> > >
>> > > I have a question on the behavior of the bind operation when no input fence
>> > > is provided. Let say I do :
>> > >
>> > > VM_BIND (out_fence=fence1)
>> > >
>> > > VM_BIND (out_fence=fence2)
>> > >
>> > > VM_BIND (out_fence=fence3)
>> > >
>> > >
>> > > In what order are the fences going to be signaled?
>> > >
>> > > In the order of VM_BIND ioctls? Or out of order?
>> > >
>> > > Because you wrote "serialized I assume it's : in order
>> > >
>> > >
>> > > One thing I didn't realize is that because we only get one "VM_BIND" engine,
>> > > there is a disconnect from the Vulkan specification.
>> > >
>> > > In Vulkan VM_BIND operations are serialized but per engine.
>> > >
>> > > So you could have something like this :
>> > >
>> > > VM_BIND (engine=rcs0, in_fence=fence1, out_fence=fence2)
>> > >
>> > > VM_BIND (engine=ccs0, in_fence=fence3, out_fence=fence4)
>> > >
>> > Question - let's say this done after the above operations:
>> >
>> > EXEC (engine=ccs0, in_fence=NULL, out_fence=NULL)
>> >
>> > Is the exec ordered with respected to bind (i.e. would fence3 & 4 be
>> > signaled before the exec starts)?
>> >
>> > Matt
>>
>>
>> Hi Matt,
>>
>> From the vulkan point of view, everything is serialized within an engine (we
>> map that to a VkQueue).
>>
>> So with :
>>
>> EXEC (engine=ccs0, in_fence=NULL, out_fence=NULL)
>> VM_BIND (engine=ccs0, in_fence=fence3, out_fence=fence4)
>>
>> EXEC completes first then VM_BIND executes.
>>
>>
>> To be even clearer :
>>
>> EXEC (engine=ccs0, in_fence=fence2, out_fence=NULL)
>> VM_BIND (engine=ccs0, in_fence=fence3, out_fence=fence4)
>>
>>
>> EXEC will wait until fence2 is signaled.
>> Once fence2 is signaled, EXEC proceeds, finishes and only after it is done, VM_BIND executes.
>>
>> It would kind of like having the VM_BIND operation be another batch executed from the ringbuffer buffer.
>>
>
>Yea this makes sense. I think of VM_BINDs as more or less just another
>version of an EXEC and this fits with that.
>

Note that VM_BIND itself can bind while and EXEC (GPU job) is running.
(Say, getting binds ready for next submission). It is up to user though,
how to use it.

>In practice I don't think we can share a ring but we should be able to
>present an engine (again likely a gem context in i915) to the user that
>orders VM_BINDs / EXECs if that is what Vulkan expects, at least I think.
>

I have responded in the other thread on this.

Niranjana

>Hopefully Niranjana + Daniel agree.
>
>Matt
>
>> -Lionel
>>
>>
>> >
>> > > fence1 is not signaled
>> > >
>> > > fence3 is signaled
>> > >
>> > > So the second VM_BIND will proceed before the first VM_BIND.
>> > >
>> > >
>> > > I guess we can deal with that scenario in userspace by doing the wait
>> > > ourselves in one thread per engines.
>> > >
>> > > But then it makes the VM_BIND input fences useless.
>> > >
>> > >
>> > > Daniel : what do you think? Should be rework this or just deal with wait
>> > > fences in userspace?
>> > >
>> > >
>> > > Sorry I noticed this late.
>> > >
>> > >
>> > > -Lionel
>> > >
>> > >
>>

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [Intel-gfx] [RFC v3 1/3] drm/doc/rfc: VM_BIND feature design document
@ 2022-06-02 20:24             ` Niranjana Vishwanathapura
  0 siblings, 0 replies; 121+ messages in thread
From: Niranjana Vishwanathapura @ 2022-06-02 20:24 UTC (permalink / raw)
  To: Matthew Brost
  Cc: chris.p.wilson, intel-gfx, dri-devel, thomas.hellstrom,
	daniel.vetter, christian.koenig

On Thu, Jun 02, 2022 at 09:22:46AM -0700, Matthew Brost wrote:
>On Thu, Jun 02, 2022 at 08:42:13AM +0300, Lionel Landwerlin wrote:
>> On 02/06/2022 00:18, Matthew Brost wrote:
>> > On Wed, Jun 01, 2022 at 05:25:49PM +0300, Lionel Landwerlin wrote:
>> > > On 17/05/2022 21:32, Niranjana Vishwanathapura wrote:
>> > > > +VM_BIND/UNBIND ioctl will immediately start binding/unbinding the mapping in an
>> > > > +async worker. The binding and unbinding will work like a special GPU engine.
>> > > > +The binding and unbinding operations are serialized and will wait on specified
>> > > > +input fences before the operation and will signal the output fences upon the
>> > > > +completion of the operation. Due to serialization, completion of an operation
>> > > > +will also indicate that all previous operations are also complete.
>> > > I guess we should avoid saying "will immediately start binding/unbinding" if
>> > > there are fences involved.
>> > >
>> > > And the fact that it's happening in an async worker seem to imply it's not
>> > > immediate.
>> > >
>> > >
>> > > I have a question on the behavior of the bind operation when no input fence
>> > > is provided. Let say I do :
>> > >
>> > > VM_BIND (out_fence=fence1)
>> > >
>> > > VM_BIND (out_fence=fence2)
>> > >
>> > > VM_BIND (out_fence=fence3)
>> > >
>> > >
>> > > In what order are the fences going to be signaled?
>> > >
>> > > In the order of VM_BIND ioctls? Or out of order?
>> > >
>> > > Because you wrote "serialized I assume it's : in order
>> > >
>> > >
>> > > One thing I didn't realize is that because we only get one "VM_BIND" engine,
>> > > there is a disconnect from the Vulkan specification.
>> > >
>> > > In Vulkan VM_BIND operations are serialized but per engine.
>> > >
>> > > So you could have something like this :
>> > >
>> > > VM_BIND (engine=rcs0, in_fence=fence1, out_fence=fence2)
>> > >
>> > > VM_BIND (engine=ccs0, in_fence=fence3, out_fence=fence4)
>> > >
>> > Question - let's say this done after the above operations:
>> >
>> > EXEC (engine=ccs0, in_fence=NULL, out_fence=NULL)
>> >
>> > Is the exec ordered with respected to bind (i.e. would fence3 & 4 be
>> > signaled before the exec starts)?
>> >
>> > Matt
>>
>>
>> Hi Matt,
>>
>> From the vulkan point of view, everything is serialized within an engine (we
>> map that to a VkQueue).
>>
>> So with :
>>
>> EXEC (engine=ccs0, in_fence=NULL, out_fence=NULL)
>> VM_BIND (engine=ccs0, in_fence=fence3, out_fence=fence4)
>>
>> EXEC completes first then VM_BIND executes.
>>
>>
>> To be even clearer :
>>
>> EXEC (engine=ccs0, in_fence=fence2, out_fence=NULL)
>> VM_BIND (engine=ccs0, in_fence=fence3, out_fence=fence4)
>>
>>
>> EXEC will wait until fence2 is signaled.
>> Once fence2 is signaled, EXEC proceeds, finishes and only after it is done, VM_BIND executes.
>>
>> It would kind of like having the VM_BIND operation be another batch executed from the ringbuffer buffer.
>>
>
>Yea this makes sense. I think of VM_BINDs as more or less just another
>version of an EXEC and this fits with that.
>

Note that VM_BIND itself can bind while and EXEC (GPU job) is running.
(Say, getting binds ready for next submission). It is up to user though,
how to use it.

>In practice I don't think we can share a ring but we should be able to
>present an engine (again likely a gem context in i915) to the user that
>orders VM_BINDs / EXECs if that is what Vulkan expects, at least I think.
>

I have responded in the other thread on this.

Niranjana

>Hopefully Niranjana + Daniel agree.
>
>Matt
>
>> -Lionel
>>
>>
>> >
>> > > fence1 is not signaled
>> > >
>> > > fence3 is signaled
>> > >
>> > > So the second VM_BIND will proceed before the first VM_BIND.
>> > >
>> > >
>> > > I guess we can deal with that scenario in userspace by doing the wait
>> > > ourselves in one thread per engines.
>> > >
>> > > But then it makes the VM_BIND input fences useless.
>> > >
>> > >
>> > > Daniel : what do you think? Should be rework this or just deal with wait
>> > > fences in userspace?
>> > >
>> > >
>> > > Sorry I noticed this late.
>> > >
>> > >
>> > > -Lionel
>> > >
>> > >
>>

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [Intel-gfx] [RFC v3 1/3] drm/doc/rfc: VM_BIND feature design document
  2022-06-02 20:11         ` Niranjana Vishwanathapura
@ 2022-06-02 20:35           ` Jason Ekstrand
  -1 siblings, 0 replies; 121+ messages in thread
From: Jason Ekstrand @ 2022-06-02 20:35 UTC (permalink / raw)
  To: Niranjana Vishwanathapura
  Cc: Matthew Brost, Intel GFX, Maling list - DRI developers,
	Thomas Hellstrom, Chris Wilson, Daniel Vetter,
	Christian König

[-- Attachment #1: Type: text/plain, Size: 7866 bytes --]

On Thu, Jun 2, 2022 at 3:11 PM Niranjana Vishwanathapura <
niranjana.vishwanathapura@intel.com> wrote:

> On Wed, Jun 01, 2022 at 01:28:36PM -0700, Matthew Brost wrote:
> >On Wed, Jun 01, 2022 at 05:25:49PM +0300, Lionel Landwerlin wrote:
> >> On 17/05/2022 21:32, Niranjana Vishwanathapura wrote:
> >> > +VM_BIND/UNBIND ioctl will immediately start binding/unbinding the
> mapping in an
> >> > +async worker. The binding and unbinding will work like a special GPU
> engine.
> >> > +The binding and unbinding operations are serialized and will wait on
> specified
> >> > +input fences before the operation and will signal the output fences
> upon the
> >> > +completion of the operation. Due to serialization, completion of an
> operation
> >> > +will also indicate that all previous operations are also complete.
> >>
> >> I guess we should avoid saying "will immediately start
> binding/unbinding" if
> >> there are fences involved.
> >>
> >> And the fact that it's happening in an async worker seem to imply it's
> not
> >> immediate.
> >>
>
> Ok, will fix.
> This was added because in earlier design binding was deferred until next
> execbuff.
> But now it is non-deferred (immediate in that sense). But yah, this is
> confusing
> and will fix it.
>
> >>
> >> I have a question on the behavior of the bind operation when no input
> fence
> >> is provided. Let say I do :
> >>
> >> VM_BIND (out_fence=fence1)
> >>
> >> VM_BIND (out_fence=fence2)
> >>
> >> VM_BIND (out_fence=fence3)
> >>
> >>
> >> In what order are the fences going to be signaled?
> >>
> >> In the order of VM_BIND ioctls? Or out of order?
> >>
> >> Because you wrote "serialized I assume it's : in order
> >>
>
> Yes, in the order of VM_BIND/UNBIND ioctls. Note that bind and unbind will
> use
> the same queue and hence are ordered.
>
> >>
> >> One thing I didn't realize is that because we only get one "VM_BIND"
> engine,
> >> there is a disconnect from the Vulkan specification.
> >>
> >> In Vulkan VM_BIND operations are serialized but per engine.
> >>
> >> So you could have something like this :
> >>
> >> VM_BIND (engine=rcs0, in_fence=fence1, out_fence=fence2)
> >>
> >> VM_BIND (engine=ccs0, in_fence=fence3, out_fence=fence4)
> >>
> >>
> >> fence1 is not signaled
> >>
> >> fence3 is signaled
> >>
> >> So the second VM_BIND will proceed before the first VM_BIND.
> >>
> >>
> >> I guess we can deal with that scenario in userspace by doing the wait
> >> ourselves in one thread per engines.
> >>
> >> But then it makes the VM_BIND input fences useless.
> >>
> >>
> >> Daniel : what do you think? Should be rework this or just deal with wait
> >> fences in userspace?
> >>
> >
> >My opinion is rework this but make the ordering via an engine param
> optional.
> >
> >e.g. A VM can be configured so all binds are ordered within the VM
> >
> >e.g. A VM can be configured so all binds accept an engine argument (in
> >the case of the i915 likely this is a gem context handle) and binds
> >ordered with respect to that engine.
> >
> >This gives UMDs options as the later likely consumes more KMD resources
> >so if a different UMD can live with binds being ordered within the VM
> >they can use a mode consuming less resources.
> >
>
> I think we need to be careful here if we are looking for some out of
> (submission) order completion of vm_bind/unbind.
> In-order completion means, in a batch of binds and unbinds to be
> completed in-order, user only needs to specify in-fence for the
> first bind/unbind call and the our-fence for the last bind/unbind
> call. Also, the VA released by an unbind call can be re-used by
> any subsequent bind call in that in-order batch.
>
> These things will break if binding/unbinding were to be allowed to
> go out of order (of submission) and user need to be extra careful
> not to run into pre-mature triggereing of out-fence and bind failing
> as VA is still in use etc.
>
> Also, VM_BIND binds the provided mapping on the specified address space
> (VM). So, the uapi is not engine/context specific.
>
> We can however add a 'queue' to the uapi which can be one from the
> pre-defined queues,
> I915_VM_BIND_QUEUE_0
> I915_VM_BIND_QUEUE_1
> ...
> I915_VM_BIND_QUEUE_(N-1)
>
> KMD will spawn an async work queue for each queue which will only
> bind the mappings on that queue in the order of submission.
> User can assign the queue to per engine or anything like that.
>
> But again here, user need to be careful and not deadlock these
> queues with circular dependency of fences.
>
> I prefer adding this later an as extension based on whether it
> is really helping with the implementation.
>

I can tell you right now that having everything on a single in-order queue
will not get us the perf we want.  What vulkan really wants is one of two
things:

 1. No implicit ordering of VM_BIND ops.  They just happen in whatever
their dependencies are resolved and we ensure ordering ourselves by having
a syncobj in the VkQueue.

 2. The ability to create multiple VM_BIND queues.  We need at least 2 but
I don't see why there needs to be a limit besides the limits the i915 API
already has on the number of engines.  Vulkan could expose multiple sparse
binding queues to the client if it's not arbitrarily limited.

Why?  Because Vulkan has two basic kind of bind operations and we don't
want any dependencies between them:

 1. Immediate.  These happen right after BO creation or maybe as part of
vkBindImageMemory() or VkBindBufferMemory().  These don't happen on a queue
and we don't want them serialized with anything.  To synchronize with
submit, we'll have a syncobj in the VkDevice which is signaled by all
immediate bind operations and make submits wait on it.

 2. Queued (sparse): These happen on a VkQueue which may be the same as a
render/compute queue or may be its own queue.  It's up to us what we want
to advertise.  From the Vulkan API PoV, this is like any other queue.
Operations on it wait on and signal semaphores.  If we have a VM_BIND
engine, we'd provide syncobjs to wait and signal just like we do in
execbuf().

The important thing is that we don't want one type of operation to block on
the other.  If immediate binds are blocking on sparse binds, it's going to
cause over-synchronization issues.

In terms of the internal implementation, I know that there's going to be a
lock on the VM and that we can't actually do these things in parallel.
That's fine.  Once the dma_fences have signaled and we're unblocked to do
the bind operation, I don't care if there's a bit of synchronization due to
locking.  That's expected.  What we can't afford to have is an immediate
bind operation suddenly blocking on a sparse operation which is blocked on
a compute job that's going to run for another 5ms.

For reference, Windows solves this by allowing arbitrarily many paging
queues (what they call a VM_BIND engine/queue).  That design works pretty
well and solves the problems in question.  Again, we could just make
everything out-of-order and require using syncobjs to order things as
userspace wants. That'd be fine too.

One more note while I'm here: danvet said something on IRC about VM_BIND
queues waiting for syncobjs to materialize.  We don't really want/need
this.  We already have all the machinery in userspace to handle
wait-before-signal and waiting for syncobj fences to materialize and that
machinery is on by default.  It would actually take MORE work in Mesa to
turn it off and take advantage of the kernel being able to wait for
syncobjs to materialize.  Also, getting that right is ridiculously hard and
I really don't want to get it wrong in kernel space.  When we do memory
fences, wait-before-signal will be a thing.  We don't need to try and make
it a thing for syncobj.

--Jason


> Daniel, any thoughts?
>
> Niranjana
>
> >Matt
> >
> >>
> >> Sorry I noticed this late.
> >>
> >>
> >> -Lionel
> >>
> >>
>

[-- Attachment #2: Type: text/html, Size: 9607 bytes --]

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [Intel-gfx] [RFC v3 1/3] drm/doc/rfc: VM_BIND feature design document
@ 2022-06-02 20:35           ` Jason Ekstrand
  0 siblings, 0 replies; 121+ messages in thread
From: Jason Ekstrand @ 2022-06-02 20:35 UTC (permalink / raw)
  To: Niranjana Vishwanathapura
  Cc: Intel GFX, Maling list - DRI developers, Thomas Hellstrom,
	Chris Wilson, Daniel Vetter, Christian König

[-- Attachment #1: Type: text/plain, Size: 7866 bytes --]

On Thu, Jun 2, 2022 at 3:11 PM Niranjana Vishwanathapura <
niranjana.vishwanathapura@intel.com> wrote:

> On Wed, Jun 01, 2022 at 01:28:36PM -0700, Matthew Brost wrote:
> >On Wed, Jun 01, 2022 at 05:25:49PM +0300, Lionel Landwerlin wrote:
> >> On 17/05/2022 21:32, Niranjana Vishwanathapura wrote:
> >> > +VM_BIND/UNBIND ioctl will immediately start binding/unbinding the
> mapping in an
> >> > +async worker. The binding and unbinding will work like a special GPU
> engine.
> >> > +The binding and unbinding operations are serialized and will wait on
> specified
> >> > +input fences before the operation and will signal the output fences
> upon the
> >> > +completion of the operation. Due to serialization, completion of an
> operation
> >> > +will also indicate that all previous operations are also complete.
> >>
> >> I guess we should avoid saying "will immediately start
> binding/unbinding" if
> >> there are fences involved.
> >>
> >> And the fact that it's happening in an async worker seem to imply it's
> not
> >> immediate.
> >>
>
> Ok, will fix.
> This was added because in earlier design binding was deferred until next
> execbuff.
> But now it is non-deferred (immediate in that sense). But yah, this is
> confusing
> and will fix it.
>
> >>
> >> I have a question on the behavior of the bind operation when no input
> fence
> >> is provided. Let say I do :
> >>
> >> VM_BIND (out_fence=fence1)
> >>
> >> VM_BIND (out_fence=fence2)
> >>
> >> VM_BIND (out_fence=fence3)
> >>
> >>
> >> In what order are the fences going to be signaled?
> >>
> >> In the order of VM_BIND ioctls? Or out of order?
> >>
> >> Because you wrote "serialized I assume it's : in order
> >>
>
> Yes, in the order of VM_BIND/UNBIND ioctls. Note that bind and unbind will
> use
> the same queue and hence are ordered.
>
> >>
> >> One thing I didn't realize is that because we only get one "VM_BIND"
> engine,
> >> there is a disconnect from the Vulkan specification.
> >>
> >> In Vulkan VM_BIND operations are serialized but per engine.
> >>
> >> So you could have something like this :
> >>
> >> VM_BIND (engine=rcs0, in_fence=fence1, out_fence=fence2)
> >>
> >> VM_BIND (engine=ccs0, in_fence=fence3, out_fence=fence4)
> >>
> >>
> >> fence1 is not signaled
> >>
> >> fence3 is signaled
> >>
> >> So the second VM_BIND will proceed before the first VM_BIND.
> >>
> >>
> >> I guess we can deal with that scenario in userspace by doing the wait
> >> ourselves in one thread per engines.
> >>
> >> But then it makes the VM_BIND input fences useless.
> >>
> >>
> >> Daniel : what do you think? Should be rework this or just deal with wait
> >> fences in userspace?
> >>
> >
> >My opinion is rework this but make the ordering via an engine param
> optional.
> >
> >e.g. A VM can be configured so all binds are ordered within the VM
> >
> >e.g. A VM can be configured so all binds accept an engine argument (in
> >the case of the i915 likely this is a gem context handle) and binds
> >ordered with respect to that engine.
> >
> >This gives UMDs options as the later likely consumes more KMD resources
> >so if a different UMD can live with binds being ordered within the VM
> >they can use a mode consuming less resources.
> >
>
> I think we need to be careful here if we are looking for some out of
> (submission) order completion of vm_bind/unbind.
> In-order completion means, in a batch of binds and unbinds to be
> completed in-order, user only needs to specify in-fence for the
> first bind/unbind call and the our-fence for the last bind/unbind
> call. Also, the VA released by an unbind call can be re-used by
> any subsequent bind call in that in-order batch.
>
> These things will break if binding/unbinding were to be allowed to
> go out of order (of submission) and user need to be extra careful
> not to run into pre-mature triggereing of out-fence and bind failing
> as VA is still in use etc.
>
> Also, VM_BIND binds the provided mapping on the specified address space
> (VM). So, the uapi is not engine/context specific.
>
> We can however add a 'queue' to the uapi which can be one from the
> pre-defined queues,
> I915_VM_BIND_QUEUE_0
> I915_VM_BIND_QUEUE_1
> ...
> I915_VM_BIND_QUEUE_(N-1)
>
> KMD will spawn an async work queue for each queue which will only
> bind the mappings on that queue in the order of submission.
> User can assign the queue to per engine or anything like that.
>
> But again here, user need to be careful and not deadlock these
> queues with circular dependency of fences.
>
> I prefer adding this later an as extension based on whether it
> is really helping with the implementation.
>

I can tell you right now that having everything on a single in-order queue
will not get us the perf we want.  What vulkan really wants is one of two
things:

 1. No implicit ordering of VM_BIND ops.  They just happen in whatever
their dependencies are resolved and we ensure ordering ourselves by having
a syncobj in the VkQueue.

 2. The ability to create multiple VM_BIND queues.  We need at least 2 but
I don't see why there needs to be a limit besides the limits the i915 API
already has on the number of engines.  Vulkan could expose multiple sparse
binding queues to the client if it's not arbitrarily limited.

Why?  Because Vulkan has two basic kind of bind operations and we don't
want any dependencies between them:

 1. Immediate.  These happen right after BO creation or maybe as part of
vkBindImageMemory() or VkBindBufferMemory().  These don't happen on a queue
and we don't want them serialized with anything.  To synchronize with
submit, we'll have a syncobj in the VkDevice which is signaled by all
immediate bind operations and make submits wait on it.

 2. Queued (sparse): These happen on a VkQueue which may be the same as a
render/compute queue or may be its own queue.  It's up to us what we want
to advertise.  From the Vulkan API PoV, this is like any other queue.
Operations on it wait on and signal semaphores.  If we have a VM_BIND
engine, we'd provide syncobjs to wait and signal just like we do in
execbuf().

The important thing is that we don't want one type of operation to block on
the other.  If immediate binds are blocking on sparse binds, it's going to
cause over-synchronization issues.

In terms of the internal implementation, I know that there's going to be a
lock on the VM and that we can't actually do these things in parallel.
That's fine.  Once the dma_fences have signaled and we're unblocked to do
the bind operation, I don't care if there's a bit of synchronization due to
locking.  That's expected.  What we can't afford to have is an immediate
bind operation suddenly blocking on a sparse operation which is blocked on
a compute job that's going to run for another 5ms.

For reference, Windows solves this by allowing arbitrarily many paging
queues (what they call a VM_BIND engine/queue).  That design works pretty
well and solves the problems in question.  Again, we could just make
everything out-of-order and require using syncobjs to order things as
userspace wants. That'd be fine too.

One more note while I'm here: danvet said something on IRC about VM_BIND
queues waiting for syncobjs to materialize.  We don't really want/need
this.  We already have all the machinery in userspace to handle
wait-before-signal and waiting for syncobj fences to materialize and that
machinery is on by default.  It would actually take MORE work in Mesa to
turn it off and take advantage of the kernel being able to wait for
syncobjs to materialize.  Also, getting that right is ridiculously hard and
I really don't want to get it wrong in kernel space.  When we do memory
fences, wait-before-signal will be a thing.  We don't need to try and make
it a thing for syncobj.

--Jason


> Daniel, any thoughts?
>
> Niranjana
>
> >Matt
> >
> >>
> >> Sorry I noticed this late.
> >>
> >>
> >> -Lionel
> >>
> >>
>

[-- Attachment #2: Type: text/html, Size: 9607 bytes --]

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [RFC v3 1/3] drm/doc/rfc: VM_BIND feature design document
  2022-06-02  2:13     ` [Intel-gfx] " Zeng, Oak
@ 2022-06-02 20:48       ` Niranjana Vishwanathapura
  -1 siblings, 0 replies; 121+ messages in thread
From: Niranjana Vishwanathapura @ 2022-06-02 20:48 UTC (permalink / raw)
  To: Zeng, Oak
  Cc: Brost, Matthew, intel-gfx, dri-devel, Hellstrom, Thomas, Wilson,
	Chris P, jason, Vetter, Daniel, christian.koenig

On Wed, Jun 01, 2022 at 07:13:16PM -0700, Zeng, Oak wrote:
>
>
>Regards,
>Oak
>
>> -----Original Message-----
>> From: dri-devel <dri-devel-bounces@lists.freedesktop.org> On Behalf Of
>> Niranjana Vishwanathapura
>> Sent: May 17, 2022 2:32 PM
>> To: intel-gfx@lists.freedesktop.org; dri-devel@lists.freedesktop.org; Vetter,
>> Daniel <daniel.vetter@intel.com>
>> Cc: Brost, Matthew <matthew.brost@intel.com>; Hellstrom, Thomas
>> <thomas.hellstrom@intel.com>; jason@jlekstrand.net; Wilson, Chris P
>> <chris.p.wilson@intel.com>; christian.koenig@amd.com
>> Subject: [RFC v3 1/3] drm/doc/rfc: VM_BIND feature design document
>>
>> VM_BIND design document with description of intended use cases.
>>
>> v2: Add more documentation and format as per review comments
>>     from Daniel.
>>
>> Signed-off-by: Niranjana Vishwanathapura
>> <niranjana.vishwanathapura@intel.com>
>> ---
>>  Documentation/driver-api/dma-buf.rst   |   2 +
>>  Documentation/gpu/rfc/i915_vm_bind.rst | 304
>> +++++++++++++++++++++++++
>>  Documentation/gpu/rfc/index.rst        |   4 +
>>  3 files changed, 310 insertions(+)
>>  create mode 100644 Documentation/gpu/rfc/i915_vm_bind.rst
>>
>> diff --git a/Documentation/driver-api/dma-buf.rst b/Documentation/driver-
>> api/dma-buf.rst
>> index 36a76cbe9095..64cb924ec5bb 100644
>> --- a/Documentation/driver-api/dma-buf.rst
>> +++ b/Documentation/driver-api/dma-buf.rst
>> @@ -200,6 +200,8 @@ DMA Fence uABI/Sync File
>>  .. kernel-doc:: include/linux/sync_file.h
>>     :internal:
>>
>> +.. _indefinite_dma_fences:
>> +
>>  Indefinite DMA Fences
>>  ~~~~~~~~~~~~~~~~~~~~~
>>
>> diff --git a/Documentation/gpu/rfc/i915_vm_bind.rst
>> b/Documentation/gpu/rfc/i915_vm_bind.rst
>> new file mode 100644
>> index 000000000000..f1be560d313c
>> --- /dev/null
>> +++ b/Documentation/gpu/rfc/i915_vm_bind.rst
>> @@ -0,0 +1,304 @@
>> +==========================================
>> +I915 VM_BIND feature design and use cases
>> +==========================================
>> +
>> +VM_BIND feature
>> +================
>> +DRM_I915_GEM_VM_BIND/UNBIND ioctls allows UMD to bind/unbind GEM
>> buffer
>> +objects (BOs) or sections of a BOs at specified GPU virtual addresses on a
>> +specified address space (VM). These mappings (also referred to as persistent
>> +mappings) will be persistent across multiple GPU submissions (execbuff calls)
>> +issued by the UMD, without user having to provide a list of all required
>> +mappings during each submission (as required by older execbuff mode).
>> +
>> +VM_BIND/UNBIND ioctls will support 'in' and 'out' fences to allow userpace
>> +to specify how the binding/unbinding should sync with other operations
>> +like the GPU job submission. These fences will be timeline 'drm_syncobj's
>> +for non-Compute contexts (See struct
>> drm_i915_vm_bind_ext_timeline_fences).
>> +For Compute contexts, they will be user/memory fences (See struct
>> +drm_i915_vm_bind_ext_user_fence).
>> +
>> +VM_BIND feature is advertised to user via I915_PARAM_HAS_VM_BIND.
>> +User has to opt-in for VM_BIND mode of binding for an address space (VM)
>> +during VM creation time via I915_VM_CREATE_FLAGS_USE_VM_BIND
>> extension.
>> +
>> +VM_BIND/UNBIND ioctl will immediately start binding/unbinding the mapping in
>> an
>> +async worker. The binding and unbinding will work like a special GPU engine.
>> +The binding and unbinding operations are serialized and will wait on specified
>> +input fences before the operation and will signal the output fences upon the
>> +completion of the operation. Due to serialization, completion of an operation
>> +will also indicate that all previous operations are also complete.
>
>Hi,
>
>Is user required to wait for the out fence be signaled before submit a gpu job using the vm_bind address?
>Or is user required to order the gpu job to make gpu job run after vm_bind out fence signaled?
>

Thanks Oak,
Either should be fine and up to user how to use vm_bind/unbind out-fence.

>I think there could be different behavior on a non-faultable platform and a faultable platform, such as on a non-faultable
>Platform, gpu job is required to be order after vm_bind out fence signaling; and on a faultable platform, there is no such
>Restriction since vm bind can be finished in the fault handler?
>

With GPU page faults handler, out fence won't be needed as residency is
purely managed by page fault handler populating page tables (there is a
mention of it in GPU Page Faults section below).

>Should we document such thing?
>

We don't talk much about GPU page faults case in this document as that may
warrent a separate rfc when we add page faults support. We did mention it
in couple places to ensure our locking design here is extensible to gpu
page faults case.

Niranjana

>Regards,
>Oak
>
>
>> +
>> +VM_BIND features include:
>> +
>> +* Multiple Virtual Address (VA) mappings can map to the same physical pages
>> +  of an object (aliasing).
>> +* VA mapping can map to a partial section of the BO (partial binding).
>> +* Support capture of persistent mappings in the dump upon GPU error.
>> +* TLB is flushed upon unbind completion. Batching of TLB flushes in some
>> +  use cases will be helpful.
>> +* Asynchronous vm_bind and vm_unbind support with 'in' and 'out' fences.
>> +* Support for userptr gem objects (no special uapi is required for this).
>> +
>> +Execbuff ioctl in VM_BIND mode
>> +-------------------------------
>> +The execbuff ioctl handling in VM_BIND mode differs significantly from the
>> +older method. A VM in VM_BIND mode will not support older execbuff mode of
>> +binding. In VM_BIND mode, execbuff ioctl will not accept any execlist. Hence,
>> +no support for implicit sync. It is expected that the below work will be able
>> +to support requirements of object dependency setting in all use cases:
>> +
>> +"dma-buf: Add an API for exporting sync files"
>> +(https://lwn.net/Articles/859290/)
>> +
>> +This also means, we need an execbuff extension to pass in the batch
>> +buffer addresses (See struct
>> drm_i915_gem_execbuffer_ext_batch_addresses).
>> +
>> +If at all execlist support in execbuff ioctl is deemed necessary for
>> +implicit sync in certain use cases, then support can be added later.
>> +
>> +In VM_BIND mode, VA allocation is completely managed by the user instead of
>> +the i915 driver. Hence all VA assignment, eviction are not applicable in
>> +VM_BIND mode. Also, for determining object activeness, VM_BIND mode will
>> not
>> +be using the i915_vma active reference tracking. It will instead use dma-resv
>> +object for that (See `VM_BIND dma_resv usage`_).
>> +
>> +So, a lot of existing code in the execbuff path like relocations, VA evictions,
>> +vma lookup table, implicit sync, vma active reference tracking etc., are not
>> +applicable in VM_BIND mode. Hence, the execbuff path needs to be cleaned up
>> +by clearly separating out the functionalities where the VM_BIND mode differs
>> +from older method and they should be moved to separate files.
>> +
>> +VM_PRIVATE objects
>> +-------------------
>> +By default, BOs can be mapped on multiple VMs and can also be dma-buf
>> +exported. Hence these BOs are referred to as Shared BOs.
>> +During each execbuff submission, the request fence must be added to the
>> +dma-resv fence list of all shared BOs mapped on the VM.
>> +
>> +VM_BIND feature introduces an optimization where user can create BO which
>> +is private to a specified VM via I915_GEM_CREATE_EXT_VM_PRIVATE flag
>> during
>> +BO creation. Unlike Shared BOs, these VM private BOs can only be mapped on
>> +the VM they are private to and can't be dma-buf exported.
>> +All private BOs of a VM share the dma-resv object. Hence during each execbuff
>> +submission, they need only one dma-resv fence list updated. Thus, the fast
>> +path (where required mappings are already bound) submission latency is O(1)
>> +w.r.t the number of VM private BOs.
>> +
>> +VM_BIND locking hirarchy
>> +-------------------------
>> +The locking design here supports the older (execlist based) execbuff mode, the
>> +newer VM_BIND mode, the VM_BIND mode with GPU page faults and possible
>> future
>> +system allocator support (See `Shared Virtual Memory (SVM) support`_).
>> +The older execbuff mode and the newer VM_BIND mode without page faults
>> manages
>> +residency of backing storage using dma_fence. The VM_BIND mode with page
>> faults
>> +and the system allocator support do not use any dma_fence at all.
>> +
>> +VM_BIND locking order is as below.
>> +
>> +1) Lock-A: A vm_bind mutex will protect vm_bind lists. This lock is taken in
>> +   vm_bind/vm_unbind ioctl calls, in the execbuff path and while releasing the
>> +   mapping.
>> +
>> +   In future, when GPU page faults are supported, we can potentially use a
>> +   rwsem instead, so that multiple page fault handlers can take the read side
>> +   lock to lookup the mapping and hence can run in parallel.
>> +   The older execbuff mode of binding do not need this lock.
>> +
>> +2) Lock-B: The object's dma-resv lock will protect i915_vma state and needs to
>> +   be held while binding/unbinding a vma in the async worker and while updating
>> +   dma-resv fence list of an object. Note that private BOs of a VM will all
>> +   share a dma-resv object.
>> +
>> +   The future system allocator support will use the HMM prescribed locking
>> +   instead.
>> +
>> +3) Lock-C: Spinlock/s to protect some of the VM's lists like the list of
>> +   invalidated vmas (due to eviction and userptr invalidation) etc.
>> +
>> +When GPU page faults are supported, the execbuff path do not take any of
>> these
>> +locks. There we will simply smash the new batch buffer address into the ring
>> and
>> +then tell the scheduler run that. The lock taking only happens from the page
>> +fault handler, where we take lock-A in read mode, whichever lock-B we need to
>> +find the backing storage (dma_resv lock for gem objects, and hmm/core mm for
>> +system allocator) and some additional locks (lock-D) for taking care of page
>> +table races. Page fault mode should not need to ever manipulate the vm lists,
>> +so won't ever need lock-C.
>> +
>> +VM_BIND LRU handling
>> +---------------------
>> +We need to ensure VM_BIND mapped objects are properly LRU tagged to avoid
>> +performance degradation. We will also need support for bulk LRU movement of
>> +VM_BIND objects to avoid additional latencies in execbuff path.
>> +
>> +The page table pages are similar to VM_BIND mapped objects (See
>> +`Evictable page table allocations`_) and are maintained per VM and needs to
>> +be pinned in memory when VM is made active (ie., upon an execbuff call with
>> +that VM). So, bulk LRU movement of page table pages is also needed.
>> +
>> +The i915 shrinker LRU has stopped being an LRU. So, it should also be moved
>> +over to the ttm LRU in some fashion to make sure we once again have a
>> reasonable
>> +and consistent memory aging and reclaim architecture.
>> +
>> +VM_BIND dma_resv usage
>> +-----------------------
>> +Fences needs to be added to all VM_BIND mapped objects. During each
>> execbuff
>> +submission, they are added with DMA_RESV_USAGE_BOOKKEEP usage to
>> prevent
>> +over sync (See enum dma_resv_usage). One can override it with either
>> +DMA_RESV_USAGE_READ or DMA_RESV_USAGE_WRITE usage during object
>> dependency
>> +setting (either through explicit or implicit mechanism).
>> +
>> +When vm_bind is called for a non-private object while the VM is already
>> +active, the fences need to be copied from VM's shared dma-resv object
>> +(common to all private objects of the VM) to this non-private object.
>> +If this results in performance degradation, then some optimization will
>> +be needed here. This is not a problem for VM's private objects as they use
>> +shared dma-resv object which is always updated on each execbuff submission.
>> +
>> +Also, in VM_BIND mode, use dma-resv apis for determining object activeness
>> +(See dma_resv_test_signaled() and dma_resv_wait_timeout()) and do not use
>> the
>> +older i915_vma active reference tracking which is deprecated. This should be
>> +easier to get it working with the current TTM backend. We can remove the
>> +i915_vma active reference tracking fully while supporting TTM backend for igfx.
>> +
>> +Evictable page table allocations
>> +---------------------------------
>> +Make pagetable allocations evictable and manage them similar to VM_BIND
>> +mapped objects. Page table pages are similar to persistent mappings of a
>> +VM (difference here are that the page table pages will not have an i915_vma
>> +structure and after swapping pages back in, parent page link needs to be
>> +updated).
>> +
>> +Mesa use case
>> +--------------
>> +VM_BIND can potentially reduce the CPU overhead in Mesa (both Vulkan and
>> Iris),
>> +hence improving performance of CPU-bound applications. It also allows us to
>> +implement Vulkan's Sparse Resources. With increasing GPU hardware
>> performance,
>> +reducing CPU overhead becomes more impactful.
>> +
>> +
>> +VM_BIND Compute support
>> +========================
>> +
>> +User/Memory Fence
>> +------------------
>> +The idea is to take a user specified virtual address and install an interrupt
>> +handler to wake up the current task when the memory location passes the user
>> +supplied filter. User/Memory fence is a <address, value> pair. To signal the
>> +user fence, specified value will be written at the specified virtual address
>> +and wakeup the waiting process. User can wait on a user fence with the
>> +gem_wait_user_fence ioctl.
>> +
>> +It also allows the user to emit their own MI_FLUSH/PIPE_CONTROL notify
>> +interrupt within their batches after updating the value to have sub-batch
>> +precision on the wakeup. Each batch can signal a user fence to indicate
>> +the completion of next level batch. The completion of very first level batch
>> +needs to be signaled by the command streamer. The user must provide the
>> +user/memory fence for this via the
>> DRM_I915_GEM_EXECBUFFER_EXT_USER_FENCE
>> +extension of execbuff ioctl, so that KMD can setup the command streamer to
>> +signal it.
>> +
>> +User/Memory fence can also be supplied to the kernel driver to signal/wake up
>> +the user process after completion of an asynchronous operation.
>> +
>> +When VM_BIND ioctl was provided with a user/memory fence via the
>> +I915_VM_BIND_EXT_USER_FENCE extension, it will be signaled upon the
>> completion
>> +of binding of that mapping. All async binds/unbinds are serialized, hence
>> +signaling of user/memory fence also indicate the completion of all previous
>> +binds/unbinds.
>> +
>> +This feature will be derived from the below original work:
>> +https://patchwork.freedesktop.org/patch/349417/
>> +
>> +Long running Compute contexts
>> +------------------------------
>> +Usage of dma-fence expects that they complete in reasonable amount of time.
>> +Compute on the other hand can be long running. Hence it is appropriate for
>> +compute to use user/memory fence and dma-fence usage will be limited to
>> +in-kernel consumption only. This requires an execbuff uapi extension to pass
>> +in user fence (See struct drm_i915_vm_bind_ext_user_fence). Compute must
>> opt-in
>> +for this mechanism with I915_CONTEXT_CREATE_FLAGS_LONG_RUNNING flag
>> during
>> +context creation. The dma-fence based user interfaces like gem_wait ioctl and
>> +execbuff out fence are not allowed on long running contexts. Implicit sync is
>> +not valid as well and is anyway not supported in VM_BIND mode.
>> +
>> +Where GPU page faults are not available, kernel driver upon buffer invalidation
>> +will initiate a suspend (preemption) of long running context with a dma-fence
>> +attached to it. And upon completion of that suspend fence, finish the
>> +invalidation, revalidate the BO and then resume the compute context. This is
>> +done by having a per-context preempt fence (also called suspend fence)
>> proxying
>> +as i915_request fence. This suspend fence is enabled when someone tries to
>> wait
>> +on it, which then triggers the context preemption.
>> +
>> +As this support for context suspension using a preempt fence and the resume
>> work
>> +for the compute mode contexts can get tricky to get it right, it is better to
>> +add this support in drm scheduler so that multiple drivers can make use of it.
>> +That means, it will have a dependency on i915 drm scheduler conversion with
>> GuC
>> +scheduler backend. This should be fine, as the plan is to support compute mode
>> +contexts only with GuC scheduler backend (at least initially). This is much
>> +easier to support with VM_BIND mode compared to the current heavier
>> execbuff
>> +path resource attachment.
>> +
>> +Low Latency Submission
>> +-----------------------
>> +Allows compute UMD to directly submit GPU jobs instead of through execbuff
>> +ioctl. This is made possible by VM_BIND is not being synchronized against
>> +execbuff. VM_BIND allows bind/unbind of mappings required for the directly
>> +submitted jobs.
>> +
>> +Other VM_BIND use cases
>> +========================
>> +
>> +Debugger
>> +---------
>> +With debug event interface user space process (debugger) is able to keep track
>> +of and act upon resources created by another process (debugged) and attached
>> +to GPU via vm_bind interface.
>> +
>> +GPU page faults
>> +----------------
>> +GPU page faults when supported (in future), will only be supported in the
>> +VM_BIND mode. While both the older execbuff mode and the newer VM_BIND
>> mode of
>> +binding will require using dma-fence to ensure residency, the GPU page faults
>> +mode when supported, will not use any dma-fence as residency is purely
>> managed
>> +by installing and removing/invalidating page table entries.
>> +
>> +Page level hints settings
>> +--------------------------
>> +VM_BIND allows any hints setting per mapping instead of per BO.
>> +Possible hints include read-only mapping, placement and atomicity.
>> +Sub-BO level placement hint will be even more relevant with
>> +upcoming GPU on-demand page fault support.
>> +
>> +Page level Cache/CLOS settings
>> +-------------------------------
>> +VM_BIND allows cache/CLOS settings per mapping instead of per BO.
>> +
>> +Shared Virtual Memory (SVM) support
>> +------------------------------------
>> +VM_BIND interface can be used to map system memory directly (without gem
>> BO
>> +abstraction) using the HMM interface. SVM is only supported with GPU page
>> +faults enabled.
>> +
>> +
>> +Broder i915 cleanups
>> +=====================
>> +Supporting this whole new vm_bind mode of binding which comes with its own
>> +use cases to support and the locking requirements requires proper integration
>> +with the existing i915 driver. This calls for some broader i915 driver
>> +cleanups/simplifications for maintainability of the driver going forward.
>> +Here are few things identified and are being looked into.
>> +
>> +- Remove vma lookup cache (eb->gem_context->handles_vma). VM_BIND
>> feature
>> +  do not use it and complexity it brings in is probably more than the
>> +  performance advantage we get in legacy execbuff case.
>> +- Remove vma->open_count counting
>> +- Remove i915_vma active reference tracking. VM_BIND feature will not be
>> using
>> +  it. Instead use underlying BO's dma-resv fence list to determine if a i915_vma
>> +  is active or not.
>> +
>> +
>> +VM_BIND UAPI
>> +=============
>> +
>> +.. kernel-doc:: Documentation/gpu/rfc/i915_vm_bind.h
>> diff --git a/Documentation/gpu/rfc/index.rst b/Documentation/gpu/rfc/index.rst
>> index 91e93a705230..7d10c36b268d 100644
>> --- a/Documentation/gpu/rfc/index.rst
>> +++ b/Documentation/gpu/rfc/index.rst
>> @@ -23,3 +23,7 @@ host such documentation:
>>  .. toctree::
>>
>>      i915_scheduler.rst
>> +
>> +.. toctree::
>> +
>> +    i915_vm_bind.rst
>> --
>> 2.21.0.rc0.32.g243a4c7e27
>

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [Intel-gfx] [RFC v3 1/3] drm/doc/rfc: VM_BIND feature design document
@ 2022-06-02 20:48       ` Niranjana Vishwanathapura
  0 siblings, 0 replies; 121+ messages in thread
From: Niranjana Vishwanathapura @ 2022-06-02 20:48 UTC (permalink / raw)
  To: Zeng, Oak
  Cc: intel-gfx, dri-devel, Hellstrom, Thomas, Wilson, Chris P, Vetter,
	Daniel, christian.koenig

On Wed, Jun 01, 2022 at 07:13:16PM -0700, Zeng, Oak wrote:
>
>
>Regards,
>Oak
>
>> -----Original Message-----
>> From: dri-devel <dri-devel-bounces@lists.freedesktop.org> On Behalf Of
>> Niranjana Vishwanathapura
>> Sent: May 17, 2022 2:32 PM
>> To: intel-gfx@lists.freedesktop.org; dri-devel@lists.freedesktop.org; Vetter,
>> Daniel <daniel.vetter@intel.com>
>> Cc: Brost, Matthew <matthew.brost@intel.com>; Hellstrom, Thomas
>> <thomas.hellstrom@intel.com>; jason@jlekstrand.net; Wilson, Chris P
>> <chris.p.wilson@intel.com>; christian.koenig@amd.com
>> Subject: [RFC v3 1/3] drm/doc/rfc: VM_BIND feature design document
>>
>> VM_BIND design document with description of intended use cases.
>>
>> v2: Add more documentation and format as per review comments
>>     from Daniel.
>>
>> Signed-off-by: Niranjana Vishwanathapura
>> <niranjana.vishwanathapura@intel.com>
>> ---
>>  Documentation/driver-api/dma-buf.rst   |   2 +
>>  Documentation/gpu/rfc/i915_vm_bind.rst | 304
>> +++++++++++++++++++++++++
>>  Documentation/gpu/rfc/index.rst        |   4 +
>>  3 files changed, 310 insertions(+)
>>  create mode 100644 Documentation/gpu/rfc/i915_vm_bind.rst
>>
>> diff --git a/Documentation/driver-api/dma-buf.rst b/Documentation/driver-
>> api/dma-buf.rst
>> index 36a76cbe9095..64cb924ec5bb 100644
>> --- a/Documentation/driver-api/dma-buf.rst
>> +++ b/Documentation/driver-api/dma-buf.rst
>> @@ -200,6 +200,8 @@ DMA Fence uABI/Sync File
>>  .. kernel-doc:: include/linux/sync_file.h
>>     :internal:
>>
>> +.. _indefinite_dma_fences:
>> +
>>  Indefinite DMA Fences
>>  ~~~~~~~~~~~~~~~~~~~~~
>>
>> diff --git a/Documentation/gpu/rfc/i915_vm_bind.rst
>> b/Documentation/gpu/rfc/i915_vm_bind.rst
>> new file mode 100644
>> index 000000000000..f1be560d313c
>> --- /dev/null
>> +++ b/Documentation/gpu/rfc/i915_vm_bind.rst
>> @@ -0,0 +1,304 @@
>> +==========================================
>> +I915 VM_BIND feature design and use cases
>> +==========================================
>> +
>> +VM_BIND feature
>> +================
>> +DRM_I915_GEM_VM_BIND/UNBIND ioctls allows UMD to bind/unbind GEM
>> buffer
>> +objects (BOs) or sections of a BOs at specified GPU virtual addresses on a
>> +specified address space (VM). These mappings (also referred to as persistent
>> +mappings) will be persistent across multiple GPU submissions (execbuff calls)
>> +issued by the UMD, without user having to provide a list of all required
>> +mappings during each submission (as required by older execbuff mode).
>> +
>> +VM_BIND/UNBIND ioctls will support 'in' and 'out' fences to allow userpace
>> +to specify how the binding/unbinding should sync with other operations
>> +like the GPU job submission. These fences will be timeline 'drm_syncobj's
>> +for non-Compute contexts (See struct
>> drm_i915_vm_bind_ext_timeline_fences).
>> +For Compute contexts, they will be user/memory fences (See struct
>> +drm_i915_vm_bind_ext_user_fence).
>> +
>> +VM_BIND feature is advertised to user via I915_PARAM_HAS_VM_BIND.
>> +User has to opt-in for VM_BIND mode of binding for an address space (VM)
>> +during VM creation time via I915_VM_CREATE_FLAGS_USE_VM_BIND
>> extension.
>> +
>> +VM_BIND/UNBIND ioctl will immediately start binding/unbinding the mapping in
>> an
>> +async worker. The binding and unbinding will work like a special GPU engine.
>> +The binding and unbinding operations are serialized and will wait on specified
>> +input fences before the operation and will signal the output fences upon the
>> +completion of the operation. Due to serialization, completion of an operation
>> +will also indicate that all previous operations are also complete.
>
>Hi,
>
>Is user required to wait for the out fence be signaled before submit a gpu job using the vm_bind address?
>Or is user required to order the gpu job to make gpu job run after vm_bind out fence signaled?
>

Thanks Oak,
Either should be fine and up to user how to use vm_bind/unbind out-fence.

>I think there could be different behavior on a non-faultable platform and a faultable platform, such as on a non-faultable
>Platform, gpu job is required to be order after vm_bind out fence signaling; and on a faultable platform, there is no such
>Restriction since vm bind can be finished in the fault handler?
>

With GPU page faults handler, out fence won't be needed as residency is
purely managed by page fault handler populating page tables (there is a
mention of it in GPU Page Faults section below).

>Should we document such thing?
>

We don't talk much about GPU page faults case in this document as that may
warrent a separate rfc when we add page faults support. We did mention it
in couple places to ensure our locking design here is extensible to gpu
page faults case.

Niranjana

>Regards,
>Oak
>
>
>> +
>> +VM_BIND features include:
>> +
>> +* Multiple Virtual Address (VA) mappings can map to the same physical pages
>> +  of an object (aliasing).
>> +* VA mapping can map to a partial section of the BO (partial binding).
>> +* Support capture of persistent mappings in the dump upon GPU error.
>> +* TLB is flushed upon unbind completion. Batching of TLB flushes in some
>> +  use cases will be helpful.
>> +* Asynchronous vm_bind and vm_unbind support with 'in' and 'out' fences.
>> +* Support for userptr gem objects (no special uapi is required for this).
>> +
>> +Execbuff ioctl in VM_BIND mode
>> +-------------------------------
>> +The execbuff ioctl handling in VM_BIND mode differs significantly from the
>> +older method. A VM in VM_BIND mode will not support older execbuff mode of
>> +binding. In VM_BIND mode, execbuff ioctl will not accept any execlist. Hence,
>> +no support for implicit sync. It is expected that the below work will be able
>> +to support requirements of object dependency setting in all use cases:
>> +
>> +"dma-buf: Add an API for exporting sync files"
>> +(https://lwn.net/Articles/859290/)
>> +
>> +This also means, we need an execbuff extension to pass in the batch
>> +buffer addresses (See struct
>> drm_i915_gem_execbuffer_ext_batch_addresses).
>> +
>> +If at all execlist support in execbuff ioctl is deemed necessary for
>> +implicit sync in certain use cases, then support can be added later.
>> +
>> +In VM_BIND mode, VA allocation is completely managed by the user instead of
>> +the i915 driver. Hence all VA assignment, eviction are not applicable in
>> +VM_BIND mode. Also, for determining object activeness, VM_BIND mode will
>> not
>> +be using the i915_vma active reference tracking. It will instead use dma-resv
>> +object for that (See `VM_BIND dma_resv usage`_).
>> +
>> +So, a lot of existing code in the execbuff path like relocations, VA evictions,
>> +vma lookup table, implicit sync, vma active reference tracking etc., are not
>> +applicable in VM_BIND mode. Hence, the execbuff path needs to be cleaned up
>> +by clearly separating out the functionalities where the VM_BIND mode differs
>> +from older method and they should be moved to separate files.
>> +
>> +VM_PRIVATE objects
>> +-------------------
>> +By default, BOs can be mapped on multiple VMs and can also be dma-buf
>> +exported. Hence these BOs are referred to as Shared BOs.
>> +During each execbuff submission, the request fence must be added to the
>> +dma-resv fence list of all shared BOs mapped on the VM.
>> +
>> +VM_BIND feature introduces an optimization where user can create BO which
>> +is private to a specified VM via I915_GEM_CREATE_EXT_VM_PRIVATE flag
>> during
>> +BO creation. Unlike Shared BOs, these VM private BOs can only be mapped on
>> +the VM they are private to and can't be dma-buf exported.
>> +All private BOs of a VM share the dma-resv object. Hence during each execbuff
>> +submission, they need only one dma-resv fence list updated. Thus, the fast
>> +path (where required mappings are already bound) submission latency is O(1)
>> +w.r.t the number of VM private BOs.
>> +
>> +VM_BIND locking hirarchy
>> +-------------------------
>> +The locking design here supports the older (execlist based) execbuff mode, the
>> +newer VM_BIND mode, the VM_BIND mode with GPU page faults and possible
>> future
>> +system allocator support (See `Shared Virtual Memory (SVM) support`_).
>> +The older execbuff mode and the newer VM_BIND mode without page faults
>> manages
>> +residency of backing storage using dma_fence. The VM_BIND mode with page
>> faults
>> +and the system allocator support do not use any dma_fence at all.
>> +
>> +VM_BIND locking order is as below.
>> +
>> +1) Lock-A: A vm_bind mutex will protect vm_bind lists. This lock is taken in
>> +   vm_bind/vm_unbind ioctl calls, in the execbuff path and while releasing the
>> +   mapping.
>> +
>> +   In future, when GPU page faults are supported, we can potentially use a
>> +   rwsem instead, so that multiple page fault handlers can take the read side
>> +   lock to lookup the mapping and hence can run in parallel.
>> +   The older execbuff mode of binding do not need this lock.
>> +
>> +2) Lock-B: The object's dma-resv lock will protect i915_vma state and needs to
>> +   be held while binding/unbinding a vma in the async worker and while updating
>> +   dma-resv fence list of an object. Note that private BOs of a VM will all
>> +   share a dma-resv object.
>> +
>> +   The future system allocator support will use the HMM prescribed locking
>> +   instead.
>> +
>> +3) Lock-C: Spinlock/s to protect some of the VM's lists like the list of
>> +   invalidated vmas (due to eviction and userptr invalidation) etc.
>> +
>> +When GPU page faults are supported, the execbuff path do not take any of
>> these
>> +locks. There we will simply smash the new batch buffer address into the ring
>> and
>> +then tell the scheduler run that. The lock taking only happens from the page
>> +fault handler, where we take lock-A in read mode, whichever lock-B we need to
>> +find the backing storage (dma_resv lock for gem objects, and hmm/core mm for
>> +system allocator) and some additional locks (lock-D) for taking care of page
>> +table races. Page fault mode should not need to ever manipulate the vm lists,
>> +so won't ever need lock-C.
>> +
>> +VM_BIND LRU handling
>> +---------------------
>> +We need to ensure VM_BIND mapped objects are properly LRU tagged to avoid
>> +performance degradation. We will also need support for bulk LRU movement of
>> +VM_BIND objects to avoid additional latencies in execbuff path.
>> +
>> +The page table pages are similar to VM_BIND mapped objects (See
>> +`Evictable page table allocations`_) and are maintained per VM and needs to
>> +be pinned in memory when VM is made active (ie., upon an execbuff call with
>> +that VM). So, bulk LRU movement of page table pages is also needed.
>> +
>> +The i915 shrinker LRU has stopped being an LRU. So, it should also be moved
>> +over to the ttm LRU in some fashion to make sure we once again have a
>> reasonable
>> +and consistent memory aging and reclaim architecture.
>> +
>> +VM_BIND dma_resv usage
>> +-----------------------
>> +Fences needs to be added to all VM_BIND mapped objects. During each
>> execbuff
>> +submission, they are added with DMA_RESV_USAGE_BOOKKEEP usage to
>> prevent
>> +over sync (See enum dma_resv_usage). One can override it with either
>> +DMA_RESV_USAGE_READ or DMA_RESV_USAGE_WRITE usage during object
>> dependency
>> +setting (either through explicit or implicit mechanism).
>> +
>> +When vm_bind is called for a non-private object while the VM is already
>> +active, the fences need to be copied from VM's shared dma-resv object
>> +(common to all private objects of the VM) to this non-private object.
>> +If this results in performance degradation, then some optimization will
>> +be needed here. This is not a problem for VM's private objects as they use
>> +shared dma-resv object which is always updated on each execbuff submission.
>> +
>> +Also, in VM_BIND mode, use dma-resv apis for determining object activeness
>> +(See dma_resv_test_signaled() and dma_resv_wait_timeout()) and do not use
>> the
>> +older i915_vma active reference tracking which is deprecated. This should be
>> +easier to get it working with the current TTM backend. We can remove the
>> +i915_vma active reference tracking fully while supporting TTM backend for igfx.
>> +
>> +Evictable page table allocations
>> +---------------------------------
>> +Make pagetable allocations evictable and manage them similar to VM_BIND
>> +mapped objects. Page table pages are similar to persistent mappings of a
>> +VM (difference here are that the page table pages will not have an i915_vma
>> +structure and after swapping pages back in, parent page link needs to be
>> +updated).
>> +
>> +Mesa use case
>> +--------------
>> +VM_BIND can potentially reduce the CPU overhead in Mesa (both Vulkan and
>> Iris),
>> +hence improving performance of CPU-bound applications. It also allows us to
>> +implement Vulkan's Sparse Resources. With increasing GPU hardware
>> performance,
>> +reducing CPU overhead becomes more impactful.
>> +
>> +
>> +VM_BIND Compute support
>> +========================
>> +
>> +User/Memory Fence
>> +------------------
>> +The idea is to take a user specified virtual address and install an interrupt
>> +handler to wake up the current task when the memory location passes the user
>> +supplied filter. User/Memory fence is a <address, value> pair. To signal the
>> +user fence, specified value will be written at the specified virtual address
>> +and wakeup the waiting process. User can wait on a user fence with the
>> +gem_wait_user_fence ioctl.
>> +
>> +It also allows the user to emit their own MI_FLUSH/PIPE_CONTROL notify
>> +interrupt within their batches after updating the value to have sub-batch
>> +precision on the wakeup. Each batch can signal a user fence to indicate
>> +the completion of next level batch. The completion of very first level batch
>> +needs to be signaled by the command streamer. The user must provide the
>> +user/memory fence for this via the
>> DRM_I915_GEM_EXECBUFFER_EXT_USER_FENCE
>> +extension of execbuff ioctl, so that KMD can setup the command streamer to
>> +signal it.
>> +
>> +User/Memory fence can also be supplied to the kernel driver to signal/wake up
>> +the user process after completion of an asynchronous operation.
>> +
>> +When VM_BIND ioctl was provided with a user/memory fence via the
>> +I915_VM_BIND_EXT_USER_FENCE extension, it will be signaled upon the
>> completion
>> +of binding of that mapping. All async binds/unbinds are serialized, hence
>> +signaling of user/memory fence also indicate the completion of all previous
>> +binds/unbinds.
>> +
>> +This feature will be derived from the below original work:
>> +https://patchwork.freedesktop.org/patch/349417/
>> +
>> +Long running Compute contexts
>> +------------------------------
>> +Usage of dma-fence expects that they complete in reasonable amount of time.
>> +Compute on the other hand can be long running. Hence it is appropriate for
>> +compute to use user/memory fence and dma-fence usage will be limited to
>> +in-kernel consumption only. This requires an execbuff uapi extension to pass
>> +in user fence (See struct drm_i915_vm_bind_ext_user_fence). Compute must
>> opt-in
>> +for this mechanism with I915_CONTEXT_CREATE_FLAGS_LONG_RUNNING flag
>> during
>> +context creation. The dma-fence based user interfaces like gem_wait ioctl and
>> +execbuff out fence are not allowed on long running contexts. Implicit sync is
>> +not valid as well and is anyway not supported in VM_BIND mode.
>> +
>> +Where GPU page faults are not available, kernel driver upon buffer invalidation
>> +will initiate a suspend (preemption) of long running context with a dma-fence
>> +attached to it. And upon completion of that suspend fence, finish the
>> +invalidation, revalidate the BO and then resume the compute context. This is
>> +done by having a per-context preempt fence (also called suspend fence)
>> proxying
>> +as i915_request fence. This suspend fence is enabled when someone tries to
>> wait
>> +on it, which then triggers the context preemption.
>> +
>> +As this support for context suspension using a preempt fence and the resume
>> work
>> +for the compute mode contexts can get tricky to get it right, it is better to
>> +add this support in drm scheduler so that multiple drivers can make use of it.
>> +That means, it will have a dependency on i915 drm scheduler conversion with
>> GuC
>> +scheduler backend. This should be fine, as the plan is to support compute mode
>> +contexts only with GuC scheduler backend (at least initially). This is much
>> +easier to support with VM_BIND mode compared to the current heavier
>> execbuff
>> +path resource attachment.
>> +
>> +Low Latency Submission
>> +-----------------------
>> +Allows compute UMD to directly submit GPU jobs instead of through execbuff
>> +ioctl. This is made possible by VM_BIND is not being synchronized against
>> +execbuff. VM_BIND allows bind/unbind of mappings required for the directly
>> +submitted jobs.
>> +
>> +Other VM_BIND use cases
>> +========================
>> +
>> +Debugger
>> +---------
>> +With debug event interface user space process (debugger) is able to keep track
>> +of and act upon resources created by another process (debugged) and attached
>> +to GPU via vm_bind interface.
>> +
>> +GPU page faults
>> +----------------
>> +GPU page faults when supported (in future), will only be supported in the
>> +VM_BIND mode. While both the older execbuff mode and the newer VM_BIND
>> mode of
>> +binding will require using dma-fence to ensure residency, the GPU page faults
>> +mode when supported, will not use any dma-fence as residency is purely
>> managed
>> +by installing and removing/invalidating page table entries.
>> +
>> +Page level hints settings
>> +--------------------------
>> +VM_BIND allows any hints setting per mapping instead of per BO.
>> +Possible hints include read-only mapping, placement and atomicity.
>> +Sub-BO level placement hint will be even more relevant with
>> +upcoming GPU on-demand page fault support.
>> +
>> +Page level Cache/CLOS settings
>> +-------------------------------
>> +VM_BIND allows cache/CLOS settings per mapping instead of per BO.
>> +
>> +Shared Virtual Memory (SVM) support
>> +------------------------------------
>> +VM_BIND interface can be used to map system memory directly (without gem
>> BO
>> +abstraction) using the HMM interface. SVM is only supported with GPU page
>> +faults enabled.
>> +
>> +
>> +Broder i915 cleanups
>> +=====================
>> +Supporting this whole new vm_bind mode of binding which comes with its own
>> +use cases to support and the locking requirements requires proper integration
>> +with the existing i915 driver. This calls for some broader i915 driver
>> +cleanups/simplifications for maintainability of the driver going forward.
>> +Here are few things identified and are being looked into.
>> +
>> +- Remove vma lookup cache (eb->gem_context->handles_vma). VM_BIND
>> feature
>> +  do not use it and complexity it brings in is probably more than the
>> +  performance advantage we get in legacy execbuff case.
>> +- Remove vma->open_count counting
>> +- Remove i915_vma active reference tracking. VM_BIND feature will not be
>> using
>> +  it. Instead use underlying BO's dma-resv fence list to determine if a i915_vma
>> +  is active or not.
>> +
>> +
>> +VM_BIND UAPI
>> +=============
>> +
>> +.. kernel-doc:: Documentation/gpu/rfc/i915_vm_bind.h
>> diff --git a/Documentation/gpu/rfc/index.rst b/Documentation/gpu/rfc/index.rst
>> index 91e93a705230..7d10c36b268d 100644
>> --- a/Documentation/gpu/rfc/index.rst
>> +++ b/Documentation/gpu/rfc/index.rst
>> @@ -23,3 +23,7 @@ host such documentation:
>>  .. toctree::
>>
>>      i915_scheduler.rst
>> +
>> +.. toctree::
>> +
>> +    i915_vm_bind.rst
>> --
>> 2.21.0.rc0.32.g243a4c7e27
>

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [Intel-gfx] [RFC v3 3/3] drm/doc/rfc: VM_BIND uapi definition
  2022-06-02  5:08             ` Niranjana Vishwanathapura
  (?)
@ 2022-06-03  6:53             ` Niranjana Vishwanathapura
  2022-06-07 10:42               ` Tvrtko Ursulin
                                 ` (2 more replies)
  -1 siblings, 3 replies; 121+ messages in thread
From: Niranjana Vishwanathapura @ 2022-06-03  6:53 UTC (permalink / raw)
  To: Daniel Vetter
  Cc: Zanoni, Paulo R, intel-gfx, dri-devel, Hellstrom, Thomas, Wilson,
	Chris P, Vetter, Daniel, christian.koenig

On Wed, Jun 01, 2022 at 10:08:35PM -0700, Niranjana Vishwanathapura wrote:
>On Wed, Jun 01, 2022 at 11:27:17AM +0200, Daniel Vetter wrote:
>>On Wed, 1 Jun 2022 at 11:03, Dave Airlie <airlied@gmail.com> wrote:
>>>
>>>On Tue, 24 May 2022 at 05:20, Niranjana Vishwanathapura
>>><niranjana.vishwanathapura@intel.com> wrote:
>>>>
>>>> On Thu, May 19, 2022 at 04:07:30PM -0700, Zanoni, Paulo R wrote:
>>>> >On Tue, 2022-05-17 at 11:32 -0700, Niranjana Vishwanathapura wrote:
>>>> >> VM_BIND and related uapi definitions
>>>> >>
>>>> >> v2: Ensure proper kernel-doc formatting with cross references.
>>>> >>     Also add new uapi and documentation as per review comments
>>>> >>     from Daniel.
>>>> >>
>>>> >> Signed-off-by: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>
>>>> >> ---
>>>> >>  Documentation/gpu/rfc/i915_vm_bind.h | 399 +++++++++++++++++++++++++++
>>>> >>  1 file changed, 399 insertions(+)
>>>> >>  create mode 100644 Documentation/gpu/rfc/i915_vm_bind.h
>>>> >>
>>>> >> diff --git a/Documentation/gpu/rfc/i915_vm_bind.h b/Documentation/gpu/rfc/i915_vm_bind.h
>>>> >> new file mode 100644
>>>> >> index 000000000000..589c0a009107
>>>> >> --- /dev/null
>>>> >> +++ b/Documentation/gpu/rfc/i915_vm_bind.h
>>>> >> @@ -0,0 +1,399 @@
>>>> >> +/* SPDX-License-Identifier: MIT */
>>>> >> +/*
>>>> >> + * Copyright © 2022 Intel Corporation
>>>> >> + */
>>>> >> +
>>>> >> +/**
>>>> >> + * DOC: I915_PARAM_HAS_VM_BIND
>>>> >> + *
>>>> >> + * VM_BIND feature availability.
>>>> >> + * See typedef drm_i915_getparam_t param.
>>>> >> + */
>>>> >> +#define I915_PARAM_HAS_VM_BIND               57
>>>> >> +
>>>> >> +/**
>>>> >> + * DOC: I915_VM_CREATE_FLAGS_USE_VM_BIND
>>>> >> + *
>>>> >> + * Flag to opt-in for VM_BIND mode of binding during VM creation.
>>>> >> + * See struct drm_i915_gem_vm_control flags.
>>>> >> + *
>>>> >> + * A VM in VM_BIND mode will not support the older execbuff mode of binding.
>>>> >> + * In VM_BIND mode, execbuff ioctl will not accept any execlist (ie., the
>>>> >> + * &drm_i915_gem_execbuffer2.buffer_count must be 0).
>>>> >> + * Also, &drm_i915_gem_execbuffer2.batch_start_offset and
>>>> >> + * &drm_i915_gem_execbuffer2.batch_len must be 0.
>>>> >> + * DRM_I915_GEM_EXECBUFFER_EXT_BATCH_ADDRESSES extension must be provided
>>>> >> + * to pass in the batch buffer addresses.
>>>> >> + *
>>>> >> + * Additionally, I915_EXEC_NO_RELOC, I915_EXEC_HANDLE_LUT and
>>>> >> + * I915_EXEC_BATCH_FIRST of &drm_i915_gem_execbuffer2.flags must be 0
>>>> >> + * (not used) in VM_BIND mode. I915_EXEC_USE_EXTENSIONS flag must always be
>>>> >> + * set (See struct drm_i915_gem_execbuffer_ext_batch_addresses).
>>>> >> + * The buffers_ptr, buffer_count, batch_start_offset and batch_len fields
>>>> >> + * of struct drm_i915_gem_execbuffer2 are also not used and must be 0.
>>>> >> + */
>>>> >
>>>> >From that description, it seems we have:
>>>> >
>>>> >struct drm_i915_gem_execbuffer2 {
>>>> >        __u64 buffers_ptr;              -> must be 0 (new)
>>>> >        __u32 buffer_count;             -> must be 0 (new)
>>>> >        __u32 batch_start_offset;       -> must be 0 (new)
>>>> >        __u32 batch_len;                -> must be 0 (new)
>>>> >        __u32 DR1;                      -> must be 0 (old)
>>>> >        __u32 DR4;                      -> must be 0 (old)
>>>> >        __u32 num_cliprects; (fences)   -> must be 0 since using extensions
>>>> >        __u64 cliprects_ptr; (fences, extensions) -> contains an actual pointer!
>>>> >        __u64 flags;                    -> some flags must be 0 (new)
>>>> >        __u64 rsvd1; (context info)     -> repurposed field (old)
>>>> >        __u64 rsvd2;                    -> unused
>>>> >};
>>>> >
>>>> >Based on that, why can't we just get drm_i915_gem_execbuffer3 instead
>>>> >of adding even more complexity to an already abused interface? While
>>>> >the Vulkan-like extension thing is really nice, I don't think what
>>>> >we're doing here is extending the ioctl usage, we're completely
>>>> >changing how the base struct should be interpreted based on how the VM
>>>> >was created (which is an entirely different ioctl).
>>>> >
>>>> >From Rusty Russel's API Design grading, drm_i915_gem_execbuffer2 is
>>>> >already at -6 without these changes. I think after vm_bind we'll need
>>>> >to create a -11 entry just to deal with this ioctl.
>>>> >
>>>>
>>>> The only change here is removing the execlist support for VM_BIND
>>>> mode (other than natual extensions).
>>>> Adding a new execbuffer3 was considered, but I think we need to be careful
>>>> with that as that goes beyond the VM_BIND support, including any future
>>>> requirements (as we don't want an execbuffer4 after VM_BIND).
>>>
>>>Why not? it's not like adding extensions here is really that different
>>>than adding new ioctls.
>>>
>>>I definitely think this deserves an execbuffer3 without even
>>>considering future requirements. Just  to burn down the old
>>>requirements and pointless fields.
>>>
>>>Make execbuffer3 be vm bind only, no relocs, no legacy bits, leave the
>>>older sw on execbuf2 for ever.
>>
>>I guess another point in favour of execbuf3 would be that it's less
>>midlayer. If we share the entry point then there's quite a few vfuncs
>>needed to cleanly split out the vm_bind paths from the legacy
>>reloc/softping paths.
>>
>>If we invert this and do execbuf3, then there's the existing ioctl
>>vfunc, and then we share code (where it even makes sense, probably
>>request setup/submit need to be shared, anything else is probably
>>cleaner to just copypaste) with the usual helper approach.
>>
>>Also that would guarantee that really none of the old concepts like
>>i915_active on the vma or vma open counts and all that stuff leaks
>>into the new vm_bind execbuf.
>>
>>Finally I also think that copypasting would make backporting easier,
>>or at least more flexible, since it should make it easier to have the
>>upstream vm_bind co-exist with all the other things we have. Without
>>huge amounts of conflicts (or at least much less) that pushing a pile
>>of vfuncs into the existing code would cause.
>>
>>So maybe we should do this?
>
>Thanks Dave, Daniel.
>There are a few things that will be common between execbuf2 and
>execbuf3, like request setup/submit (as you said), fence handling 
>(timeline fences, fence array, composite fences), engine selection,
>etc. Also, many of the 'flags' will be there in execbuf3 also (but
>bit position will differ).
>But I guess these should be fine as the suggestion here is to
>copy-paste the execbuff code and having a shared code where possible.
>Besides, we can stop supporting some older feature in execbuff3
>(like fence array in favor of newer timeline fences), which will
>further reduce common code.
>
>Ok, I will update this series by adding execbuf3 and send out soon.
>

Does this sound reasonable?

struct drm_i915_gem_execbuffer3 {
        __u32 ctx_id;		/* previously execbuffer2.rsvd1 */

        __u32 batch_count;
        __u64 batch_addr_ptr;	/* Pointer to an array of batch gpu virtual addresses */

        __u64 flags;
#define I915_EXEC3_RING_MASK              (0x3f)
#define I915_EXEC3_DEFAULT                (0<<0)
#define I915_EXEC3_RENDER                 (1<<0)
#define I915_EXEC3_BSD                    (2<<0)
#define I915_EXEC3_BLT                    (3<<0)
#define I915_EXEC3_VEBOX                  (4<<0)

#define I915_EXEC3_SECURE               (1<<6)
#define I915_EXEC3_IS_PINNED            (1<<7)

#define I915_EXEC3_BSD_SHIFT     (8)
#define I915_EXEC3_BSD_MASK      (3 << I915_EXEC3_BSD_SHIFT)
#define I915_EXEC3_BSD_DEFAULT   (0 << I915_EXEC3_BSD_SHIFT)
#define I915_EXEC3_BSD_RING1     (1 << I915_EXEC3_BSD_SHIFT)
#define I915_EXEC3_BSD_RING2     (2 << I915_EXEC3_BSD_SHIFT)

#define I915_EXEC3_FENCE_IN             (1<<10)
#define I915_EXEC3_FENCE_OUT            (1<<11)
#define I915_EXEC3_FENCE_SUBMIT         (1<<12)

        __u64 in_out_fence;		/* previously execbuffer2.rsvd2 */

        __u64 extensions;		/* currently only for DRM_I915_GEM_EXECBUFFER_EXT_TIMELINE_FENCES */
};

With this, user can pass in batch addresses and count directly,
instead of as an extension (as this rfc series was proposing).

I have removed many of the flags which were either legacy or not
applicable to BM_BIND mode.
I have also removed fence array support (execbuffer2.cliprects_ptr)
as we have timeline fence array support. Is that fine?
Do we still need FENCE_IN/FENCE_OUT/FENCE_SUBMIT support?

Any thing else needs to be added or removed?

Niranjana

>Niranjana
>
>>-Daniel
>>-- 
>>Daniel Vetter
>>Software Engineer, Intel Corporation
>>http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [Intel-gfx] [RFC v3 1/3] drm/doc/rfc: VM_BIND feature design document
  2022-06-02 20:35           ` Jason Ekstrand
  (?)
@ 2022-06-03  7:20           ` Lionel Landwerlin
  2022-06-03 23:51               ` Niranjana Vishwanathapura
  -1 siblings, 1 reply; 121+ messages in thread
From: Lionel Landwerlin @ 2022-06-03  7:20 UTC (permalink / raw)
  To: Jason Ekstrand, Niranjana Vishwanathapura
  Cc: Intel GFX, Chris Wilson, Thomas Hellstrom,
	Maling list - DRI developers, Daniel Vetter,
	Christian König

[-- Attachment #1: Type: text/plain, Size: 9238 bytes --]

On 02/06/2022 23:35, Jason Ekstrand wrote:
> On Thu, Jun 2, 2022 at 3:11 PM Niranjana Vishwanathapura 
> <niranjana.vishwanathapura@intel.com> wrote:
>
>     On Wed, Jun 01, 2022 at 01:28:36PM -0700, Matthew Brost wrote:
>     >On Wed, Jun 01, 2022 at 05:25:49PM +0300, Lionel Landwerlin wrote:
>     >> On 17/05/2022 21:32, Niranjana Vishwanathapura wrote:
>     >> > +VM_BIND/UNBIND ioctl will immediately start
>     binding/unbinding the mapping in an
>     >> > +async worker. The binding and unbinding will work like a
>     special GPU engine.
>     >> > +The binding and unbinding operations are serialized and will
>     wait on specified
>     >> > +input fences before the operation and will signal the output
>     fences upon the
>     >> > +completion of the operation. Due to serialization,
>     completion of an operation
>     >> > +will also indicate that all previous operations are also
>     complete.
>     >>
>     >> I guess we should avoid saying "will immediately start
>     binding/unbinding" if
>     >> there are fences involved.
>     >>
>     >> And the fact that it's happening in an async worker seem to
>     imply it's not
>     >> immediate.
>     >>
>
>     Ok, will fix.
>     This was added because in earlier design binding was deferred
>     until next execbuff.
>     But now it is non-deferred (immediate in that sense). But yah,
>     this is confusing
>     and will fix it.
>
>     >>
>     >> I have a question on the behavior of the bind operation when no
>     input fence
>     >> is provided. Let say I do :
>     >>
>     >> VM_BIND (out_fence=fence1)
>     >>
>     >> VM_BIND (out_fence=fence2)
>     >>
>     >> VM_BIND (out_fence=fence3)
>     >>
>     >>
>     >> In what order are the fences going to be signaled?
>     >>
>     >> In the order of VM_BIND ioctls? Or out of order?
>     >>
>     >> Because you wrote "serialized I assume it's : in order
>     >>
>
>     Yes, in the order of VM_BIND/UNBIND ioctls. Note that bind and
>     unbind will use
>     the same queue and hence are ordered.
>
>     >>
>     >> One thing I didn't realize is that because we only get one
>     "VM_BIND" engine,
>     >> there is a disconnect from the Vulkan specification.
>     >>
>     >> In Vulkan VM_BIND operations are serialized but per engine.
>     >>
>     >> So you could have something like this :
>     >>
>     >> VM_BIND (engine=rcs0, in_fence=fence1, out_fence=fence2)
>     >>
>     >> VM_BIND (engine=ccs0, in_fence=fence3, out_fence=fence4)
>     >>
>     >>
>     >> fence1 is not signaled
>     >>
>     >> fence3 is signaled
>     >>
>     >> So the second VM_BIND will proceed before the first VM_BIND.
>     >>
>     >>
>     >> I guess we can deal with that scenario in userspace by doing
>     the wait
>     >> ourselves in one thread per engines.
>     >>
>     >> But then it makes the VM_BIND input fences useless.
>     >>
>     >>
>     >> Daniel : what do you think? Should be rework this or just deal
>     with wait
>     >> fences in userspace?
>     >>
>     >
>     >My opinion is rework this but make the ordering via an engine
>     param optional.
>     >
>     >e.g. A VM can be configured so all binds are ordered within the VM
>     >
>     >e.g. A VM can be configured so all binds accept an engine
>     argument (in
>     >the case of the i915 likely this is a gem context handle) and binds
>     >ordered with respect to that engine.
>     >
>     >This gives UMDs options as the later likely consumes more KMD
>     resources
>     >so if a different UMD can live with binds being ordered within the VM
>     >they can use a mode consuming less resources.
>     >
>
>     I think we need to be careful here if we are looking for some out of
>     (submission) order completion of vm_bind/unbind.
>     In-order completion means, in a batch of binds and unbinds to be
>     completed in-order, user only needs to specify in-fence for the
>     first bind/unbind call and the our-fence for the last bind/unbind
>     call. Also, the VA released by an unbind call can be re-used by
>     any subsequent bind call in that in-order batch.
>
>     These things will break if binding/unbinding were to be allowed to
>     go out of order (of submission) and user need to be extra careful
>     not to run into pre-mature triggereing of out-fence and bind failing
>     as VA is still in use etc.
>
>     Also, VM_BIND binds the provided mapping on the specified address
>     space
>     (VM). So, the uapi is not engine/context specific.
>
>     We can however add a 'queue' to the uapi which can be one from the
>     pre-defined queues,
>     I915_VM_BIND_QUEUE_0
>     I915_VM_BIND_QUEUE_1
>     ...
>     I915_VM_BIND_QUEUE_(N-1)
>
>     KMD will spawn an async work queue for each queue which will only
>     bind the mappings on that queue in the order of submission.
>     User can assign the queue to per engine or anything like that.
>
>     But again here, user need to be careful and not deadlock these
>     queues with circular dependency of fences.
>
>     I prefer adding this later an as extension based on whether it
>     is really helping with the implementation.
>
>
> I can tell you right now that having everything on a single in-order 
> queue will not get us the perf we want. What vulkan really wants is 
> one of two things:
>
>  1. No implicit ordering of VM_BIND ops.  They just happen in whatever 
> their dependencies are resolved and we ensure ordering ourselves by 
> having a syncobj in the VkQueue.
>
>  2. The ability to create multiple VM_BIND queues.  We need at least 2 
> but I don't see why there needs to be a limit besides the limits the 
> i915 API already has on the number of engines.  Vulkan could expose 
> multiple sparse binding queues to the client if it's not arbitrarily 
> limited.
>
> Why?  Because Vulkan has two basic kind of bind operations and we 
> don't want any dependencies between them:
>
>  1. Immediate.  These happen right after BO creation or maybe as part 
> of vkBindImageMemory() or VkBindBufferMemory().  These don't happen on 
> a queue and we don't want them serialized with anything.  To 
> synchronize with submit, we'll have a syncobj in the VkDevice which is 
> signaled by all immediate bind operations and make submits wait on it.
>
>  2. Queued (sparse): These happen on a VkQueue which may be the same 
> as a render/compute queue or may be its own queue.  It's up to us what 
> we want to advertise.  From the Vulkan API PoV, this is like any other 
> queue.  Operations on it wait on and signal semaphores.  If we have a 
> VM_BIND engine, we'd provide syncobjs to wait and signal just like we 
> do in execbuf().
>
> The important thing is that we don't want one type of operation to 
> block on the other.  If immediate binds are blocking on sparse binds, 
> it's going to cause over-synchronization issues.
>
> In terms of the internal implementation, I know that there's going to 
> be a lock on the VM and that we can't actually do these things in 
> parallel.  That's fine.  Once the dma_fences have signaled and we're 
> unblocked to do the bind operation, I don't care if there's a bit of 
> synchronization due to locking.  That's expected.  What we can't 
> afford to have is an immediate bind operation suddenly blocking on a 
> sparse operation which is blocked on a compute job that's going to run 
> for another 5ms.
>
> For reference, Windows solves this by allowing arbitrarily many paging 
> queues (what they call a VM_BIND engine/queue).  That design works 
> pretty well and solves the problems in question.  Again, we could just 
> make everything out-of-order and require using syncobjs to order 
> things as userspace wants. That'd be fine too.
>
> One more note while I'm here: danvet said something on IRC about 
> VM_BIND queues waiting for syncobjs to materialize.  We don't really 
> want/need this.  We already have all the machinery in userspace to 
> handle wait-before-signal and waiting for syncobj fences to 
> materialize and that machinery is on by default.  It would actually 
> take MORE work in Mesa to turn it off and take advantage of the kernel 
> being able to wait for syncobjs to materialize.  Also, getting that 
> right is ridiculously hard and I really don't want to get it wrong in 
> kernel space. When we do memory fences, wait-before-signal will be a 
> thing.  We don't need to try and make it a thing for syncobj.
>
> --Jason


Thanks Jason,


I missed the bit in the Vulkan spec that we're allowed to have a sparse 
queue that does not implement either graphics or compute operations :

    "While some implementations may include VK_QUEUE_SPARSE_BINDING_BIT
    support in queue families that also include

      graphics and compute support, other implementations may only
    expose a VK_QUEUE_SPARSE_BINDING_BIT-only queue

      family."


So it can all be all a vm_bind engine that just does bind/unbind 
operations.


But yes we need another engine for the immediate/non-sparse operations.


-Lionel


>     Daniel, any thoughts?
>
>     Niranjana
>
>     >Matt
>     >
>     >>
>     >> Sorry I noticed this late.
>     >>
>     >>
>     >> -Lionel
>     >>
>     >>
>

[-- Attachment #2: Type: text/html, Size: 14209 bytes --]

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [Intel-gfx] [RFC v3 1/3] drm/doc/rfc: VM_BIND feature design document
  2022-06-03  7:20           ` Lionel Landwerlin
@ 2022-06-03 23:51               ` Niranjana Vishwanathapura
  0 siblings, 0 replies; 121+ messages in thread
From: Niranjana Vishwanathapura @ 2022-06-03 23:51 UTC (permalink / raw)
  To: Lionel Landwerlin
  Cc: Intel GFX, Maling list - DRI developers, Thomas Hellstrom,
	Chris Wilson, Jason Ekstrand, Daniel Vetter,
	Christian König

On Fri, Jun 03, 2022 at 10:20:25AM +0300, Lionel Landwerlin wrote:
>   On 02/06/2022 23:35, Jason Ekstrand wrote:
>
>     On Thu, Jun 2, 2022 at 3:11 PM Niranjana Vishwanathapura
>     <niranjana.vishwanathapura@intel.com> wrote:
>
>       On Wed, Jun 01, 2022 at 01:28:36PM -0700, Matthew Brost wrote:
>       >On Wed, Jun 01, 2022 at 05:25:49PM +0300, Lionel Landwerlin wrote:
>       >> On 17/05/2022 21:32, Niranjana Vishwanathapura wrote:
>       >> > +VM_BIND/UNBIND ioctl will immediately start binding/unbinding
>       the mapping in an
>       >> > +async worker. The binding and unbinding will work like a special
>       GPU engine.
>       >> > +The binding and unbinding operations are serialized and will
>       wait on specified
>       >> > +input fences before the operation and will signal the output
>       fences upon the
>       >> > +completion of the operation. Due to serialization, completion of
>       an operation
>       >> > +will also indicate that all previous operations are also
>       complete.
>       >>
>       >> I guess we should avoid saying "will immediately start
>       binding/unbinding" if
>       >> there are fences involved.
>       >>
>       >> And the fact that it's happening in an async worker seem to imply
>       it's not
>       >> immediate.
>       >>
>
>       Ok, will fix.
>       This was added because in earlier design binding was deferred until
>       next execbuff.
>       But now it is non-deferred (immediate in that sense). But yah, this is
>       confusing
>       and will fix it.
>
>       >>
>       >> I have a question on the behavior of the bind operation when no
>       input fence
>       >> is provided. Let say I do :
>       >>
>       >> VM_BIND (out_fence=fence1)
>       >>
>       >> VM_BIND (out_fence=fence2)
>       >>
>       >> VM_BIND (out_fence=fence3)
>       >>
>       >>
>       >> In what order are the fences going to be signaled?
>       >>
>       >> In the order of VM_BIND ioctls? Or out of order?
>       >>
>       >> Because you wrote "serialized I assume it's : in order
>       >>
>
>       Yes, in the order of VM_BIND/UNBIND ioctls. Note that bind and unbind
>       will use
>       the same queue and hence are ordered.
>
>       >>
>       >> One thing I didn't realize is that because we only get one
>       "VM_BIND" engine,
>       >> there is a disconnect from the Vulkan specification.
>       >>
>       >> In Vulkan VM_BIND operations are serialized but per engine.
>       >>
>       >> So you could have something like this :
>       >>
>       >> VM_BIND (engine=rcs0, in_fence=fence1, out_fence=fence2)
>       >>
>       >> VM_BIND (engine=ccs0, in_fence=fence3, out_fence=fence4)
>       >>
>       >>
>       >> fence1 is not signaled
>       >>
>       >> fence3 is signaled
>       >>
>       >> So the second VM_BIND will proceed before the first VM_BIND.
>       >>
>       >>
>       >> I guess we can deal with that scenario in userspace by doing the
>       wait
>       >> ourselves in one thread per engines.
>       >>
>       >> But then it makes the VM_BIND input fences useless.
>       >>
>       >>
>       >> Daniel : what do you think? Should be rework this or just deal with
>       wait
>       >> fences in userspace?
>       >>
>       >
>       >My opinion is rework this but make the ordering via an engine param
>       optional.
>       >
>       >e.g. A VM can be configured so all binds are ordered within the VM
>       >
>       >e.g. A VM can be configured so all binds accept an engine argument
>       (in
>       >the case of the i915 likely this is a gem context handle) and binds
>       >ordered with respect to that engine.
>       >
>       >This gives UMDs options as the later likely consumes more KMD
>       resources
>       >so if a different UMD can live with binds being ordered within the VM
>       >they can use a mode consuming less resources.
>       >
>
>       I think we need to be careful here if we are looking for some out of
>       (submission) order completion of vm_bind/unbind.
>       In-order completion means, in a batch of binds and unbinds to be
>       completed in-order, user only needs to specify in-fence for the
>       first bind/unbind call and the our-fence for the last bind/unbind
>       call. Also, the VA released by an unbind call can be re-used by
>       any subsequent bind call in that in-order batch.
>
>       These things will break if binding/unbinding were to be allowed to
>       go out of order (of submission) and user need to be extra careful
>       not to run into pre-mature triggereing of out-fence and bind failing
>       as VA is still in use etc.
>
>       Also, VM_BIND binds the provided mapping on the specified address
>       space
>       (VM). So, the uapi is not engine/context specific.
>
>       We can however add a 'queue' to the uapi which can be one from the
>       pre-defined queues,
>       I915_VM_BIND_QUEUE_0
>       I915_VM_BIND_QUEUE_1
>       ...
>       I915_VM_BIND_QUEUE_(N-1)
>
>       KMD will spawn an async work queue for each queue which will only
>       bind the mappings on that queue in the order of submission.
>       User can assign the queue to per engine or anything like that.
>
>       But again here, user need to be careful and not deadlock these
>       queues with circular dependency of fences.
>
>       I prefer adding this later an as extension based on whether it
>       is really helping with the implementation.
>
>     I can tell you right now that having everything on a single in-order
>     queue will not get us the perf we want.  What vulkan really wants is one
>     of two things:
>      1. No implicit ordering of VM_BIND ops.  They just happen in whatever
>     their dependencies are resolved and we ensure ordering ourselves by
>     having a syncobj in the VkQueue.
>      2. The ability to create multiple VM_BIND queues.  We need at least 2
>     but I don't see why there needs to be a limit besides the limits the
>     i915 API already has on the number of engines.  Vulkan could expose
>     multiple sparse binding queues to the client if it's not arbitrarily
>     limited.

Thanks Jason, Lionel.

Jason, what are you referring to when you say "limits the i915 API already
has on the number of engines"? I am not sure if there is such an uapi today.

I am trying to see how many queues we need and don't want it to be arbitrarily
large and unduely blow up memory usage and complexity in i915 driver.

>     Why?  Because Vulkan has two basic kind of bind operations and we don't
>     want any dependencies between them:
>      1. Immediate.  These happen right after BO creation or maybe as part of
>     vkBindImageMemory() or VkBindBufferMemory().  These don't happen on a
>     queue and we don't want them serialized with anything.  To synchronize
>     with submit, we'll have a syncobj in the VkDevice which is signaled by
>     all immediate bind operations and make submits wait on it.
>      2. Queued (sparse): These happen on a VkQueue which may be the same as
>     a render/compute queue or may be its own queue.  It's up to us what we
>     want to advertise.  From the Vulkan API PoV, this is like any other
>     queue.  Operations on it wait on and signal semaphores.  If we have a
>     VM_BIND engine, we'd provide syncobjs to wait and signal just like we do
>     in execbuf().
>     The important thing is that we don't want one type of operation to block
>     on the other.  If immediate binds are blocking on sparse binds, it's
>     going to cause over-synchronization issues.
>     In terms of the internal implementation, I know that there's going to be
>     a lock on the VM and that we can't actually do these things in
>     parallel.  That's fine.  Once the dma_fences have signaled and we're

Thats correct. It is like a single VM_BIND engine with multiple queues
feeding to it.

>     unblocked to do the bind operation, I don't care if there's a bit of
>     synchronization due to locking.  That's expected.  What we can't afford
>     to have is an immediate bind operation suddenly blocking on a sparse
>     operation which is blocked on a compute job that's going to run for
>     another 5ms.

As the VM_BIND queue is per VM, VM_BIND on one VM doesn't block the VM_BIND
on other VMs. I am not sure about usecases here, but just wanted to clarify.

Niranjana

>     For reference, Windows solves this by allowing arbitrarily many paging
>     queues (what they call a VM_BIND engine/queue).  That design works
>     pretty well and solves the problems in question.  Again, we could just
>     make everything out-of-order and require using syncobjs to order things
>     as userspace wants. That'd be fine too.
>     One more note while I'm here: danvet said something on IRC about VM_BIND
>     queues waiting for syncobjs to materialize.  We don't really want/need
>     this.  We already have all the machinery in userspace to handle
>     wait-before-signal and waiting for syncobj fences to materialize and
>     that machinery is on by default.  It would actually take MORE work in
>     Mesa to turn it off and take advantage of the kernel being able to wait
>     for syncobjs to materialize.  Also, getting that right is ridiculously
>     hard and I really don't want to get it wrong in kernel space.  When we
>     do memory fences, wait-before-signal will be a thing.  We don't need to
>     try and make it a thing for syncobj.
>     --Jason
>
>   Thanks Jason,
>
>   I missed the bit in the Vulkan spec that we're allowed to have a sparse
>   queue that does not implement either graphics or compute operations :
>
>     "While some implementations may include VK_QUEUE_SPARSE_BINDING_BIT
>     support in queue families that also include
>
>      graphics and compute support, other implementations may only expose a
>     VK_QUEUE_SPARSE_BINDING_BIT-only queue
>
>      family."
>
>   So it can all be all a vm_bind engine that just does bind/unbind
>   operations.
>
>   But yes we need another engine for the immediate/non-sparse operations.
>
>   -Lionel
>
>      
>
>       Daniel, any thoughts?
>
>       Niranjana
>
>       >Matt
>       >
>       >>
>       >> Sorry I noticed this late.
>       >>
>       >>
>       >> -Lionel
>       >>
>       >>

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [Intel-gfx] [RFC v3 1/3] drm/doc/rfc: VM_BIND feature design document
@ 2022-06-03 23:51               ` Niranjana Vishwanathapura
  0 siblings, 0 replies; 121+ messages in thread
From: Niranjana Vishwanathapura @ 2022-06-03 23:51 UTC (permalink / raw)
  To: Lionel Landwerlin
  Cc: Intel GFX, Maling list - DRI developers, Thomas Hellstrom,
	Chris Wilson, Daniel Vetter, Christian König

On Fri, Jun 03, 2022 at 10:20:25AM +0300, Lionel Landwerlin wrote:
>   On 02/06/2022 23:35, Jason Ekstrand wrote:
>
>     On Thu, Jun 2, 2022 at 3:11 PM Niranjana Vishwanathapura
>     <niranjana.vishwanathapura@intel.com> wrote:
>
>       On Wed, Jun 01, 2022 at 01:28:36PM -0700, Matthew Brost wrote:
>       >On Wed, Jun 01, 2022 at 05:25:49PM +0300, Lionel Landwerlin wrote:
>       >> On 17/05/2022 21:32, Niranjana Vishwanathapura wrote:
>       >> > +VM_BIND/UNBIND ioctl will immediately start binding/unbinding
>       the mapping in an
>       >> > +async worker. The binding and unbinding will work like a special
>       GPU engine.
>       >> > +The binding and unbinding operations are serialized and will
>       wait on specified
>       >> > +input fences before the operation and will signal the output
>       fences upon the
>       >> > +completion of the operation. Due to serialization, completion of
>       an operation
>       >> > +will also indicate that all previous operations are also
>       complete.
>       >>
>       >> I guess we should avoid saying "will immediately start
>       binding/unbinding" if
>       >> there are fences involved.
>       >>
>       >> And the fact that it's happening in an async worker seem to imply
>       it's not
>       >> immediate.
>       >>
>
>       Ok, will fix.
>       This was added because in earlier design binding was deferred until
>       next execbuff.
>       But now it is non-deferred (immediate in that sense). But yah, this is
>       confusing
>       and will fix it.
>
>       >>
>       >> I have a question on the behavior of the bind operation when no
>       input fence
>       >> is provided. Let say I do :
>       >>
>       >> VM_BIND (out_fence=fence1)
>       >>
>       >> VM_BIND (out_fence=fence2)
>       >>
>       >> VM_BIND (out_fence=fence3)
>       >>
>       >>
>       >> In what order are the fences going to be signaled?
>       >>
>       >> In the order of VM_BIND ioctls? Or out of order?
>       >>
>       >> Because you wrote "serialized I assume it's : in order
>       >>
>
>       Yes, in the order of VM_BIND/UNBIND ioctls. Note that bind and unbind
>       will use
>       the same queue and hence are ordered.
>
>       >>
>       >> One thing I didn't realize is that because we only get one
>       "VM_BIND" engine,
>       >> there is a disconnect from the Vulkan specification.
>       >>
>       >> In Vulkan VM_BIND operations are serialized but per engine.
>       >>
>       >> So you could have something like this :
>       >>
>       >> VM_BIND (engine=rcs0, in_fence=fence1, out_fence=fence2)
>       >>
>       >> VM_BIND (engine=ccs0, in_fence=fence3, out_fence=fence4)
>       >>
>       >>
>       >> fence1 is not signaled
>       >>
>       >> fence3 is signaled
>       >>
>       >> So the second VM_BIND will proceed before the first VM_BIND.
>       >>
>       >>
>       >> I guess we can deal with that scenario in userspace by doing the
>       wait
>       >> ourselves in one thread per engines.
>       >>
>       >> But then it makes the VM_BIND input fences useless.
>       >>
>       >>
>       >> Daniel : what do you think? Should be rework this or just deal with
>       wait
>       >> fences in userspace?
>       >>
>       >
>       >My opinion is rework this but make the ordering via an engine param
>       optional.
>       >
>       >e.g. A VM can be configured so all binds are ordered within the VM
>       >
>       >e.g. A VM can be configured so all binds accept an engine argument
>       (in
>       >the case of the i915 likely this is a gem context handle) and binds
>       >ordered with respect to that engine.
>       >
>       >This gives UMDs options as the later likely consumes more KMD
>       resources
>       >so if a different UMD can live with binds being ordered within the VM
>       >they can use a mode consuming less resources.
>       >
>
>       I think we need to be careful here if we are looking for some out of
>       (submission) order completion of vm_bind/unbind.
>       In-order completion means, in a batch of binds and unbinds to be
>       completed in-order, user only needs to specify in-fence for the
>       first bind/unbind call and the our-fence for the last bind/unbind
>       call. Also, the VA released by an unbind call can be re-used by
>       any subsequent bind call in that in-order batch.
>
>       These things will break if binding/unbinding were to be allowed to
>       go out of order (of submission) and user need to be extra careful
>       not to run into pre-mature triggereing of out-fence and bind failing
>       as VA is still in use etc.
>
>       Also, VM_BIND binds the provided mapping on the specified address
>       space
>       (VM). So, the uapi is not engine/context specific.
>
>       We can however add a 'queue' to the uapi which can be one from the
>       pre-defined queues,
>       I915_VM_BIND_QUEUE_0
>       I915_VM_BIND_QUEUE_1
>       ...
>       I915_VM_BIND_QUEUE_(N-1)
>
>       KMD will spawn an async work queue for each queue which will only
>       bind the mappings on that queue in the order of submission.
>       User can assign the queue to per engine or anything like that.
>
>       But again here, user need to be careful and not deadlock these
>       queues with circular dependency of fences.
>
>       I prefer adding this later an as extension based on whether it
>       is really helping with the implementation.
>
>     I can tell you right now that having everything on a single in-order
>     queue will not get us the perf we want.  What vulkan really wants is one
>     of two things:
>      1. No implicit ordering of VM_BIND ops.  They just happen in whatever
>     their dependencies are resolved and we ensure ordering ourselves by
>     having a syncobj in the VkQueue.
>      2. The ability to create multiple VM_BIND queues.  We need at least 2
>     but I don't see why there needs to be a limit besides the limits the
>     i915 API already has on the number of engines.  Vulkan could expose
>     multiple sparse binding queues to the client if it's not arbitrarily
>     limited.

Thanks Jason, Lionel.

Jason, what are you referring to when you say "limits the i915 API already
has on the number of engines"? I am not sure if there is such an uapi today.

I am trying to see how many queues we need and don't want it to be arbitrarily
large and unduely blow up memory usage and complexity in i915 driver.

>     Why?  Because Vulkan has two basic kind of bind operations and we don't
>     want any dependencies between them:
>      1. Immediate.  These happen right after BO creation or maybe as part of
>     vkBindImageMemory() or VkBindBufferMemory().  These don't happen on a
>     queue and we don't want them serialized with anything.  To synchronize
>     with submit, we'll have a syncobj in the VkDevice which is signaled by
>     all immediate bind operations and make submits wait on it.
>      2. Queued (sparse): These happen on a VkQueue which may be the same as
>     a render/compute queue or may be its own queue.  It's up to us what we
>     want to advertise.  From the Vulkan API PoV, this is like any other
>     queue.  Operations on it wait on and signal semaphores.  If we have a
>     VM_BIND engine, we'd provide syncobjs to wait and signal just like we do
>     in execbuf().
>     The important thing is that we don't want one type of operation to block
>     on the other.  If immediate binds are blocking on sparse binds, it's
>     going to cause over-synchronization issues.
>     In terms of the internal implementation, I know that there's going to be
>     a lock on the VM and that we can't actually do these things in
>     parallel.  That's fine.  Once the dma_fences have signaled and we're

Thats correct. It is like a single VM_BIND engine with multiple queues
feeding to it.

>     unblocked to do the bind operation, I don't care if there's a bit of
>     synchronization due to locking.  That's expected.  What we can't afford
>     to have is an immediate bind operation suddenly blocking on a sparse
>     operation which is blocked on a compute job that's going to run for
>     another 5ms.

As the VM_BIND queue is per VM, VM_BIND on one VM doesn't block the VM_BIND
on other VMs. I am not sure about usecases here, but just wanted to clarify.

Niranjana

>     For reference, Windows solves this by allowing arbitrarily many paging
>     queues (what they call a VM_BIND engine/queue).  That design works
>     pretty well and solves the problems in question.  Again, we could just
>     make everything out-of-order and require using syncobjs to order things
>     as userspace wants. That'd be fine too.
>     One more note while I'm here: danvet said something on IRC about VM_BIND
>     queues waiting for syncobjs to materialize.  We don't really want/need
>     this.  We already have all the machinery in userspace to handle
>     wait-before-signal and waiting for syncobj fences to materialize and
>     that machinery is on by default.  It would actually take MORE work in
>     Mesa to turn it off and take advantage of the kernel being able to wait
>     for syncobjs to materialize.  Also, getting that right is ridiculously
>     hard and I really don't want to get it wrong in kernel space.  When we
>     do memory fences, wait-before-signal will be a thing.  We don't need to
>     try and make it a thing for syncobj.
>     --Jason
>
>   Thanks Jason,
>
>   I missed the bit in the Vulkan spec that we're allowed to have a sparse
>   queue that does not implement either graphics or compute operations :
>
>     "While some implementations may include VK_QUEUE_SPARSE_BINDING_BIT
>     support in queue families that also include
>
>      graphics and compute support, other implementations may only expose a
>     VK_QUEUE_SPARSE_BINDING_BIT-only queue
>
>      family."
>
>   So it can all be all a vm_bind engine that just does bind/unbind
>   operations.
>
>   But yes we need another engine for the immediate/non-sparse operations.
>
>   -Lionel
>
>      
>
>       Daniel, any thoughts?
>
>       Niranjana
>
>       >Matt
>       >
>       >>
>       >> Sorry I noticed this late.
>       >>
>       >>
>       >> -Lionel
>       >>
>       >>

^ permalink raw reply	[flat|nested] 121+ messages in thread

* RE: [RFC v3 1/3] drm/doc/rfc: VM_BIND feature design document
  2022-06-02 20:48       ` [Intel-gfx] " Niranjana Vishwanathapura
@ 2022-06-06 20:45         ` Zeng, Oak
  -1 siblings, 0 replies; 121+ messages in thread
From: Zeng, Oak @ 2022-06-06 20:45 UTC (permalink / raw)
  To: Vishwanathapura, Niranjana
  Cc: Brost, Matthew, intel-gfx, dri-devel, Hellstrom, Thomas, Wilson,
	Chris P, jason, Vetter,  Daniel, christian.koenig



Regards,
Oak

> -----Original Message-----
> From: Vishwanathapura, Niranjana <niranjana.vishwanathapura@intel.com>
> Sent: June 2, 2022 4:49 PM
> To: Zeng, Oak <oak.zeng@intel.com>
> Cc: intel-gfx@lists.freedesktop.org; dri-devel@lists.freedesktop.org; Vetter,
> Daniel <daniel.vetter@intel.com>; Brost, Matthew <matthew.brost@intel.com>;
> Hellstrom, Thomas <thomas.hellstrom@intel.com>; jason@jlekstrand.net;
> Wilson, Chris P <chris.p.wilson@intel.com>; christian.koenig@amd.com
> Subject: Re: [RFC v3 1/3] drm/doc/rfc: VM_BIND feature design document
> 
> On Wed, Jun 01, 2022 at 07:13:16PM -0700, Zeng, Oak wrote:
> >
> >
> >Regards,
> >Oak
> >
> >> -----Original Message-----
> >> From: dri-devel <dri-devel-bounces@lists.freedesktop.org> On Behalf Of
> >> Niranjana Vishwanathapura
> >> Sent: May 17, 2022 2:32 PM
> >> To: intel-gfx@lists.freedesktop.org; dri-devel@lists.freedesktop.org; Vetter,
> >> Daniel <daniel.vetter@intel.com>
> >> Cc: Brost, Matthew <matthew.brost@intel.com>; Hellstrom, Thomas
> >> <thomas.hellstrom@intel.com>; jason@jlekstrand.net; Wilson, Chris P
> >> <chris.p.wilson@intel.com>; christian.koenig@amd.com
> >> Subject: [RFC v3 1/3] drm/doc/rfc: VM_BIND feature design document
> >>
> >> VM_BIND design document with description of intended use cases.
> >>
> >> v2: Add more documentation and format as per review comments
> >>     from Daniel.
> >>
> >> Signed-off-by: Niranjana Vishwanathapura
> >> <niranjana.vishwanathapura@intel.com>
> >> ---
> >>  Documentation/driver-api/dma-buf.rst   |   2 +
> >>  Documentation/gpu/rfc/i915_vm_bind.rst | 304
> >> +++++++++++++++++++++++++
> >>  Documentation/gpu/rfc/index.rst        |   4 +
> >>  3 files changed, 310 insertions(+)
> >>  create mode 100644 Documentation/gpu/rfc/i915_vm_bind.rst
> >>
> >> diff --git a/Documentation/driver-api/dma-buf.rst b/Documentation/driver-
> >> api/dma-buf.rst
> >> index 36a76cbe9095..64cb924ec5bb 100644
> >> --- a/Documentation/driver-api/dma-buf.rst
> >> +++ b/Documentation/driver-api/dma-buf.rst
> >> @@ -200,6 +200,8 @@ DMA Fence uABI/Sync File
> >>  .. kernel-doc:: include/linux/sync_file.h
> >>     :internal:
> >>
> >> +.. _indefinite_dma_fences:
> >> +
> >>  Indefinite DMA Fences
> >>  ~~~~~~~~~~~~~~~~~~~~~
> >>
> >> diff --git a/Documentation/gpu/rfc/i915_vm_bind.rst
> >> b/Documentation/gpu/rfc/i915_vm_bind.rst
> >> new file mode 100644
> >> index 000000000000..f1be560d313c
> >> --- /dev/null
> >> +++ b/Documentation/gpu/rfc/i915_vm_bind.rst
> >> @@ -0,0 +1,304 @@
> >> +==========================================
> >> +I915 VM_BIND feature design and use cases
> >> +==========================================
> >> +
> >> +VM_BIND feature
> >> +================
> >> +DRM_I915_GEM_VM_BIND/UNBIND ioctls allows UMD to bind/unbind GEM
> >> buffer
> >> +objects (BOs) or sections of a BOs at specified GPU virtual addresses on a
> >> +specified address space (VM). These mappings (also referred to as persistent
> >> +mappings) will be persistent across multiple GPU submissions (execbuff calls)
> >> +issued by the UMD, without user having to provide a list of all required
> >> +mappings during each submission (as required by older execbuff mode).
> >> +
> >> +VM_BIND/UNBIND ioctls will support 'in' and 'out' fences to allow userpace
> >> +to specify how the binding/unbinding should sync with other operations
> >> +like the GPU job submission. These fences will be timeline 'drm_syncobj's
> >> +for non-Compute contexts (See struct
> >> drm_i915_vm_bind_ext_timeline_fences).
> >> +For Compute contexts, they will be user/memory fences (See struct
> >> +drm_i915_vm_bind_ext_user_fence).
> >> +
> >> +VM_BIND feature is advertised to user via I915_PARAM_HAS_VM_BIND.
> >> +User has to opt-in for VM_BIND mode of binding for an address space (VM)
> >> +during VM creation time via I915_VM_CREATE_FLAGS_USE_VM_BIND
> >> extension.
> >> +
> >> +VM_BIND/UNBIND ioctl will immediately start binding/unbinding the
> mapping in
> >> an
> >> +async worker. The binding and unbinding will work like a special GPU engine.
> >> +The binding and unbinding operations are serialized and will wait on specified
> >> +input fences before the operation and will signal the output fences upon the
> >> +completion of the operation. Due to serialization, completion of an operation
> >> +will also indicate that all previous operations are also complete.
> >
> >Hi,
> >
> >Is user required to wait for the out fence be signaled before submit a gpu job
> using the vm_bind address?
> >Or is user required to order the gpu job to make gpu job run after vm_bind out
> fence signaled?
> >
> 
> Thanks Oak,
> Either should be fine and up to user how to use vm_bind/unbind out-fence.
> 
> >I think there could be different behavior on a non-faultable platform and a
> faultable platform, such as on a non-faultable
> >Platform, gpu job is required to be order after vm_bind out fence signaling; and
> on a faultable platform, there is no such
> >Restriction since vm bind can be finished in the fault handler?
> >
> 
> With GPU page faults handler, out fence won't be needed as residency is
> purely managed by page fault handler populating page tables (there is a
> mention of it in GPU Page Faults section below).
> 
> >Should we document such thing?
> >
> 
> We don't talk much about GPU page faults case in this document as that may
> warrent a separate rfc when we add page faults support. We did mention it
> in couple places to ensure our locking design here is extensible to gpu
> page faults case.

Ok, that makes sense to me. Thanks for explaining.

Regards,
Oak

> 
> Niranjana
> 
> >Regards,
> >Oak
> >
> >
> >> +
> >> +VM_BIND features include:
> >> +
> >> +* Multiple Virtual Address (VA) mappings can map to the same physical
> pages
> >> +  of an object (aliasing).
> >> +* VA mapping can map to a partial section of the BO (partial binding).
> >> +* Support capture of persistent mappings in the dump upon GPU error.
> >> +* TLB is flushed upon unbind completion. Batching of TLB flushes in some
> >> +  use cases will be helpful.
> >> +* Asynchronous vm_bind and vm_unbind support with 'in' and 'out' fences.
> >> +* Support for userptr gem objects (no special uapi is required for this).
> >> +
> >> +Execbuff ioctl in VM_BIND mode
> >> +-------------------------------
> >> +The execbuff ioctl handling in VM_BIND mode differs significantly from the
> >> +older method. A VM in VM_BIND mode will not support older execbuff
> mode of
> >> +binding. In VM_BIND mode, execbuff ioctl will not accept any execlist.
> Hence,
> >> +no support for implicit sync. It is expected that the below work will be able
> >> +to support requirements of object dependency setting in all use cases:
> >> +
> >> +"dma-buf: Add an API for exporting sync files"
> >> +(https://lwn.net/Articles/859290/)
> >> +
> >> +This also means, we need an execbuff extension to pass in the batch
> >> +buffer addresses (See struct
> >> drm_i915_gem_execbuffer_ext_batch_addresses).
> >> +
> >> +If at all execlist support in execbuff ioctl is deemed necessary for
> >> +implicit sync in certain use cases, then support can be added later.
> >> +
> >> +In VM_BIND mode, VA allocation is completely managed by the user instead
> of
> >> +the i915 driver. Hence all VA assignment, eviction are not applicable in
> >> +VM_BIND mode. Also, for determining object activeness, VM_BIND mode
> will
> >> not
> >> +be using the i915_vma active reference tracking. It will instead use dma-resv
> >> +object for that (See `VM_BIND dma_resv usage`_).
> >> +
> >> +So, a lot of existing code in the execbuff path like relocations, VA evictions,
> >> +vma lookup table, implicit sync, vma active reference tracking etc., are not
> >> +applicable in VM_BIND mode. Hence, the execbuff path needs to be cleaned
> up
> >> +by clearly separating out the functionalities where the VM_BIND mode
> differs
> >> +from older method and they should be moved to separate files.
> >> +
> >> +VM_PRIVATE objects
> >> +-------------------
> >> +By default, BOs can be mapped on multiple VMs and can also be dma-buf
> >> +exported. Hence these BOs are referred to as Shared BOs.
> >> +During each execbuff submission, the request fence must be added to the
> >> +dma-resv fence list of all shared BOs mapped on the VM.
> >> +
> >> +VM_BIND feature introduces an optimization where user can create BO
> which
> >> +is private to a specified VM via I915_GEM_CREATE_EXT_VM_PRIVATE flag
> >> during
> >> +BO creation. Unlike Shared BOs, these VM private BOs can only be mapped
> on
> >> +the VM they are private to and can't be dma-buf exported.
> >> +All private BOs of a VM share the dma-resv object. Hence during each
> execbuff
> >> +submission, they need only one dma-resv fence list updated. Thus, the fast
> >> +path (where required mappings are already bound) submission latency is
> O(1)
> >> +w.r.t the number of VM private BOs.
> >> +
> >> +VM_BIND locking hirarchy
> >> +-------------------------
> >> +The locking design here supports the older (execlist based) execbuff mode,
> the
> >> +newer VM_BIND mode, the VM_BIND mode with GPU page faults and
> possible
> >> future
> >> +system allocator support (See `Shared Virtual Memory (SVM) support`_).
> >> +The older execbuff mode and the newer VM_BIND mode without page
> faults
> >> manages
> >> +residency of backing storage using dma_fence. The VM_BIND mode with
> page
> >> faults
> >> +and the system allocator support do not use any dma_fence at all.
> >> +
> >> +VM_BIND locking order is as below.
> >> +
> >> +1) Lock-A: A vm_bind mutex will protect vm_bind lists. This lock is taken in
> >> +   vm_bind/vm_unbind ioctl calls, in the execbuff path and while releasing
> the
> >> +   mapping.
> >> +
> >> +   In future, when GPU page faults are supported, we can potentially use a
> >> +   rwsem instead, so that multiple page fault handlers can take the read side
> >> +   lock to lookup the mapping and hence can run in parallel.
> >> +   The older execbuff mode of binding do not need this lock.
> >> +
> >> +2) Lock-B: The object's dma-resv lock will protect i915_vma state and needs
> to
> >> +   be held while binding/unbinding a vma in the async worker and while
> updating
> >> +   dma-resv fence list of an object. Note that private BOs of a VM will all
> >> +   share a dma-resv object.
> >> +
> >> +   The future system allocator support will use the HMM prescribed locking
> >> +   instead.
> >> +
> >> +3) Lock-C: Spinlock/s to protect some of the VM's lists like the list of
> >> +   invalidated vmas (due to eviction and userptr invalidation) etc.
> >> +
> >> +When GPU page faults are supported, the execbuff path do not take any of
> >> these
> >> +locks. There we will simply smash the new batch buffer address into the ring
> >> and
> >> +then tell the scheduler run that. The lock taking only happens from the page
> >> +fault handler, where we take lock-A in read mode, whichever lock-B we
> need to
> >> +find the backing storage (dma_resv lock for gem objects, and hmm/core mm
> for
> >> +system allocator) and some additional locks (lock-D) for taking care of page
> >> +table races. Page fault mode should not need to ever manipulate the vm
> lists,
> >> +so won't ever need lock-C.
> >> +
> >> +VM_BIND LRU handling
> >> +---------------------
> >> +We need to ensure VM_BIND mapped objects are properly LRU tagged to
> avoid
> >> +performance degradation. We will also need support for bulk LRU movement
> of
> >> +VM_BIND objects to avoid additional latencies in execbuff path.
> >> +
> >> +The page table pages are similar to VM_BIND mapped objects (See
> >> +`Evictable page table allocations`_) and are maintained per VM and needs to
> >> +be pinned in memory when VM is made active (ie., upon an execbuff call
> with
> >> +that VM). So, bulk LRU movement of page table pages is also needed.
> >> +
> >> +The i915 shrinker LRU has stopped being an LRU. So, it should also be moved
> >> +over to the ttm LRU in some fashion to make sure we once again have a
> >> reasonable
> >> +and consistent memory aging and reclaim architecture.
> >> +
> >> +VM_BIND dma_resv usage
> >> +-----------------------
> >> +Fences needs to be added to all VM_BIND mapped objects. During each
> >> execbuff
> >> +submission, they are added with DMA_RESV_USAGE_BOOKKEEP usage to
> >> prevent
> >> +over sync (See enum dma_resv_usage). One can override it with either
> >> +DMA_RESV_USAGE_READ or DMA_RESV_USAGE_WRITE usage during
> object
> >> dependency
> >> +setting (either through explicit or implicit mechanism).
> >> +
> >> +When vm_bind is called for a non-private object while the VM is already
> >> +active, the fences need to be copied from VM's shared dma-resv object
> >> +(common to all private objects of the VM) to this non-private object.
> >> +If this results in performance degradation, then some optimization will
> >> +be needed here. This is not a problem for VM's private objects as they use
> >> +shared dma-resv object which is always updated on each execbuff
> submission.
> >> +
> >> +Also, in VM_BIND mode, use dma-resv apis for determining object
> activeness
> >> +(See dma_resv_test_signaled() and dma_resv_wait_timeout()) and do not
> use
> >> the
> >> +older i915_vma active reference tracking which is deprecated. This should be
> >> +easier to get it working with the current TTM backend. We can remove the
> >> +i915_vma active reference tracking fully while supporting TTM backend for
> igfx.
> >> +
> >> +Evictable page table allocations
> >> +---------------------------------
> >> +Make pagetable allocations evictable and manage them similar to VM_BIND
> >> +mapped objects. Page table pages are similar to persistent mappings of a
> >> +VM (difference here are that the page table pages will not have an i915_vma
> >> +structure and after swapping pages back in, parent page link needs to be
> >> +updated).
> >> +
> >> +Mesa use case
> >> +--------------
> >> +VM_BIND can potentially reduce the CPU overhead in Mesa (both Vulkan
> and
> >> Iris),
> >> +hence improving performance of CPU-bound applications. It also allows us to
> >> +implement Vulkan's Sparse Resources. With increasing GPU hardware
> >> performance,
> >> +reducing CPU overhead becomes more impactful.
> >> +
> >> +
> >> +VM_BIND Compute support
> >> +========================
> >> +
> >> +User/Memory Fence
> >> +------------------
> >> +The idea is to take a user specified virtual address and install an interrupt
> >> +handler to wake up the current task when the memory location passes the
> user
> >> +supplied filter. User/Memory fence is a <address, value> pair. To signal the
> >> +user fence, specified value will be written at the specified virtual address
> >> +and wakeup the waiting process. User can wait on a user fence with the
> >> +gem_wait_user_fence ioctl.
> >> +
> >> +It also allows the user to emit their own MI_FLUSH/PIPE_CONTROL notify
> >> +interrupt within their batches after updating the value to have sub-batch
> >> +precision on the wakeup. Each batch can signal a user fence to indicate
> >> +the completion of next level batch. The completion of very first level batch
> >> +needs to be signaled by the command streamer. The user must provide the
> >> +user/memory fence for this via the
> >> DRM_I915_GEM_EXECBUFFER_EXT_USER_FENCE
> >> +extension of execbuff ioctl, so that KMD can setup the command streamer
> to
> >> +signal it.
> >> +
> >> +User/Memory fence can also be supplied to the kernel driver to signal/wake
> up
> >> +the user process after completion of an asynchronous operation.
> >> +
> >> +When VM_BIND ioctl was provided with a user/memory fence via the
> >> +I915_VM_BIND_EXT_USER_FENCE extension, it will be signaled upon the
> >> completion
> >> +of binding of that mapping. All async binds/unbinds are serialized, hence
> >> +signaling of user/memory fence also indicate the completion of all previous
> >> +binds/unbinds.
> >> +
> >> +This feature will be derived from the below original work:
> >> +https://patchwork.freedesktop.org/patch/349417/
> >> +
> >> +Long running Compute contexts
> >> +------------------------------
> >> +Usage of dma-fence expects that they complete in reasonable amount of
> time.
> >> +Compute on the other hand can be long running. Hence it is appropriate for
> >> +compute to use user/memory fence and dma-fence usage will be limited to
> >> +in-kernel consumption only. This requires an execbuff uapi extension to pass
> >> +in user fence (See struct drm_i915_vm_bind_ext_user_fence). Compute
> must
> >> opt-in
> >> +for this mechanism with I915_CONTEXT_CREATE_FLAGS_LONG_RUNNING
> flag
> >> during
> >> +context creation. The dma-fence based user interfaces like gem_wait ioctl
> and
> >> +execbuff out fence are not allowed on long running contexts. Implicit sync is
> >> +not valid as well and is anyway not supported in VM_BIND mode.
> >> +
> >> +Where GPU page faults are not available, kernel driver upon buffer
> invalidation
> >> +will initiate a suspend (preemption) of long running context with a dma-
> fence
> >> +attached to it. And upon completion of that suspend fence, finish the
> >> +invalidation, revalidate the BO and then resume the compute context. This is
> >> +done by having a per-context preempt fence (also called suspend fence)
> >> proxying
> >> +as i915_request fence. This suspend fence is enabled when someone tries to
> >> wait
> >> +on it, which then triggers the context preemption.
> >> +
> >> +As this support for context suspension using a preempt fence and the
> resume
> >> work
> >> +for the compute mode contexts can get tricky to get it right, it is better to
> >> +add this support in drm scheduler so that multiple drivers can make use of it.
> >> +That means, it will have a dependency on i915 drm scheduler conversion with
> >> GuC
> >> +scheduler backend. This should be fine, as the plan is to support compute
> mode
> >> +contexts only with GuC scheduler backend (at least initially). This is much
> >> +easier to support with VM_BIND mode compared to the current heavier
> >> execbuff
> >> +path resource attachment.
> >> +
> >> +Low Latency Submission
> >> +-----------------------
> >> +Allows compute UMD to directly submit GPU jobs instead of through
> execbuff
> >> +ioctl. This is made possible by VM_BIND is not being synchronized against
> >> +execbuff. VM_BIND allows bind/unbind of mappings required for the
> directly
> >> +submitted jobs.
> >> +
> >> +Other VM_BIND use cases
> >> +========================
> >> +
> >> +Debugger
> >> +---------
> >> +With debug event interface user space process (debugger) is able to keep
> track
> >> +of and act upon resources created by another process (debugged) and
> attached
> >> +to GPU via vm_bind interface.
> >> +
> >> +GPU page faults
> >> +----------------
> >> +GPU page faults when supported (in future), will only be supported in the
> >> +VM_BIND mode. While both the older execbuff mode and the newer
> VM_BIND
> >> mode of
> >> +binding will require using dma-fence to ensure residency, the GPU page
> faults
> >> +mode when supported, will not use any dma-fence as residency is purely
> >> managed
> >> +by installing and removing/invalidating page table entries.
> >> +
> >> +Page level hints settings
> >> +--------------------------
> >> +VM_BIND allows any hints setting per mapping instead of per BO.
> >> +Possible hints include read-only mapping, placement and atomicity.
> >> +Sub-BO level placement hint will be even more relevant with
> >> +upcoming GPU on-demand page fault support.
> >> +
> >> +Page level Cache/CLOS settings
> >> +-------------------------------
> >> +VM_BIND allows cache/CLOS settings per mapping instead of per BO.
> >> +
> >> +Shared Virtual Memory (SVM) support
> >> +------------------------------------
> >> +VM_BIND interface can be used to map system memory directly (without
> gem
> >> BO
> >> +abstraction) using the HMM interface. SVM is only supported with GPU page
> >> +faults enabled.
> >> +
> >> +
> >> +Broder i915 cleanups
> >> +=====================
> >> +Supporting this whole new vm_bind mode of binding which comes with its
> own
> >> +use cases to support and the locking requirements requires proper
> integration
> >> +with the existing i915 driver. This calls for some broader i915 driver
> >> +cleanups/simplifications for maintainability of the driver going forward.
> >> +Here are few things identified and are being looked into.
> >> +
> >> +- Remove vma lookup cache (eb->gem_context->handles_vma). VM_BIND
> >> feature
> >> +  do not use it and complexity it brings in is probably more than the
> >> +  performance advantage we get in legacy execbuff case.
> >> +- Remove vma->open_count counting
> >> +- Remove i915_vma active reference tracking. VM_BIND feature will not be
> >> using
> >> +  it. Instead use underlying BO's dma-resv fence list to determine if a
> i915_vma
> >> +  is active or not.
> >> +
> >> +
> >> +VM_BIND UAPI
> >> +=============
> >> +
> >> +.. kernel-doc:: Documentation/gpu/rfc/i915_vm_bind.h
> >> diff --git a/Documentation/gpu/rfc/index.rst
> b/Documentation/gpu/rfc/index.rst
> >> index 91e93a705230..7d10c36b268d 100644
> >> --- a/Documentation/gpu/rfc/index.rst
> >> +++ b/Documentation/gpu/rfc/index.rst
> >> @@ -23,3 +23,7 @@ host such documentation:
> >>  .. toctree::
> >>
> >>      i915_scheduler.rst
> >> +
> >> +.. toctree::
> >> +
> >> +    i915_vm_bind.rst
> >> --
> >> 2.21.0.rc0.32.g243a4c7e27
> >

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [Intel-gfx] [RFC v3 1/3] drm/doc/rfc: VM_BIND feature design document
@ 2022-06-06 20:45         ` Zeng, Oak
  0 siblings, 0 replies; 121+ messages in thread
From: Zeng, Oak @ 2022-06-06 20:45 UTC (permalink / raw)
  To: Vishwanathapura, Niranjana
  Cc: intel-gfx, dri-devel, Hellstrom, Thomas, Wilson, Chris P, Vetter,
	 Daniel, christian.koenig



Regards,
Oak

> -----Original Message-----
> From: Vishwanathapura, Niranjana <niranjana.vishwanathapura@intel.com>
> Sent: June 2, 2022 4:49 PM
> To: Zeng, Oak <oak.zeng@intel.com>
> Cc: intel-gfx@lists.freedesktop.org; dri-devel@lists.freedesktop.org; Vetter,
> Daniel <daniel.vetter@intel.com>; Brost, Matthew <matthew.brost@intel.com>;
> Hellstrom, Thomas <thomas.hellstrom@intel.com>; jason@jlekstrand.net;
> Wilson, Chris P <chris.p.wilson@intel.com>; christian.koenig@amd.com
> Subject: Re: [RFC v3 1/3] drm/doc/rfc: VM_BIND feature design document
> 
> On Wed, Jun 01, 2022 at 07:13:16PM -0700, Zeng, Oak wrote:
> >
> >
> >Regards,
> >Oak
> >
> >> -----Original Message-----
> >> From: dri-devel <dri-devel-bounces@lists.freedesktop.org> On Behalf Of
> >> Niranjana Vishwanathapura
> >> Sent: May 17, 2022 2:32 PM
> >> To: intel-gfx@lists.freedesktop.org; dri-devel@lists.freedesktop.org; Vetter,
> >> Daniel <daniel.vetter@intel.com>
> >> Cc: Brost, Matthew <matthew.brost@intel.com>; Hellstrom, Thomas
> >> <thomas.hellstrom@intel.com>; jason@jlekstrand.net; Wilson, Chris P
> >> <chris.p.wilson@intel.com>; christian.koenig@amd.com
> >> Subject: [RFC v3 1/3] drm/doc/rfc: VM_BIND feature design document
> >>
> >> VM_BIND design document with description of intended use cases.
> >>
> >> v2: Add more documentation and format as per review comments
> >>     from Daniel.
> >>
> >> Signed-off-by: Niranjana Vishwanathapura
> >> <niranjana.vishwanathapura@intel.com>
> >> ---
> >>  Documentation/driver-api/dma-buf.rst   |   2 +
> >>  Documentation/gpu/rfc/i915_vm_bind.rst | 304
> >> +++++++++++++++++++++++++
> >>  Documentation/gpu/rfc/index.rst        |   4 +
> >>  3 files changed, 310 insertions(+)
> >>  create mode 100644 Documentation/gpu/rfc/i915_vm_bind.rst
> >>
> >> diff --git a/Documentation/driver-api/dma-buf.rst b/Documentation/driver-
> >> api/dma-buf.rst
> >> index 36a76cbe9095..64cb924ec5bb 100644
> >> --- a/Documentation/driver-api/dma-buf.rst
> >> +++ b/Documentation/driver-api/dma-buf.rst
> >> @@ -200,6 +200,8 @@ DMA Fence uABI/Sync File
> >>  .. kernel-doc:: include/linux/sync_file.h
> >>     :internal:
> >>
> >> +.. _indefinite_dma_fences:
> >> +
> >>  Indefinite DMA Fences
> >>  ~~~~~~~~~~~~~~~~~~~~~
> >>
> >> diff --git a/Documentation/gpu/rfc/i915_vm_bind.rst
> >> b/Documentation/gpu/rfc/i915_vm_bind.rst
> >> new file mode 100644
> >> index 000000000000..f1be560d313c
> >> --- /dev/null
> >> +++ b/Documentation/gpu/rfc/i915_vm_bind.rst
> >> @@ -0,0 +1,304 @@
> >> +==========================================
> >> +I915 VM_BIND feature design and use cases
> >> +==========================================
> >> +
> >> +VM_BIND feature
> >> +================
> >> +DRM_I915_GEM_VM_BIND/UNBIND ioctls allows UMD to bind/unbind GEM
> >> buffer
> >> +objects (BOs) or sections of a BOs at specified GPU virtual addresses on a
> >> +specified address space (VM). These mappings (also referred to as persistent
> >> +mappings) will be persistent across multiple GPU submissions (execbuff calls)
> >> +issued by the UMD, without user having to provide a list of all required
> >> +mappings during each submission (as required by older execbuff mode).
> >> +
> >> +VM_BIND/UNBIND ioctls will support 'in' and 'out' fences to allow userpace
> >> +to specify how the binding/unbinding should sync with other operations
> >> +like the GPU job submission. These fences will be timeline 'drm_syncobj's
> >> +for non-Compute contexts (See struct
> >> drm_i915_vm_bind_ext_timeline_fences).
> >> +For Compute contexts, they will be user/memory fences (See struct
> >> +drm_i915_vm_bind_ext_user_fence).
> >> +
> >> +VM_BIND feature is advertised to user via I915_PARAM_HAS_VM_BIND.
> >> +User has to opt-in for VM_BIND mode of binding for an address space (VM)
> >> +during VM creation time via I915_VM_CREATE_FLAGS_USE_VM_BIND
> >> extension.
> >> +
> >> +VM_BIND/UNBIND ioctl will immediately start binding/unbinding the
> mapping in
> >> an
> >> +async worker. The binding and unbinding will work like a special GPU engine.
> >> +The binding and unbinding operations are serialized and will wait on specified
> >> +input fences before the operation and will signal the output fences upon the
> >> +completion of the operation. Due to serialization, completion of an operation
> >> +will also indicate that all previous operations are also complete.
> >
> >Hi,
> >
> >Is user required to wait for the out fence be signaled before submit a gpu job
> using the vm_bind address?
> >Or is user required to order the gpu job to make gpu job run after vm_bind out
> fence signaled?
> >
> 
> Thanks Oak,
> Either should be fine and up to user how to use vm_bind/unbind out-fence.
> 
> >I think there could be different behavior on a non-faultable platform and a
> faultable platform, such as on a non-faultable
> >Platform, gpu job is required to be order after vm_bind out fence signaling; and
> on a faultable platform, there is no such
> >Restriction since vm bind can be finished in the fault handler?
> >
> 
> With GPU page faults handler, out fence won't be needed as residency is
> purely managed by page fault handler populating page tables (there is a
> mention of it in GPU Page Faults section below).
> 
> >Should we document such thing?
> >
> 
> We don't talk much about GPU page faults case in this document as that may
> warrent a separate rfc when we add page faults support. We did mention it
> in couple places to ensure our locking design here is extensible to gpu
> page faults case.

Ok, that makes sense to me. Thanks for explaining.

Regards,
Oak

> 
> Niranjana
> 
> >Regards,
> >Oak
> >
> >
> >> +
> >> +VM_BIND features include:
> >> +
> >> +* Multiple Virtual Address (VA) mappings can map to the same physical
> pages
> >> +  of an object (aliasing).
> >> +* VA mapping can map to a partial section of the BO (partial binding).
> >> +* Support capture of persistent mappings in the dump upon GPU error.
> >> +* TLB is flushed upon unbind completion. Batching of TLB flushes in some
> >> +  use cases will be helpful.
> >> +* Asynchronous vm_bind and vm_unbind support with 'in' and 'out' fences.
> >> +* Support for userptr gem objects (no special uapi is required for this).
> >> +
> >> +Execbuff ioctl in VM_BIND mode
> >> +-------------------------------
> >> +The execbuff ioctl handling in VM_BIND mode differs significantly from the
> >> +older method. A VM in VM_BIND mode will not support older execbuff
> mode of
> >> +binding. In VM_BIND mode, execbuff ioctl will not accept any execlist.
> Hence,
> >> +no support for implicit sync. It is expected that the below work will be able
> >> +to support requirements of object dependency setting in all use cases:
> >> +
> >> +"dma-buf: Add an API for exporting sync files"
> >> +(https://lwn.net/Articles/859290/)
> >> +
> >> +This also means, we need an execbuff extension to pass in the batch
> >> +buffer addresses (See struct
> >> drm_i915_gem_execbuffer_ext_batch_addresses).
> >> +
> >> +If at all execlist support in execbuff ioctl is deemed necessary for
> >> +implicit sync in certain use cases, then support can be added later.
> >> +
> >> +In VM_BIND mode, VA allocation is completely managed by the user instead
> of
> >> +the i915 driver. Hence all VA assignment, eviction are not applicable in
> >> +VM_BIND mode. Also, for determining object activeness, VM_BIND mode
> will
> >> not
> >> +be using the i915_vma active reference tracking. It will instead use dma-resv
> >> +object for that (See `VM_BIND dma_resv usage`_).
> >> +
> >> +So, a lot of existing code in the execbuff path like relocations, VA evictions,
> >> +vma lookup table, implicit sync, vma active reference tracking etc., are not
> >> +applicable in VM_BIND mode. Hence, the execbuff path needs to be cleaned
> up
> >> +by clearly separating out the functionalities where the VM_BIND mode
> differs
> >> +from older method and they should be moved to separate files.
> >> +
> >> +VM_PRIVATE objects
> >> +-------------------
> >> +By default, BOs can be mapped on multiple VMs and can also be dma-buf
> >> +exported. Hence these BOs are referred to as Shared BOs.
> >> +During each execbuff submission, the request fence must be added to the
> >> +dma-resv fence list of all shared BOs mapped on the VM.
> >> +
> >> +VM_BIND feature introduces an optimization where user can create BO
> which
> >> +is private to a specified VM via I915_GEM_CREATE_EXT_VM_PRIVATE flag
> >> during
> >> +BO creation. Unlike Shared BOs, these VM private BOs can only be mapped
> on
> >> +the VM they are private to and can't be dma-buf exported.
> >> +All private BOs of a VM share the dma-resv object. Hence during each
> execbuff
> >> +submission, they need only one dma-resv fence list updated. Thus, the fast
> >> +path (where required mappings are already bound) submission latency is
> O(1)
> >> +w.r.t the number of VM private BOs.
> >> +
> >> +VM_BIND locking hirarchy
> >> +-------------------------
> >> +The locking design here supports the older (execlist based) execbuff mode,
> the
> >> +newer VM_BIND mode, the VM_BIND mode with GPU page faults and
> possible
> >> future
> >> +system allocator support (See `Shared Virtual Memory (SVM) support`_).
> >> +The older execbuff mode and the newer VM_BIND mode without page
> faults
> >> manages
> >> +residency of backing storage using dma_fence. The VM_BIND mode with
> page
> >> faults
> >> +and the system allocator support do not use any dma_fence at all.
> >> +
> >> +VM_BIND locking order is as below.
> >> +
> >> +1) Lock-A: A vm_bind mutex will protect vm_bind lists. This lock is taken in
> >> +   vm_bind/vm_unbind ioctl calls, in the execbuff path and while releasing
> the
> >> +   mapping.
> >> +
> >> +   In future, when GPU page faults are supported, we can potentially use a
> >> +   rwsem instead, so that multiple page fault handlers can take the read side
> >> +   lock to lookup the mapping and hence can run in parallel.
> >> +   The older execbuff mode of binding do not need this lock.
> >> +
> >> +2) Lock-B: The object's dma-resv lock will protect i915_vma state and needs
> to
> >> +   be held while binding/unbinding a vma in the async worker and while
> updating
> >> +   dma-resv fence list of an object. Note that private BOs of a VM will all
> >> +   share a dma-resv object.
> >> +
> >> +   The future system allocator support will use the HMM prescribed locking
> >> +   instead.
> >> +
> >> +3) Lock-C: Spinlock/s to protect some of the VM's lists like the list of
> >> +   invalidated vmas (due to eviction and userptr invalidation) etc.
> >> +
> >> +When GPU page faults are supported, the execbuff path do not take any of
> >> these
> >> +locks. There we will simply smash the new batch buffer address into the ring
> >> and
> >> +then tell the scheduler run that. The lock taking only happens from the page
> >> +fault handler, where we take lock-A in read mode, whichever lock-B we
> need to
> >> +find the backing storage (dma_resv lock for gem objects, and hmm/core mm
> for
> >> +system allocator) and some additional locks (lock-D) for taking care of page
> >> +table races. Page fault mode should not need to ever manipulate the vm
> lists,
> >> +so won't ever need lock-C.
> >> +
> >> +VM_BIND LRU handling
> >> +---------------------
> >> +We need to ensure VM_BIND mapped objects are properly LRU tagged to
> avoid
> >> +performance degradation. We will also need support for bulk LRU movement
> of
> >> +VM_BIND objects to avoid additional latencies in execbuff path.
> >> +
> >> +The page table pages are similar to VM_BIND mapped objects (See
> >> +`Evictable page table allocations`_) and are maintained per VM and needs to
> >> +be pinned in memory when VM is made active (ie., upon an execbuff call
> with
> >> +that VM). So, bulk LRU movement of page table pages is also needed.
> >> +
> >> +The i915 shrinker LRU has stopped being an LRU. So, it should also be moved
> >> +over to the ttm LRU in some fashion to make sure we once again have a
> >> reasonable
> >> +and consistent memory aging and reclaim architecture.
> >> +
> >> +VM_BIND dma_resv usage
> >> +-----------------------
> >> +Fences needs to be added to all VM_BIND mapped objects. During each
> >> execbuff
> >> +submission, they are added with DMA_RESV_USAGE_BOOKKEEP usage to
> >> prevent
> >> +over sync (See enum dma_resv_usage). One can override it with either
> >> +DMA_RESV_USAGE_READ or DMA_RESV_USAGE_WRITE usage during
> object
> >> dependency
> >> +setting (either through explicit or implicit mechanism).
> >> +
> >> +When vm_bind is called for a non-private object while the VM is already
> >> +active, the fences need to be copied from VM's shared dma-resv object
> >> +(common to all private objects of the VM) to this non-private object.
> >> +If this results in performance degradation, then some optimization will
> >> +be needed here. This is not a problem for VM's private objects as they use
> >> +shared dma-resv object which is always updated on each execbuff
> submission.
> >> +
> >> +Also, in VM_BIND mode, use dma-resv apis for determining object
> activeness
> >> +(See dma_resv_test_signaled() and dma_resv_wait_timeout()) and do not
> use
> >> the
> >> +older i915_vma active reference tracking which is deprecated. This should be
> >> +easier to get it working with the current TTM backend. We can remove the
> >> +i915_vma active reference tracking fully while supporting TTM backend for
> igfx.
> >> +
> >> +Evictable page table allocations
> >> +---------------------------------
> >> +Make pagetable allocations evictable and manage them similar to VM_BIND
> >> +mapped objects. Page table pages are similar to persistent mappings of a
> >> +VM (difference here are that the page table pages will not have an i915_vma
> >> +structure and after swapping pages back in, parent page link needs to be
> >> +updated).
> >> +
> >> +Mesa use case
> >> +--------------
> >> +VM_BIND can potentially reduce the CPU overhead in Mesa (both Vulkan
> and
> >> Iris),
> >> +hence improving performance of CPU-bound applications. It also allows us to
> >> +implement Vulkan's Sparse Resources. With increasing GPU hardware
> >> performance,
> >> +reducing CPU overhead becomes more impactful.
> >> +
> >> +
> >> +VM_BIND Compute support
> >> +========================
> >> +
> >> +User/Memory Fence
> >> +------------------
> >> +The idea is to take a user specified virtual address and install an interrupt
> >> +handler to wake up the current task when the memory location passes the
> user
> >> +supplied filter. User/Memory fence is a <address, value> pair. To signal the
> >> +user fence, specified value will be written at the specified virtual address
> >> +and wakeup the waiting process. User can wait on a user fence with the
> >> +gem_wait_user_fence ioctl.
> >> +
> >> +It also allows the user to emit their own MI_FLUSH/PIPE_CONTROL notify
> >> +interrupt within their batches after updating the value to have sub-batch
> >> +precision on the wakeup. Each batch can signal a user fence to indicate
> >> +the completion of next level batch. The completion of very first level batch
> >> +needs to be signaled by the command streamer. The user must provide the
> >> +user/memory fence for this via the
> >> DRM_I915_GEM_EXECBUFFER_EXT_USER_FENCE
> >> +extension of execbuff ioctl, so that KMD can setup the command streamer
> to
> >> +signal it.
> >> +
> >> +User/Memory fence can also be supplied to the kernel driver to signal/wake
> up
> >> +the user process after completion of an asynchronous operation.
> >> +
> >> +When VM_BIND ioctl was provided with a user/memory fence via the
> >> +I915_VM_BIND_EXT_USER_FENCE extension, it will be signaled upon the
> >> completion
> >> +of binding of that mapping. All async binds/unbinds are serialized, hence
> >> +signaling of user/memory fence also indicate the completion of all previous
> >> +binds/unbinds.
> >> +
> >> +This feature will be derived from the below original work:
> >> +https://patchwork.freedesktop.org/patch/349417/
> >> +
> >> +Long running Compute contexts
> >> +------------------------------
> >> +Usage of dma-fence expects that they complete in reasonable amount of
> time.
> >> +Compute on the other hand can be long running. Hence it is appropriate for
> >> +compute to use user/memory fence and dma-fence usage will be limited to
> >> +in-kernel consumption only. This requires an execbuff uapi extension to pass
> >> +in user fence (See struct drm_i915_vm_bind_ext_user_fence). Compute
> must
> >> opt-in
> >> +for this mechanism with I915_CONTEXT_CREATE_FLAGS_LONG_RUNNING
> flag
> >> during
> >> +context creation. The dma-fence based user interfaces like gem_wait ioctl
> and
> >> +execbuff out fence are not allowed on long running contexts. Implicit sync is
> >> +not valid as well and is anyway not supported in VM_BIND mode.
> >> +
> >> +Where GPU page faults are not available, kernel driver upon buffer
> invalidation
> >> +will initiate a suspend (preemption) of long running context with a dma-
> fence
> >> +attached to it. And upon completion of that suspend fence, finish the
> >> +invalidation, revalidate the BO and then resume the compute context. This is
> >> +done by having a per-context preempt fence (also called suspend fence)
> >> proxying
> >> +as i915_request fence. This suspend fence is enabled when someone tries to
> >> wait
> >> +on it, which then triggers the context preemption.
> >> +
> >> +As this support for context suspension using a preempt fence and the
> resume
> >> work
> >> +for the compute mode contexts can get tricky to get it right, it is better to
> >> +add this support in drm scheduler so that multiple drivers can make use of it.
> >> +That means, it will have a dependency on i915 drm scheduler conversion with
> >> GuC
> >> +scheduler backend. This should be fine, as the plan is to support compute
> mode
> >> +contexts only with GuC scheduler backend (at least initially). This is much
> >> +easier to support with VM_BIND mode compared to the current heavier
> >> execbuff
> >> +path resource attachment.
> >> +
> >> +Low Latency Submission
> >> +-----------------------
> >> +Allows compute UMD to directly submit GPU jobs instead of through
> execbuff
> >> +ioctl. This is made possible by VM_BIND is not being synchronized against
> >> +execbuff. VM_BIND allows bind/unbind of mappings required for the
> directly
> >> +submitted jobs.
> >> +
> >> +Other VM_BIND use cases
> >> +========================
> >> +
> >> +Debugger
> >> +---------
> >> +With debug event interface user space process (debugger) is able to keep
> track
> >> +of and act upon resources created by another process (debugged) and
> attached
> >> +to GPU via vm_bind interface.
> >> +
> >> +GPU page faults
> >> +----------------
> >> +GPU page faults when supported (in future), will only be supported in the
> >> +VM_BIND mode. While both the older execbuff mode and the newer
> VM_BIND
> >> mode of
> >> +binding will require using dma-fence to ensure residency, the GPU page
> faults
> >> +mode when supported, will not use any dma-fence as residency is purely
> >> managed
> >> +by installing and removing/invalidating page table entries.
> >> +
> >> +Page level hints settings
> >> +--------------------------
> >> +VM_BIND allows any hints setting per mapping instead of per BO.
> >> +Possible hints include read-only mapping, placement and atomicity.
> >> +Sub-BO level placement hint will be even more relevant with
> >> +upcoming GPU on-demand page fault support.
> >> +
> >> +Page level Cache/CLOS settings
> >> +-------------------------------
> >> +VM_BIND allows cache/CLOS settings per mapping instead of per BO.
> >> +
> >> +Shared Virtual Memory (SVM) support
> >> +------------------------------------
> >> +VM_BIND interface can be used to map system memory directly (without
> gem
> >> BO
> >> +abstraction) using the HMM interface. SVM is only supported with GPU page
> >> +faults enabled.
> >> +
> >> +
> >> +Broder i915 cleanups
> >> +=====================
> >> +Supporting this whole new vm_bind mode of binding which comes with its
> own
> >> +use cases to support and the locking requirements requires proper
> integration
> >> +with the existing i915 driver. This calls for some broader i915 driver
> >> +cleanups/simplifications for maintainability of the driver going forward.
> >> +Here are few things identified and are being looked into.
> >> +
> >> +- Remove vma lookup cache (eb->gem_context->handles_vma). VM_BIND
> >> feature
> >> +  do not use it and complexity it brings in is probably more than the
> >> +  performance advantage we get in legacy execbuff case.
> >> +- Remove vma->open_count counting
> >> +- Remove i915_vma active reference tracking. VM_BIND feature will not be
> >> using
> >> +  it. Instead use underlying BO's dma-resv fence list to determine if a
> i915_vma
> >> +  is active or not.
> >> +
> >> +
> >> +VM_BIND UAPI
> >> +=============
> >> +
> >> +.. kernel-doc:: Documentation/gpu/rfc/i915_vm_bind.h
> >> diff --git a/Documentation/gpu/rfc/index.rst
> b/Documentation/gpu/rfc/index.rst
> >> index 91e93a705230..7d10c36b268d 100644
> >> --- a/Documentation/gpu/rfc/index.rst
> >> +++ b/Documentation/gpu/rfc/index.rst
> >> @@ -23,3 +23,7 @@ host such documentation:
> >>  .. toctree::
> >>
> >>      i915_scheduler.rst
> >> +
> >> +.. toctree::
> >> +
> >> +    i915_vm_bind.rst
> >> --
> >> 2.21.0.rc0.32.g243a4c7e27
> >

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [Intel-gfx] [RFC v3 3/3] drm/doc/rfc: VM_BIND uapi definition
  2022-05-17 18:32   ` Niranjana Vishwanathapura
  (?)
  (?)
@ 2022-06-07 10:27   ` Tvrtko Ursulin
  2022-06-07 19:37     ` Niranjana Vishwanathapura
  -1 siblings, 1 reply; 121+ messages in thread
From: Tvrtko Ursulin @ 2022-06-07 10:27 UTC (permalink / raw)
  To: Niranjana Vishwanathapura, intel-gfx, dri-devel, daniel.vetter
  Cc: thomas.hellstrom, christian.koenig, chris.p.wilson


On 17/05/2022 19:32, Niranjana Vishwanathapura wrote:
> VM_BIND and related uapi definitions
> 
> v2: Ensure proper kernel-doc formatting with cross references.
>      Also add new uapi and documentation as per review comments
>      from Daniel.
> 
> Signed-off-by: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>
> ---
>   Documentation/gpu/rfc/i915_vm_bind.h | 399 +++++++++++++++++++++++++++
>   1 file changed, 399 insertions(+)
>   create mode 100644 Documentation/gpu/rfc/i915_vm_bind.h
> 
> diff --git a/Documentation/gpu/rfc/i915_vm_bind.h b/Documentation/gpu/rfc/i915_vm_bind.h
> new file mode 100644
> index 000000000000..589c0a009107
> --- /dev/null
> +++ b/Documentation/gpu/rfc/i915_vm_bind.h
> @@ -0,0 +1,399 @@
> +/* SPDX-License-Identifier: MIT */
> +/*
> + * Copyright © 2022 Intel Corporation
> + */
> +
> +/**
> + * DOC: I915_PARAM_HAS_VM_BIND
> + *
> + * VM_BIND feature availability.
> + * See typedef drm_i915_getparam_t param.
> + */
> +#define I915_PARAM_HAS_VM_BIND		57
> +
> +/**
> + * DOC: I915_VM_CREATE_FLAGS_USE_VM_BIND
> + *
> + * Flag to opt-in for VM_BIND mode of binding during VM creation.
> + * See struct drm_i915_gem_vm_control flags.
> + *
> + * A VM in VM_BIND mode will not support the older execbuff mode of binding.
> + * In VM_BIND mode, execbuff ioctl will not accept any execlist (ie., the
> + * &drm_i915_gem_execbuffer2.buffer_count must be 0).
> + * Also, &drm_i915_gem_execbuffer2.batch_start_offset and
> + * &drm_i915_gem_execbuffer2.batch_len must be 0.
> + * DRM_I915_GEM_EXECBUFFER_EXT_BATCH_ADDRESSES extension must be provided
> + * to pass in the batch buffer addresses.
> + *
> + * Additionally, I915_EXEC_NO_RELOC, I915_EXEC_HANDLE_LUT and
> + * I915_EXEC_BATCH_FIRST of &drm_i915_gem_execbuffer2.flags must be 0
> + * (not used) in VM_BIND mode. I915_EXEC_USE_EXTENSIONS flag must always be
> + * set (See struct drm_i915_gem_execbuffer_ext_batch_addresses).
> + * The buffers_ptr, buffer_count, batch_start_offset and batch_len fields
> + * of struct drm_i915_gem_execbuffer2 are also not used and must be 0.
> + */
> +#define I915_VM_CREATE_FLAGS_USE_VM_BIND	(1 << 0)
> +
> +/**
> + * DOC: I915_CONTEXT_CREATE_FLAGS_LONG_RUNNING
> + *
> + * Flag to declare context as long running.
> + * See struct drm_i915_gem_context_create_ext flags.
> + *
> + * Usage of dma-fence expects that they complete in reasonable amount of time.
> + * Compute on the other hand can be long running. Hence it is not appropriate
> + * for compute contexts to export request completion dma-fence to user.
> + * The dma-fence usage will be limited to in-kernel consumption only.
> + * Compute contexts need to use user/memory fence.
> + *
> + * So, long running contexts do not support output fences. Hence,
> + * I915_EXEC_FENCE_OUT (See &drm_i915_gem_execbuffer2.flags and
> + * I915_EXEC_FENCE_SIGNAL (See &drm_i915_gem_exec_fence.flags) are expected
> + * to be not used.
> + *
> + * DRM_I915_GEM_WAIT ioctl call is also not supported for objects mapped
> + * to long running contexts.
> + */
> +#define I915_CONTEXT_CREATE_FLAGS_LONG_RUNNING   (1u << 2)
> +
> +/* VM_BIND related ioctls */
> +#define DRM_I915_GEM_VM_BIND		0x3d
> +#define DRM_I915_GEM_VM_UNBIND		0x3e
> +#define DRM_I915_GEM_WAIT_USER_FENCE	0x3f
> +
> +#define DRM_IOCTL_I915_GEM_VM_BIND		DRM_IOWR(DRM_COMMAND_BASE + DRM_I915_GEM_VM_BIND, struct drm_i915_gem_vm_bind)
> +#define DRM_IOCTL_I915_GEM_VM_UNBIND		DRM_IOWR(DRM_COMMAND_BASE + DRM_I915_GEM_VM_UNBIND, struct drm_i915_gem_vm_bind)
> +#define DRM_IOCTL_I915_GEM_WAIT_USER_FENCE	DRM_IOWR(DRM_COMMAND_BASE + DRM_I915_GEM_WAIT_USER_FENCE, struct drm_i915_gem_wait_user_fence)
> +
> +/**
> + * struct drm_i915_gem_vm_bind - VA to object mapping to bind.
> + *
> + * This structure is passed to VM_BIND ioctl and specifies the mapping of GPU
> + * virtual address (VA) range to the section of an object that should be bound
> + * in the device page table of the specified address space (VM).
> + * The VA range specified must be unique (ie., not currently bound) and can
> + * be mapped to whole object or a section of the object (partial binding).
> + * Multiple VA mappings can be created to the same section of the object
> + * (aliasing).
> + */
> +struct drm_i915_gem_vm_bind {
> +	/** @vm_id: VM (address space) id to bind */
> +	__u32 vm_id;
> +
> +	/** @handle: Object handle */
> +	__u32 handle;
> +
> +	/** @start: Virtual Address start to bind */
> +	__u64 start;
> +
> +	/** @offset: Offset in object to bind */
> +	__u64 offset;
> +
> +	/** @length: Length of mapping to bind */
> +	__u64 length;

Does it support, or should it, equivalent of EXEC_OBJECT_PAD_TO_SIZE? Or 
if not userspace is expected to map the remainder of the space to a 
dummy object? In which case would there be any alignment/padding issues 
preventing the two bind to be placed next to each other?

I ask because someone from the compute side asked me about a problem 
with their strategy of dealing with overfetch and I suggested pad to size.

Regards,

Tvrtko

> +
> +	/**
> +	 * @flags: Supported flags are,
> +	 *
> +	 * I915_GEM_VM_BIND_READONLY:
> +	 * Mapping is read-only.
> +	 *
> +	 * I915_GEM_VM_BIND_CAPTURE:
> +	 * Capture this mapping in the dump upon GPU error.
> +	 */
> +	__u64 flags;
> +#define I915_GEM_VM_BIND_READONLY    (1 << 0)
> +#define I915_GEM_VM_BIND_CAPTURE     (1 << 1)
> +
> +	/** @extensions: 0-terminated chain of extensions for this mapping. */
> +	__u64 extensions;
> +};
> +
> +/**
> + * struct drm_i915_gem_vm_unbind - VA to object mapping to unbind.
> + *
> + * This structure is passed to VM_UNBIND ioctl and specifies the GPU virtual
> + * address (VA) range that should be unbound from the device page table of the
> + * specified address space (VM). The specified VA range must match one of the
> + * mappings created with the VM_BIND ioctl. TLB is flushed upon unbind
> + * completion.
> + */
> +struct drm_i915_gem_vm_unbind {
> +	/** @vm_id: VM (address space) id to bind */
> +	__u32 vm_id;
> +
> +	/** @rsvd: Reserved for future use; must be zero. */
> +	__u32 rsvd;
> +
> +	/** @start: Virtual Address start to unbind */
> +	__u64 start;
> +
> +	/** @length: Length of mapping to unbind */
> +	__u64 length;
> +
> +	/** @flags: reserved for future usage, currently MBZ */
> +	__u64 flags;
> +
> +	/** @extensions: 0-terminated chain of extensions for this mapping. */
> +	__u64 extensions;
> +};
> +
> +/**
> + * struct drm_i915_vm_bind_fence - An input or output fence for the vm_bind
> + * or the vm_unbind work.
> + *
> + * The vm_bind or vm_unbind aync worker will wait for input fence to signal
> + * before starting the binding or unbinding.
> + *
> + * The vm_bind or vm_unbind async worker will signal the returned output fence
> + * after the completion of binding or unbinding.
> + */
> +struct drm_i915_vm_bind_fence {
> +	/** @handle: User's handle for a drm_syncobj to wait on or signal. */
> +	__u32 handle;
> +
> +	/**
> +	 * @flags: Supported flags are,
> +	 *
> +	 * I915_VM_BIND_FENCE_WAIT:
> +	 * Wait for the input fence before binding/unbinding
> +	 *
> +	 * I915_VM_BIND_FENCE_SIGNAL:
> +	 * Return bind/unbind completion fence as output
> +	 */
> +	__u32 flags;
> +#define I915_VM_BIND_FENCE_WAIT            (1<<0)
> +#define I915_VM_BIND_FENCE_SIGNAL          (1<<1)
> +#define __I915_VM_BIND_FENCE_UNKNOWN_FLAGS (-(I915_VM_BIND_FENCE_SIGNAL << 1))
> +};
> +
> +/**
> + * struct drm_i915_vm_bind_ext_timeline_fences - Timeline fences for vm_bind
> + * and vm_unbind.
> + *
> + * This structure describes an array of timeline drm_syncobj and associated
> + * points for timeline variants of drm_syncobj. These timeline 'drm_syncobj's
> + * can be input or output fences (See struct drm_i915_vm_bind_fence).
> + */
> +struct drm_i915_vm_bind_ext_timeline_fences {
> +#define I915_VM_BIND_EXT_timeline_FENCES	0
> +	/** @base: Extension link. See struct i915_user_extension. */
> +	struct i915_user_extension base;
> +
> +	/**
> +	 * @fence_count: Number of elements in the @handles_ptr & @value_ptr
> +	 * arrays.
> +	 */
> +	__u64 fence_count;
> +
> +	/**
> +	 * @handles_ptr: Pointer to an array of struct drm_i915_vm_bind_fence
> +	 * of length @fence_count.
> +	 */
> +	__u64 handles_ptr;
> +
> +	/**
> +	 * @values_ptr: Pointer to an array of u64 values of length
> +	 * @fence_count.
> +	 * Values must be 0 for a binary drm_syncobj. A Value of 0 for a
> +	 * timeline drm_syncobj is invalid as it turns a drm_syncobj into a
> +	 * binary one.
> +	 */
> +	__u64 values_ptr;
> +};
> +
> +/**
> + * struct drm_i915_vm_bind_user_fence - An input or output user fence for the
> + * vm_bind or the vm_unbind work.
> + *
> + * The vm_bind or vm_unbind aync worker will wait for the input fence (value at
> + * @addr to become equal to @val) before starting the binding or unbinding.
> + *
> + * The vm_bind or vm_unbind async worker will signal the output fence after
> + * the completion of binding or unbinding by writing @val to memory location at
> + * @addr
> + */
> +struct drm_i915_vm_bind_user_fence {
> +	/** @addr: User/Memory fence qword aligned process virtual address */
> +	__u64 addr;
> +
> +	/** @val: User/Memory fence value to be written after bind completion */
> +	__u64 val;
> +
> +	/**
> +	 * @flags: Supported flags are,
> +	 *
> +	 * I915_VM_BIND_USER_FENCE_WAIT:
> +	 * Wait for the input fence before binding/unbinding
> +	 *
> +	 * I915_VM_BIND_USER_FENCE_SIGNAL:
> +	 * Return bind/unbind completion fence as output
> +	 */
> +	__u32 flags;
> +#define I915_VM_BIND_USER_FENCE_WAIT            (1<<0)
> +#define I915_VM_BIND_USER_FENCE_SIGNAL          (1<<1)
> +#define __I915_VM_BIND_USER_FENCE_UNKNOWN_FLAGS \
> +	(-(I915_VM_BIND_USER_FENCE_SIGNAL << 1))
> +};
> +
> +/**
> + * struct drm_i915_vm_bind_ext_user_fence - User/memory fences for vm_bind
> + * and vm_unbind.
> + *
> + * These user fences can be input or output fences
> + * (See struct drm_i915_vm_bind_user_fence).
> + */
> +struct drm_i915_vm_bind_ext_user_fence {
> +#define I915_VM_BIND_EXT_USER_FENCES	1
> +	/** @base: Extension link. See struct i915_user_extension. */
> +	struct i915_user_extension base;
> +
> +	/** @fence_count: Number of elements in the @user_fence_ptr array. */
> +	__u64 fence_count;
> +
> +	/**
> +	 * @user_fence_ptr: Pointer to an array of
> +	 * struct drm_i915_vm_bind_user_fence of length @fence_count.
> +	 */
> +	__u64 user_fence_ptr;
> +};
> +
> +/**
> + * struct drm_i915_gem_execbuffer_ext_batch_addresses - Array of batch buffer
> + * gpu virtual addresses.
> + *
> + * In the execbuff ioctl (See struct drm_i915_gem_execbuffer2), this extension
> + * must always be appended in the VM_BIND mode and it will be an error to
> + * append this extension in older non-VM_BIND mode.
> + */
> +struct drm_i915_gem_execbuffer_ext_batch_addresses {
> +#define DRM_I915_GEM_EXECBUFFER_EXT_BATCH_ADDRESSES	1
> +	/** @base: Extension link. See struct i915_user_extension. */
> +	struct i915_user_extension base;
> +
> +	/** @count: Number of addresses in the addr array. */
> +	__u32 count;
> +
> +	/** @addr: An array of batch gpu virtual addresses. */
> +	__u64 addr[0];
> +};
> +
> +/**
> + * struct drm_i915_gem_execbuffer_ext_user_fence - First level batch completion
> + * signaling extension.
> + *
> + * This extension allows user to attach a user fence (@addr, @value pair) to an
> + * execbuf to be signaled by the command streamer after the completion of first
> + * level batch, by writing the @value at specified @addr and triggering an
> + * interrupt.
> + * User can either poll for this user fence to signal or can also wait on it
> + * with i915_gem_wait_user_fence ioctl.
> + * This is very much usefaul for long running contexts where waiting on dma-fence
> + * by user (like i915_gem_wait ioctl) is not supported.
> + */
> +struct drm_i915_gem_execbuffer_ext_user_fence {
> +#define DRM_I915_GEM_EXECBUFFER_EXT_USER_FENCE		2
> +	/** @base: Extension link. See struct i915_user_extension. */
> +	struct i915_user_extension base;
> +
> +	/**
> +	 * @addr: User/Memory fence qword aligned GPU virtual address.
> +	 *
> +	 * Address has to be a valid GPU virtual address at the time of
> +	 * first level batch completion.
> +	 */
> +	__u64 addr;
> +
> +	/**
> +	 * @value: User/Memory fence Value to be written to above address
> +	 * after first level batch completes.
> +	 */
> +	__u64 value;
> +
> +	/** @rsvd: Reserved for future extensions, MBZ */
> +	__u64 rsvd;
> +};
> +
> +/**
> + * struct drm_i915_gem_create_ext_vm_private - Extension to make the object
> + * private to the specified VM.
> + *
> + * See struct drm_i915_gem_create_ext.
> + */
> +struct drm_i915_gem_create_ext_vm_private {
> +#define I915_GEM_CREATE_EXT_VM_PRIVATE		2
> +	/** @base: Extension link. See struct i915_user_extension. */
> +	struct i915_user_extension base;
> +
> +	/** @vm_id: Id of the VM to which the object is private */
> +	__u32 vm_id;
> +};
> +
> +/**
> + * struct drm_i915_gem_wait_user_fence - Wait on user/memory fence.
> + *
> + * User/Memory fence can be woken up either by:
> + *
> + * 1. GPU context indicated by @ctx_id, or,
> + * 2. Kerrnel driver async worker upon I915_UFENCE_WAIT_SOFT.
> + *    @ctx_id is ignored when this flag is set.
> + *
> + * Wakeup condition is,
> + * ``((*addr & mask) op (value & mask))``
> + *
> + * See :ref:`Documentation/driver-api/dma-buf.rst <indefinite_dma_fences>`
> + */
> +struct drm_i915_gem_wait_user_fence {
> +	/** @extensions: Zero-terminated chain of extensions. */
> +	__u64 extensions;
> +
> +	/** @addr: User/Memory fence address */
> +	__u64 addr;
> +
> +	/** @ctx_id: Id of the Context which will signal the fence. */
> +	__u32 ctx_id;
> +
> +	/** @op: Wakeup condition operator */
> +	__u16 op;
> +#define I915_UFENCE_WAIT_EQ      0
> +#define I915_UFENCE_WAIT_NEQ     1
> +#define I915_UFENCE_WAIT_GT      2
> +#define I915_UFENCE_WAIT_GTE     3
> +#define I915_UFENCE_WAIT_LT      4
> +#define I915_UFENCE_WAIT_LTE     5
> +#define I915_UFENCE_WAIT_BEFORE  6
> +#define I915_UFENCE_WAIT_AFTER   7
> +
> +	/**
> +	 * @flags: Supported flags are,
> +	 *
> +	 * I915_UFENCE_WAIT_SOFT:
> +	 *
> +	 * To be woken up by i915 driver async worker (not by GPU).
> +	 *
> +	 * I915_UFENCE_WAIT_ABSTIME:
> +	 *
> +	 * Wait timeout specified as absolute time.
> +	 */
> +	__u16 flags;
> +#define I915_UFENCE_WAIT_SOFT    0x1
> +#define I915_UFENCE_WAIT_ABSTIME 0x2
> +
> +	/** @value: Wakeup value */
> +	__u64 value;
> +
> +	/** @mask: Wakeup mask */
> +	__u64 mask;
> +#define I915_UFENCE_WAIT_U8     0xffu
> +#define I915_UFENCE_WAIT_U16    0xffffu
> +#define I915_UFENCE_WAIT_U32    0xfffffffful
> +#define I915_UFENCE_WAIT_U64    0xffffffffffffffffull
> +
> +	/**
> +	 * @timeout: Wait timeout in nanoseconds.
> +	 *
> +	 * If I915_UFENCE_WAIT_ABSTIME flag is set, then time timeout is the
> +	 * absolute time in nsec.
> +	 */
> +	__s64 timeout;
> +};

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [Intel-gfx] [RFC v3 3/3] drm/doc/rfc: VM_BIND uapi definition
  2022-06-03  6:53             ` Niranjana Vishwanathapura
@ 2022-06-07 10:42               ` Tvrtko Ursulin
  2022-06-07 21:25                 ` Niranjana Vishwanathapura
  2022-06-08  6:40               ` Lionel Landwerlin
  2022-06-08  7:12               ` Lionel Landwerlin
  2 siblings, 1 reply; 121+ messages in thread
From: Tvrtko Ursulin @ 2022-06-07 10:42 UTC (permalink / raw)
  To: Niranjana Vishwanathapura, Daniel Vetter
  Cc: Zanoni, Paulo R, intel-gfx, dri-devel, Hellstrom, Thomas, Wilson,
	Chris P, Vetter, Daniel, christian.koenig


On 03/06/2022 07:53, Niranjana Vishwanathapura wrote:
> On Wed, Jun 01, 2022 at 10:08:35PM -0700, Niranjana Vishwanathapura wrote:
>> On Wed, Jun 01, 2022 at 11:27:17AM +0200, Daniel Vetter wrote:
>>> On Wed, 1 Jun 2022 at 11:03, Dave Airlie <airlied@gmail.com> wrote:
>>>>
>>>> On Tue, 24 May 2022 at 05:20, Niranjana Vishwanathapura
>>>> <niranjana.vishwanathapura@intel.com> wrote:
>>>>>
>>>>> On Thu, May 19, 2022 at 04:07:30PM -0700, Zanoni, Paulo R wrote:
>>>>> >On Tue, 2022-05-17 at 11:32 -0700, Niranjana Vishwanathapura wrote:
>>>>> >> VM_BIND and related uapi definitions
>>>>> >>
>>>>> >> v2: Ensure proper kernel-doc formatting with cross references.
>>>>> >>     Also add new uapi and documentation as per review comments
>>>>> >>     from Daniel.
>>>>> >>
>>>>> >> Signed-off-by: Niranjana Vishwanathapura 
>>>>> <niranjana.vishwanathapura@intel.com>
>>>>> >> ---
>>>>> >>  Documentation/gpu/rfc/i915_vm_bind.h | 399 
>>>>> +++++++++++++++++++++++++++
>>>>> >>  1 file changed, 399 insertions(+)
>>>>> >>  create mode 100644 Documentation/gpu/rfc/i915_vm_bind.h
>>>>> >>
>>>>> >> diff --git a/Documentation/gpu/rfc/i915_vm_bind.h 
>>>>> b/Documentation/gpu/rfc/i915_vm_bind.h
>>>>> >> new file mode 100644
>>>>> >> index 000000000000..589c0a009107
>>>>> >> --- /dev/null
>>>>> >> +++ b/Documentation/gpu/rfc/i915_vm_bind.h
>>>>> >> @@ -0,0 +1,399 @@
>>>>> >> +/* SPDX-License-Identifier: MIT */
>>>>> >> +/*
>>>>> >> + * Copyright © 2022 Intel Corporation
>>>>> >> + */
>>>>> >> +
>>>>> >> +/**
>>>>> >> + * DOC: I915_PARAM_HAS_VM_BIND
>>>>> >> + *
>>>>> >> + * VM_BIND feature availability.
>>>>> >> + * See typedef drm_i915_getparam_t param.
>>>>> >> + */
>>>>> >> +#define I915_PARAM_HAS_VM_BIND               57
>>>>> >> +
>>>>> >> +/**
>>>>> >> + * DOC: I915_VM_CREATE_FLAGS_USE_VM_BIND
>>>>> >> + *
>>>>> >> + * Flag to opt-in for VM_BIND mode of binding during VM creation.
>>>>> >> + * See struct drm_i915_gem_vm_control flags.
>>>>> >> + *
>>>>> >> + * A VM in VM_BIND mode will not support the older execbuff 
>>>>> mode of binding.
>>>>> >> + * In VM_BIND mode, execbuff ioctl will not accept any execlist 
>>>>> (ie., the
>>>>> >> + * &drm_i915_gem_execbuffer2.buffer_count must be 0).
>>>>> >> + * Also, &drm_i915_gem_execbuffer2.batch_start_offset and
>>>>> >> + * &drm_i915_gem_execbuffer2.batch_len must be 0.
>>>>> >> + * DRM_I915_GEM_EXECBUFFER_EXT_BATCH_ADDRESSES extension must 
>>>>> be provided
>>>>> >> + * to pass in the batch buffer addresses.
>>>>> >> + *
>>>>> >> + * Additionally, I915_EXEC_NO_RELOC, I915_EXEC_HANDLE_LUT and
>>>>> >> + * I915_EXEC_BATCH_FIRST of &drm_i915_gem_execbuffer2.flags 
>>>>> must be 0
>>>>> >> + * (not used) in VM_BIND mode. I915_EXEC_USE_EXTENSIONS flag 
>>>>> must always be
>>>>> >> + * set (See struct drm_i915_gem_execbuffer_ext_batch_addresses).
>>>>> >> + * The buffers_ptr, buffer_count, batch_start_offset and 
>>>>> batch_len fields
>>>>> >> + * of struct drm_i915_gem_execbuffer2 are also not used and 
>>>>> must be 0.
>>>>> >> + */
>>>>> >
>>>>> >From that description, it seems we have:
>>>>> >
>>>>> >struct drm_i915_gem_execbuffer2 {
>>>>> >        __u64 buffers_ptr;              -> must be 0 (new)
>>>>> >        __u32 buffer_count;             -> must be 0 (new)
>>>>> >        __u32 batch_start_offset;       -> must be 0 (new)
>>>>> >        __u32 batch_len;                -> must be 0 (new)
>>>>> >        __u32 DR1;                      -> must be 0 (old)
>>>>> >        __u32 DR4;                      -> must be 0 (old)
>>>>> >        __u32 num_cliprects; (fences)   -> must be 0 since using 
>>>>> extensions
>>>>> >        __u64 cliprects_ptr; (fences, extensions) -> contains an 
>>>>> actual pointer!
>>>>> >        __u64 flags;                    -> some flags must be 0 (new)
>>>>> >        __u64 rsvd1; (context info)     -> repurposed field (old)
>>>>> >        __u64 rsvd2;                    -> unused
>>>>> >};
>>>>> >
>>>>> >Based on that, why can't we just get drm_i915_gem_execbuffer3 instead
>>>>> >of adding even more complexity to an already abused interface? While
>>>>> >the Vulkan-like extension thing is really nice, I don't think what
>>>>> >we're doing here is extending the ioctl usage, we're completely
>>>>> >changing how the base struct should be interpreted based on how 
>>>>> the VM
>>>>> >was created (which is an entirely different ioctl).
>>>>> >
>>>>> >From Rusty Russel's API Design grading, drm_i915_gem_execbuffer2 is
>>>>> >already at -6 without these changes. I think after vm_bind we'll need
>>>>> >to create a -11 entry just to deal with this ioctl.
>>>>> >
>>>>>
>>>>> The only change here is removing the execlist support for VM_BIND
>>>>> mode (other than natual extensions).
>>>>> Adding a new execbuffer3 was considered, but I think we need to be 
>>>>> careful
>>>>> with that as that goes beyond the VM_BIND support, including any 
>>>>> future
>>>>> requirements (as we don't want an execbuffer4 after VM_BIND).
>>>>
>>>> Why not? it's not like adding extensions here is really that different
>>>> than adding new ioctls.
>>>>
>>>> I definitely think this deserves an execbuffer3 without even
>>>> considering future requirements. Just  to burn down the old
>>>> requirements and pointless fields.
>>>>
>>>> Make execbuffer3 be vm bind only, no relocs, no legacy bits, leave the
>>>> older sw on execbuf2 for ever.
>>>
>>> I guess another point in favour of execbuf3 would be that it's less
>>> midlayer. If we share the entry point then there's quite a few vfuncs
>>> needed to cleanly split out the vm_bind paths from the legacy
>>> reloc/softping paths.
>>>
>>> If we invert this and do execbuf3, then there's the existing ioctl
>>> vfunc, and then we share code (where it even makes sense, probably
>>> request setup/submit need to be shared, anything else is probably
>>> cleaner to just copypaste) with the usual helper approach.
>>>
>>> Also that would guarantee that really none of the old concepts like
>>> i915_active on the vma or vma open counts and all that stuff leaks
>>> into the new vm_bind execbuf.
>>>
>>> Finally I also think that copypasting would make backporting easier,
>>> or at least more flexible, since it should make it easier to have the
>>> upstream vm_bind co-exist with all the other things we have. Without
>>> huge amounts of conflicts (or at least much less) that pushing a pile
>>> of vfuncs into the existing code would cause.
>>>
>>> So maybe we should do this?
>>
>> Thanks Dave, Daniel.
>> There are a few things that will be common between execbuf2 and
>> execbuf3, like request setup/submit (as you said), fence handling 
>> (timeline fences, fence array, composite fences), engine selection,
>> etc. Also, many of the 'flags' will be there in execbuf3 also (but
>> bit position will differ).
>> But I guess these should be fine as the suggestion here is to
>> copy-paste the execbuff code and having a shared code where possible.
>> Besides, we can stop supporting some older feature in execbuff3
>> (like fence array in favor of newer timeline fences), which will
>> further reduce common code.
>>
>> Ok, I will update this series by adding execbuf3 and send out soon.
>>
> 
> Does this sound reasonable?
> 
> struct drm_i915_gem_execbuffer3 {
>         __u32 ctx_id;        /* previously execbuffer2.rsvd1 */
> 
>         __u32 batch_count;
>         __u64 batch_addr_ptr;    /* Pointer to an array of batch gpu 
> virtual addresses */

Casual stumble upon..

Alternatively you could embed N pointers to make life a bit easier for 
both userspace and kernel side. Yes, but then "N batch buffers should be 
enough for everyone" problem.. :)

> 
>         __u64 flags;
> #define I915_EXEC3_RING_MASK              (0x3f)
> #define I915_EXEC3_DEFAULT                (0<<0)
> #define I915_EXEC3_RENDER                 (1<<0)
> #define I915_EXEC3_BSD                    (2<<0)
> #define I915_EXEC3_BLT                    (3<<0)
> #define I915_EXEC3_VEBOX                  (4<<0)
> 
> #define I915_EXEC3_SECURE               (1<<6)
> #define I915_EXEC3_IS_PINNED            (1<<7)
> 
> #define I915_EXEC3_BSD_SHIFT     (8)
> #define I915_EXEC3_BSD_MASK      (3 << I915_EXEC3_BSD_SHIFT)
> #define I915_EXEC3_BSD_DEFAULT   (0 << I915_EXEC3_BSD_SHIFT)
> #define I915_EXEC3_BSD_RING1     (1 << I915_EXEC3_BSD_SHIFT)
> #define I915_EXEC3_BSD_RING2     (2 << I915_EXEC3_BSD_SHIFT)

I'd suggest legacy engine selection is unwanted, especially not with the 
convoluted BSD1/2 flags. Can we just require context with engine map and 
index? Or if default context has to be supported then I'd suggest 
...class_instance for that mode.

> #define I915_EXEC3_FENCE_IN             (1<<10)
> #define I915_EXEC3_FENCE_OUT            (1<<11)
> #define I915_EXEC3_FENCE_SUBMIT         (1<<12)

People are likely to object to submit fence since generic mechanism to 
align submissions was rejected.

> 
>         __u64 in_out_fence;        /* previously execbuffer2.rsvd2 */

New ioctl you can afford dedicated fields.

In any case I suggest you involve UMD folks in designing it.

Regards,

Tvrtko

> 
>         __u64 extensions;        /* currently only for 
> DRM_I915_GEM_EXECBUFFER_EXT_TIMELINE_FENCES */
> };
> 
> With this, user can pass in batch addresses and count directly,
> instead of as an extension (as this rfc series was proposing).
> 
> I have removed many of the flags which were either legacy or not
> applicable to BM_BIND mode.
> I have also removed fence array support (execbuffer2.cliprects_ptr)
> as we have timeline fence array support. Is that fine?
> Do we still need FENCE_IN/FENCE_OUT/FENCE_SUBMIT support?
> 
> Any thing else needs to be added or removed?
> 
> Niranjana
> 
>> Niranjana
>>
>>> -Daniel
>>> -- 
>>> Daniel Vetter
>>> Software Engineer, Intel Corporation
>>> http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [Intel-gfx] [RFC v3 1/3] drm/doc/rfc: VM_BIND feature design document
  2022-06-03 23:51               ` Niranjana Vishwanathapura
@ 2022-06-07 17:12                 ` Jason Ekstrand
  -1 siblings, 0 replies; 121+ messages in thread
From: Jason Ekstrand @ 2022-06-07 17:12 UTC (permalink / raw)
  To: Niranjana Vishwanathapura
  Cc: Chris Wilson, Intel GFX, Maling list - DRI developers,
	Thomas Hellstrom, Lionel Landwerlin, Daniel Vetter,
	Christian König

[-- Attachment #1: Type: text/plain, Size: 11999 bytes --]

On Fri, Jun 3, 2022 at 6:52 PM Niranjana Vishwanathapura <
niranjana.vishwanathapura@intel.com> wrote:

> On Fri, Jun 03, 2022 at 10:20:25AM +0300, Lionel Landwerlin wrote:
> >   On 02/06/2022 23:35, Jason Ekstrand wrote:
> >
> >     On Thu, Jun 2, 2022 at 3:11 PM Niranjana Vishwanathapura
> >     <niranjana.vishwanathapura@intel.com> wrote:
> >
> >       On Wed, Jun 01, 2022 at 01:28:36PM -0700, Matthew Brost wrote:
> >       >On Wed, Jun 01, 2022 at 05:25:49PM +0300, Lionel Landwerlin wrote:
> >       >> On 17/05/2022 21:32, Niranjana Vishwanathapura wrote:
> >       >> > +VM_BIND/UNBIND ioctl will immediately start binding/unbinding
> >       the mapping in an
> >       >> > +async worker. The binding and unbinding will work like a
> special
> >       GPU engine.
> >       >> > +The binding and unbinding operations are serialized and will
> >       wait on specified
> >       >> > +input fences before the operation and will signal the output
> >       fences upon the
> >       >> > +completion of the operation. Due to serialization,
> completion of
> >       an operation
> >       >> > +will also indicate that all previous operations are also
> >       complete.
> >       >>
> >       >> I guess we should avoid saying "will immediately start
> >       binding/unbinding" if
> >       >> there are fences involved.
> >       >>
> >       >> And the fact that it's happening in an async worker seem to
> imply
> >       it's not
> >       >> immediate.
> >       >>
> >
> >       Ok, will fix.
> >       This was added because in earlier design binding was deferred until
> >       next execbuff.
> >       But now it is non-deferred (immediate in that sense). But yah,
> this is
> >       confusing
> >       and will fix it.
> >
> >       >>
> >       >> I have a question on the behavior of the bind operation when no
> >       input fence
> >       >> is provided. Let say I do :
> >       >>
> >       >> VM_BIND (out_fence=fence1)
> >       >>
> >       >> VM_BIND (out_fence=fence2)
> >       >>
> >       >> VM_BIND (out_fence=fence3)
> >       >>
> >       >>
> >       >> In what order are the fences going to be signaled?
> >       >>
> >       >> In the order of VM_BIND ioctls? Or out of order?
> >       >>
> >       >> Because you wrote "serialized I assume it's : in order
> >       >>
> >
> >       Yes, in the order of VM_BIND/UNBIND ioctls. Note that bind and
> unbind
> >       will use
> >       the same queue and hence are ordered.
> >
> >       >>
> >       >> One thing I didn't realize is that because we only get one
> >       "VM_BIND" engine,
> >       >> there is a disconnect from the Vulkan specification.
> >       >>
> >       >> In Vulkan VM_BIND operations are serialized but per engine.
> >       >>
> >       >> So you could have something like this :
> >       >>
> >       >> VM_BIND (engine=rcs0, in_fence=fence1, out_fence=fence2)
> >       >>
> >       >> VM_BIND (engine=ccs0, in_fence=fence3, out_fence=fence4)
> >       >>
> >       >>
> >       >> fence1 is not signaled
> >       >>
> >       >> fence3 is signaled
> >       >>
> >       >> So the second VM_BIND will proceed before the first VM_BIND.
> >       >>
> >       >>
> >       >> I guess we can deal with that scenario in userspace by doing the
> >       wait
> >       >> ourselves in one thread per engines.
> >       >>
> >       >> But then it makes the VM_BIND input fences useless.
> >       >>
> >       >>
> >       >> Daniel : what do you think? Should be rework this or just deal
> with
> >       wait
> >       >> fences in userspace?
> >       >>
> >       >
> >       >My opinion is rework this but make the ordering via an engine
> param
> >       optional.
> >       >
> >       >e.g. A VM can be configured so all binds are ordered within the VM
> >       >
> >       >e.g. A VM can be configured so all binds accept an engine argument
> >       (in
> >       >the case of the i915 likely this is a gem context handle) and
> binds
> >       >ordered with respect to that engine.
> >       >
> >       >This gives UMDs options as the later likely consumes more KMD
> >       resources
> >       >so if a different UMD can live with binds being ordered within
> the VM
> >       >they can use a mode consuming less resources.
> >       >
> >
> >       I think we need to be careful here if we are looking for some out
> of
> >       (submission) order completion of vm_bind/unbind.
> >       In-order completion means, in a batch of binds and unbinds to be
> >       completed in-order, user only needs to specify in-fence for the
> >       first bind/unbind call and the our-fence for the last bind/unbind
> >       call. Also, the VA released by an unbind call can be re-used by
> >       any subsequent bind call in that in-order batch.
> >
> >       These things will break if binding/unbinding were to be allowed to
> >       go out of order (of submission) and user need to be extra careful
> >       not to run into pre-mature triggereing of out-fence and bind
> failing
> >       as VA is still in use etc.
> >
> >       Also, VM_BIND binds the provided mapping on the specified address
> >       space
> >       (VM). So, the uapi is not engine/context specific.
> >
> >       We can however add a 'queue' to the uapi which can be one from the
> >       pre-defined queues,
> >       I915_VM_BIND_QUEUE_0
> >       I915_VM_BIND_QUEUE_1
> >       ...
> >       I915_VM_BIND_QUEUE_(N-1)
> >
> >       KMD will spawn an async work queue for each queue which will only
> >       bind the mappings on that queue in the order of submission.
> >       User can assign the queue to per engine or anything like that.
> >
> >       But again here, user need to be careful and not deadlock these
> >       queues with circular dependency of fences.
> >
> >       I prefer adding this later an as extension based on whether it
> >       is really helping with the implementation.
> >
> >     I can tell you right now that having everything on a single in-order
> >     queue will not get us the perf we want.  What vulkan really wants is
> one
> >     of two things:
> >      1. No implicit ordering of VM_BIND ops.  They just happen in
> whatever
> >     their dependencies are resolved and we ensure ordering ourselves by
> >     having a syncobj in the VkQueue.
> >      2. The ability to create multiple VM_BIND queues.  We need at least
> 2
> >     but I don't see why there needs to be a limit besides the limits the
> >     i915 API already has on the number of engines.  Vulkan could expose
> >     multiple sparse binding queues to the client if it's not arbitrarily
> >     limited.
>
> Thanks Jason, Lionel.
>
> Jason, what are you referring to when you say "limits the i915 API already
> has on the number of engines"? I am not sure if there is such an uapi
> today.
>

There's a limit of something like 64 total engines today based on the
number of bits we can cram into the exec flags in execbuffer2.  I think
someone had an extended version that allowed more but I ripped it out
because no one was using it.  Of course, execbuffer3 might not have that
problem at all.

I am trying to see how many queues we need and don't want it to be
> arbitrarily
> large and unduely blow up memory usage and complexity in i915 driver.
>

I expect a Vulkan driver to use at most 2 in the vast majority of cases. I
could imagine a client wanting to create more than 1 sparse queue in which
case, it'll be N+1 but that's unlikely.  As far as complexity goes, once
you allow two, I don't think the complexity is going up by allowing N.  As
for memory usage, creating more queues means more memory.  That's a
trade-off that userspace can make.  Again, the expected number here is 1 or
2 in the vast majority of cases so I don't think you need to worry.


> >     Why?  Because Vulkan has two basic kind of bind operations and we
> don't
> >     want any dependencies between them:
> >      1. Immediate.  These happen right after BO creation or maybe as
> part of
> >     vkBindImageMemory() or VkBindBufferMemory().  These don't happen on a
> >     queue and we don't want them serialized with anything.  To
> synchronize
> >     with submit, we'll have a syncobj in the VkDevice which is signaled
> by
> >     all immediate bind operations and make submits wait on it.
> >      2. Queued (sparse): These happen on a VkQueue which may be the same
> as
> >     a render/compute queue or may be its own queue.  It's up to us what
> we
> >     want to advertise.  From the Vulkan API PoV, this is like any other
> >     queue.  Operations on it wait on and signal semaphores.  If we have a
> >     VM_BIND engine, we'd provide syncobjs to wait and signal just like
> we do
> >     in execbuf().
> >     The important thing is that we don't want one type of operation to
> block
> >     on the other.  If immediate binds are blocking on sparse binds, it's
> >     going to cause over-synchronization issues.
> >     In terms of the internal implementation, I know that there's going
> to be
> >     a lock on the VM and that we can't actually do these things in
> >     parallel.  That's fine.  Once the dma_fences have signaled and we're
>
> Thats correct. It is like a single VM_BIND engine with multiple queues
> feeding to it.
>

Right.  As long as the queues themselves are independent and can block on
dma_fences without holding up other queues, I think we're fine.


> >     unblocked to do the bind operation, I don't care if there's a bit of
> >     synchronization due to locking.  That's expected.  What we can't
> afford
> >     to have is an immediate bind operation suddenly blocking on a sparse
> >     operation which is blocked on a compute job that's going to run for
> >     another 5ms.
>
> As the VM_BIND queue is per VM, VM_BIND on one VM doesn't block the VM_BIND
> on other VMs. I am not sure about usecases here, but just wanted to
> clarify.
>

Yes, that's what I would expect.

--Jason



> Niranjana
>
> >     For reference, Windows solves this by allowing arbitrarily many
> paging
> >     queues (what they call a VM_BIND engine/queue).  That design works
> >     pretty well and solves the problems in question.  Again, we could
> just
> >     make everything out-of-order and require using syncobjs to order
> things
> >     as userspace wants. That'd be fine too.
> >     One more note while I'm here: danvet said something on IRC about
> VM_BIND
> >     queues waiting for syncobjs to materialize.  We don't really
> want/need
> >     this.  We already have all the machinery in userspace to handle
> >     wait-before-signal and waiting for syncobj fences to materialize and
> >     that machinery is on by default.  It would actually take MORE work in
> >     Mesa to turn it off and take advantage of the kernel being able to
> wait
> >     for syncobjs to materialize.  Also, getting that right is
> ridiculously
> >     hard and I really don't want to get it wrong in kernel space.  When
> we
> >     do memory fences, wait-before-signal will be a thing.  We don't need
> to
> >     try and make it a thing for syncobj.
> >     --Jason
> >
> >   Thanks Jason,
> >
> >   I missed the bit in the Vulkan spec that we're allowed to have a sparse
> >   queue that does not implement either graphics or compute operations :
> >
> >     "While some implementations may include VK_QUEUE_SPARSE_BINDING_BIT
> >     support in queue families that also include
> >
> >      graphics and compute support, other implementations may only expose
> a
> >     VK_QUEUE_SPARSE_BINDING_BIT-only queue
> >
> >      family."
> >
> >   So it can all be all a vm_bind engine that just does bind/unbind
> >   operations.
> >
> >   But yes we need another engine for the immediate/non-sparse operations.
> >
> >   -Lionel
> >
> >
> >
> >       Daniel, any thoughts?
> >
> >       Niranjana
> >
> >       >Matt
> >       >
> >       >>
> >       >> Sorry I noticed this late.
> >       >>
> >       >>
> >       >> -Lionel
> >       >>
> >       >>
>

[-- Attachment #2: Type: text/html, Size: 15906 bytes --]

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [Intel-gfx] [RFC v3 1/3] drm/doc/rfc: VM_BIND feature design document
@ 2022-06-07 17:12                 ` Jason Ekstrand
  0 siblings, 0 replies; 121+ messages in thread
From: Jason Ekstrand @ 2022-06-07 17:12 UTC (permalink / raw)
  To: Niranjana Vishwanathapura
  Cc: Chris Wilson, Intel GFX, Maling list - DRI developers,
	Thomas Hellstrom, Daniel Vetter, Christian König

[-- Attachment #1: Type: text/plain, Size: 11999 bytes --]

On Fri, Jun 3, 2022 at 6:52 PM Niranjana Vishwanathapura <
niranjana.vishwanathapura@intel.com> wrote:

> On Fri, Jun 03, 2022 at 10:20:25AM +0300, Lionel Landwerlin wrote:
> >   On 02/06/2022 23:35, Jason Ekstrand wrote:
> >
> >     On Thu, Jun 2, 2022 at 3:11 PM Niranjana Vishwanathapura
> >     <niranjana.vishwanathapura@intel.com> wrote:
> >
> >       On Wed, Jun 01, 2022 at 01:28:36PM -0700, Matthew Brost wrote:
> >       >On Wed, Jun 01, 2022 at 05:25:49PM +0300, Lionel Landwerlin wrote:
> >       >> On 17/05/2022 21:32, Niranjana Vishwanathapura wrote:
> >       >> > +VM_BIND/UNBIND ioctl will immediately start binding/unbinding
> >       the mapping in an
> >       >> > +async worker. The binding and unbinding will work like a
> special
> >       GPU engine.
> >       >> > +The binding and unbinding operations are serialized and will
> >       wait on specified
> >       >> > +input fences before the operation and will signal the output
> >       fences upon the
> >       >> > +completion of the operation. Due to serialization,
> completion of
> >       an operation
> >       >> > +will also indicate that all previous operations are also
> >       complete.
> >       >>
> >       >> I guess we should avoid saying "will immediately start
> >       binding/unbinding" if
> >       >> there are fences involved.
> >       >>
> >       >> And the fact that it's happening in an async worker seem to
> imply
> >       it's not
> >       >> immediate.
> >       >>
> >
> >       Ok, will fix.
> >       This was added because in earlier design binding was deferred until
> >       next execbuff.
> >       But now it is non-deferred (immediate in that sense). But yah,
> this is
> >       confusing
> >       and will fix it.
> >
> >       >>
> >       >> I have a question on the behavior of the bind operation when no
> >       input fence
> >       >> is provided. Let say I do :
> >       >>
> >       >> VM_BIND (out_fence=fence1)
> >       >>
> >       >> VM_BIND (out_fence=fence2)
> >       >>
> >       >> VM_BIND (out_fence=fence3)
> >       >>
> >       >>
> >       >> In what order are the fences going to be signaled?
> >       >>
> >       >> In the order of VM_BIND ioctls? Or out of order?
> >       >>
> >       >> Because you wrote "serialized I assume it's : in order
> >       >>
> >
> >       Yes, in the order of VM_BIND/UNBIND ioctls. Note that bind and
> unbind
> >       will use
> >       the same queue and hence are ordered.
> >
> >       >>
> >       >> One thing I didn't realize is that because we only get one
> >       "VM_BIND" engine,
> >       >> there is a disconnect from the Vulkan specification.
> >       >>
> >       >> In Vulkan VM_BIND operations are serialized but per engine.
> >       >>
> >       >> So you could have something like this :
> >       >>
> >       >> VM_BIND (engine=rcs0, in_fence=fence1, out_fence=fence2)
> >       >>
> >       >> VM_BIND (engine=ccs0, in_fence=fence3, out_fence=fence4)
> >       >>
> >       >>
> >       >> fence1 is not signaled
> >       >>
> >       >> fence3 is signaled
> >       >>
> >       >> So the second VM_BIND will proceed before the first VM_BIND.
> >       >>
> >       >>
> >       >> I guess we can deal with that scenario in userspace by doing the
> >       wait
> >       >> ourselves in one thread per engines.
> >       >>
> >       >> But then it makes the VM_BIND input fences useless.
> >       >>
> >       >>
> >       >> Daniel : what do you think? Should be rework this or just deal
> with
> >       wait
> >       >> fences in userspace?
> >       >>
> >       >
> >       >My opinion is rework this but make the ordering via an engine
> param
> >       optional.
> >       >
> >       >e.g. A VM can be configured so all binds are ordered within the VM
> >       >
> >       >e.g. A VM can be configured so all binds accept an engine argument
> >       (in
> >       >the case of the i915 likely this is a gem context handle) and
> binds
> >       >ordered with respect to that engine.
> >       >
> >       >This gives UMDs options as the later likely consumes more KMD
> >       resources
> >       >so if a different UMD can live with binds being ordered within
> the VM
> >       >they can use a mode consuming less resources.
> >       >
> >
> >       I think we need to be careful here if we are looking for some out
> of
> >       (submission) order completion of vm_bind/unbind.
> >       In-order completion means, in a batch of binds and unbinds to be
> >       completed in-order, user only needs to specify in-fence for the
> >       first bind/unbind call and the our-fence for the last bind/unbind
> >       call. Also, the VA released by an unbind call can be re-used by
> >       any subsequent bind call in that in-order batch.
> >
> >       These things will break if binding/unbinding were to be allowed to
> >       go out of order (of submission) and user need to be extra careful
> >       not to run into pre-mature triggereing of out-fence and bind
> failing
> >       as VA is still in use etc.
> >
> >       Also, VM_BIND binds the provided mapping on the specified address
> >       space
> >       (VM). So, the uapi is not engine/context specific.
> >
> >       We can however add a 'queue' to the uapi which can be one from the
> >       pre-defined queues,
> >       I915_VM_BIND_QUEUE_0
> >       I915_VM_BIND_QUEUE_1
> >       ...
> >       I915_VM_BIND_QUEUE_(N-1)
> >
> >       KMD will spawn an async work queue for each queue which will only
> >       bind the mappings on that queue in the order of submission.
> >       User can assign the queue to per engine or anything like that.
> >
> >       But again here, user need to be careful and not deadlock these
> >       queues with circular dependency of fences.
> >
> >       I prefer adding this later an as extension based on whether it
> >       is really helping with the implementation.
> >
> >     I can tell you right now that having everything on a single in-order
> >     queue will not get us the perf we want.  What vulkan really wants is
> one
> >     of two things:
> >      1. No implicit ordering of VM_BIND ops.  They just happen in
> whatever
> >     their dependencies are resolved and we ensure ordering ourselves by
> >     having a syncobj in the VkQueue.
> >      2. The ability to create multiple VM_BIND queues.  We need at least
> 2
> >     but I don't see why there needs to be a limit besides the limits the
> >     i915 API already has on the number of engines.  Vulkan could expose
> >     multiple sparse binding queues to the client if it's not arbitrarily
> >     limited.
>
> Thanks Jason, Lionel.
>
> Jason, what are you referring to when you say "limits the i915 API already
> has on the number of engines"? I am not sure if there is such an uapi
> today.
>

There's a limit of something like 64 total engines today based on the
number of bits we can cram into the exec flags in execbuffer2.  I think
someone had an extended version that allowed more but I ripped it out
because no one was using it.  Of course, execbuffer3 might not have that
problem at all.

I am trying to see how many queues we need and don't want it to be
> arbitrarily
> large and unduely blow up memory usage and complexity in i915 driver.
>

I expect a Vulkan driver to use at most 2 in the vast majority of cases. I
could imagine a client wanting to create more than 1 sparse queue in which
case, it'll be N+1 but that's unlikely.  As far as complexity goes, once
you allow two, I don't think the complexity is going up by allowing N.  As
for memory usage, creating more queues means more memory.  That's a
trade-off that userspace can make.  Again, the expected number here is 1 or
2 in the vast majority of cases so I don't think you need to worry.


> >     Why?  Because Vulkan has two basic kind of bind operations and we
> don't
> >     want any dependencies between them:
> >      1. Immediate.  These happen right after BO creation or maybe as
> part of
> >     vkBindImageMemory() or VkBindBufferMemory().  These don't happen on a
> >     queue and we don't want them serialized with anything.  To
> synchronize
> >     with submit, we'll have a syncobj in the VkDevice which is signaled
> by
> >     all immediate bind operations and make submits wait on it.
> >      2. Queued (sparse): These happen on a VkQueue which may be the same
> as
> >     a render/compute queue or may be its own queue.  It's up to us what
> we
> >     want to advertise.  From the Vulkan API PoV, this is like any other
> >     queue.  Operations on it wait on and signal semaphores.  If we have a
> >     VM_BIND engine, we'd provide syncobjs to wait and signal just like
> we do
> >     in execbuf().
> >     The important thing is that we don't want one type of operation to
> block
> >     on the other.  If immediate binds are blocking on sparse binds, it's
> >     going to cause over-synchronization issues.
> >     In terms of the internal implementation, I know that there's going
> to be
> >     a lock on the VM and that we can't actually do these things in
> >     parallel.  That's fine.  Once the dma_fences have signaled and we're
>
> Thats correct. It is like a single VM_BIND engine with multiple queues
> feeding to it.
>

Right.  As long as the queues themselves are independent and can block on
dma_fences without holding up other queues, I think we're fine.


> >     unblocked to do the bind operation, I don't care if there's a bit of
> >     synchronization due to locking.  That's expected.  What we can't
> afford
> >     to have is an immediate bind operation suddenly blocking on a sparse
> >     operation which is blocked on a compute job that's going to run for
> >     another 5ms.
>
> As the VM_BIND queue is per VM, VM_BIND on one VM doesn't block the VM_BIND
> on other VMs. I am not sure about usecases here, but just wanted to
> clarify.
>

Yes, that's what I would expect.

--Jason



> Niranjana
>
> >     For reference, Windows solves this by allowing arbitrarily many
> paging
> >     queues (what they call a VM_BIND engine/queue).  That design works
> >     pretty well and solves the problems in question.  Again, we could
> just
> >     make everything out-of-order and require using syncobjs to order
> things
> >     as userspace wants. That'd be fine too.
> >     One more note while I'm here: danvet said something on IRC about
> VM_BIND
> >     queues waiting for syncobjs to materialize.  We don't really
> want/need
> >     this.  We already have all the machinery in userspace to handle
> >     wait-before-signal and waiting for syncobj fences to materialize and
> >     that machinery is on by default.  It would actually take MORE work in
> >     Mesa to turn it off and take advantage of the kernel being able to
> wait
> >     for syncobjs to materialize.  Also, getting that right is
> ridiculously
> >     hard and I really don't want to get it wrong in kernel space.  When
> we
> >     do memory fences, wait-before-signal will be a thing.  We don't need
> to
> >     try and make it a thing for syncobj.
> >     --Jason
> >
> >   Thanks Jason,
> >
> >   I missed the bit in the Vulkan spec that we're allowed to have a sparse
> >   queue that does not implement either graphics or compute operations :
> >
> >     "While some implementations may include VK_QUEUE_SPARSE_BINDING_BIT
> >     support in queue families that also include
> >
> >      graphics and compute support, other implementations may only expose
> a
> >     VK_QUEUE_SPARSE_BINDING_BIT-only queue
> >
> >      family."
> >
> >   So it can all be all a vm_bind engine that just does bind/unbind
> >   operations.
> >
> >   But yes we need another engine for the immediate/non-sparse operations.
> >
> >   -Lionel
> >
> >
> >
> >       Daniel, any thoughts?
> >
> >       Niranjana
> >
> >       >Matt
> >       >
> >       >>
> >       >> Sorry I noticed this late.
> >       >>
> >       >>
> >       >> -Lionel
> >       >>
> >       >>
>

[-- Attachment #2: Type: text/html, Size: 15906 bytes --]

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [Intel-gfx] [RFC v3 1/3] drm/doc/rfc: VM_BIND feature design document
  2022-06-07 17:12                 ` Jason Ekstrand
@ 2022-06-07 18:18                   ` Niranjana Vishwanathapura
  -1 siblings, 0 replies; 121+ messages in thread
From: Niranjana Vishwanathapura @ 2022-06-07 18:18 UTC (permalink / raw)
  To: Jason Ekstrand
  Cc: Chris Wilson, Intel GFX, Maling list - DRI developers,
	Thomas Hellstrom, Lionel Landwerlin, Daniel Vetter,
	Christian König

On Tue, Jun 07, 2022 at 12:12:03PM -0500, Jason Ekstrand wrote:
>   On Fri, Jun 3, 2022 at 6:52 PM Niranjana Vishwanathapura
>   <niranjana.vishwanathapura@intel.com> wrote:
>
>     On Fri, Jun 03, 2022 at 10:20:25AM +0300, Lionel Landwerlin wrote:
>     >   On 02/06/2022 23:35, Jason Ekstrand wrote:
>     >
>     >     On Thu, Jun 2, 2022 at 3:11 PM Niranjana Vishwanathapura
>     >     <niranjana.vishwanathapura@intel.com> wrote:
>     >
>     >       On Wed, Jun 01, 2022 at 01:28:36PM -0700, Matthew Brost wrote:
>     >       >On Wed, Jun 01, 2022 at 05:25:49PM +0300, Lionel Landwerlin
>     wrote:
>     >       >> On 17/05/2022 21:32, Niranjana Vishwanathapura wrote:
>     >       >> > +VM_BIND/UNBIND ioctl will immediately start
>     binding/unbinding
>     >       the mapping in an
>     >       >> > +async worker. The binding and unbinding will work like a
>     special
>     >       GPU engine.
>     >       >> > +The binding and unbinding operations are serialized and
>     will
>     >       wait on specified
>     >       >> > +input fences before the operation and will signal the
>     output
>     >       fences upon the
>     >       >> > +completion of the operation. Due to serialization,
>     completion of
>     >       an operation
>     >       >> > +will also indicate that all previous operations are also
>     >       complete.
>     >       >>
>     >       >> I guess we should avoid saying "will immediately start
>     >       binding/unbinding" if
>     >       >> there are fences involved.
>     >       >>
>     >       >> And the fact that it's happening in an async worker seem to
>     imply
>     >       it's not
>     >       >> immediate.
>     >       >>
>     >
>     >       Ok, will fix.
>     >       This was added because in earlier design binding was deferred
>     until
>     >       next execbuff.
>     >       But now it is non-deferred (immediate in that sense). But yah,
>     this is
>     >       confusing
>     >       and will fix it.
>     >
>     >       >>
>     >       >> I have a question on the behavior of the bind operation when
>     no
>     >       input fence
>     >       >> is provided. Let say I do :
>     >       >>
>     >       >> VM_BIND (out_fence=fence1)
>     >       >>
>     >       >> VM_BIND (out_fence=fence2)
>     >       >>
>     >       >> VM_BIND (out_fence=fence3)
>     >       >>
>     >       >>
>     >       >> In what order are the fences going to be signaled?
>     >       >>
>     >       >> In the order of VM_BIND ioctls? Or out of order?
>     >       >>
>     >       >> Because you wrote "serialized I assume it's : in order
>     >       >>
>     >
>     >       Yes, in the order of VM_BIND/UNBIND ioctls. Note that bind and
>     unbind
>     >       will use
>     >       the same queue and hence are ordered.
>     >
>     >       >>
>     >       >> One thing I didn't realize is that because we only get one
>     >       "VM_BIND" engine,
>     >       >> there is a disconnect from the Vulkan specification.
>     >       >>
>     >       >> In Vulkan VM_BIND operations are serialized but per engine.
>     >       >>
>     >       >> So you could have something like this :
>     >       >>
>     >       >> VM_BIND (engine=rcs0, in_fence=fence1, out_fence=fence2)
>     >       >>
>     >       >> VM_BIND (engine=ccs0, in_fence=fence3, out_fence=fence4)
>     >       >>
>     >       >>
>     >       >> fence1 is not signaled
>     >       >>
>     >       >> fence3 is signaled
>     >       >>
>     >       >> So the second VM_BIND will proceed before the first VM_BIND.
>     >       >>
>     >       >>
>     >       >> I guess we can deal with that scenario in userspace by doing
>     the
>     >       wait
>     >       >> ourselves in one thread per engines.
>     >       >>
>     >       >> But then it makes the VM_BIND input fences useless.
>     >       >>
>     >       >>
>     >       >> Daniel : what do you think? Should be rework this or just
>     deal with
>     >       wait
>     >       >> fences in userspace?
>     >       >>
>     >       >
>     >       >My opinion is rework this but make the ordering via an engine
>     param
>     >       optional.
>     >       >
>     >       >e.g. A VM can be configured so all binds are ordered within the
>     VM
>     >       >
>     >       >e.g. A VM can be configured so all binds accept an engine
>     argument
>     >       (in
>     >       >the case of the i915 likely this is a gem context handle) and
>     binds
>     >       >ordered with respect to that engine.
>     >       >
>     >       >This gives UMDs options as the later likely consumes more KMD
>     >       resources
>     >       >so if a different UMD can live with binds being ordered within
>     the VM
>     >       >they can use a mode consuming less resources.
>     >       >
>     >
>     >       I think we need to be careful here if we are looking for some
>     out of
>     >       (submission) order completion of vm_bind/unbind.
>     >       In-order completion means, in a batch of binds and unbinds to be
>     >       completed in-order, user only needs to specify in-fence for the
>     >       first bind/unbind call and the our-fence for the last
>     bind/unbind
>     >       call. Also, the VA released by an unbind call can be re-used by
>     >       any subsequent bind call in that in-order batch.
>     >
>     >       These things will break if binding/unbinding were to be allowed
>     to
>     >       go out of order (of submission) and user need to be extra
>     careful
>     >       not to run into pre-mature triggereing of out-fence and bind
>     failing
>     >       as VA is still in use etc.
>     >
>     >       Also, VM_BIND binds the provided mapping on the specified
>     address
>     >       space
>     >       (VM). So, the uapi is not engine/context specific.
>     >
>     >       We can however add a 'queue' to the uapi which can be one from
>     the
>     >       pre-defined queues,
>     >       I915_VM_BIND_QUEUE_0
>     >       I915_VM_BIND_QUEUE_1
>     >       ...
>     >       I915_VM_BIND_QUEUE_(N-1)
>     >
>     >       KMD will spawn an async work queue for each queue which will
>     only
>     >       bind the mappings on that queue in the order of submission.
>     >       User can assign the queue to per engine or anything like that.
>     >
>     >       But again here, user need to be careful and not deadlock these
>     >       queues with circular dependency of fences.
>     >
>     >       I prefer adding this later an as extension based on whether it
>     >       is really helping with the implementation.
>     >
>     >     I can tell you right now that having everything on a single
>     in-order
>     >     queue will not get us the perf we want.  What vulkan really wants
>     is one
>     >     of two things:
>     >      1. No implicit ordering of VM_BIND ops.  They just happen in
>     whatever
>     >     their dependencies are resolved and we ensure ordering ourselves
>     by
>     >     having a syncobj in the VkQueue.
>     >      2. The ability to create multiple VM_BIND queues.  We need at
>     least 2
>     >     but I don't see why there needs to be a limit besides the limits
>     the
>     >     i915 API already has on the number of engines.  Vulkan could
>     expose
>     >     multiple sparse binding queues to the client if it's not
>     arbitrarily
>     >     limited.
>
>     Thanks Jason, Lionel.
>
>     Jason, what are you referring to when you say "limits the i915 API
>     already
>     has on the number of engines"? I am not sure if there is such an uapi
>     today.
>
>   There's a limit of something like 64 total engines today based on the
>   number of bits we can cram into the exec flags in execbuffer2.  I think
>   someone had an extended version that allowed more but I ripped it out
>   because no one was using it.  Of course, execbuffer3 might not have that
>   problem at all.
>

Thanks Jason.
Ok, I am not sure which exec flag is that, but yah, execbuffer3 probably
will not have this limiation. So, we need to define a VM_BIND_MAX_QUEUE
and somehow export it to user (I am thinking of embedding it in
I915_PARAM_HAS_VM_BIND. bits[0]->HAS_VM_BIND, bits[1-3]->'n' meaning 2^n
queues.

>     I am trying to see how many queues we need and don't want it to be
>     arbitrarily
>     large and unduely blow up memory usage and complexity in i915 driver.
>
>   I expect a Vulkan driver to use at most 2 in the vast majority of cases. I
>   could imagine a client wanting to create more than 1 sparse queue in which
>   case, it'll be N+1 but that's unlikely.  As far as complexity goes, once
>   you allow two, I don't think the complexity is going up by allowing N.  As
>   for memory usage, creating more queues means more memory.  That's a
>   trade-off that userspace can make.  Again, the expected number here is 1
>   or 2 in the vast majority of cases so I don't think you need to worry.
>    

Ok, will start with n=3 meaning 8 queues.
That would require us create 8 workqueues.
We can change 'n' later if required.

Niranjana

>
>     >     Why?  Because Vulkan has two basic kind of bind operations and we
>     don't
>     >     want any dependencies between them:
>     >      1. Immediate.  These happen right after BO creation or maybe as
>     part of
>     >     vkBindImageMemory() or VkBindBufferMemory().  These don't happen
>     on a
>     >     queue and we don't want them serialized with anything.  To
>     synchronize
>     >     with submit, we'll have a syncobj in the VkDevice which is
>     signaled by
>     >     all immediate bind operations and make submits wait on it.
>     >      2. Queued (sparse): These happen on a VkQueue which may be the
>     same as
>     >     a render/compute queue or may be its own queue.  It's up to us
>     what we
>     >     want to advertise.  From the Vulkan API PoV, this is like any
>     other
>     >     queue.  Operations on it wait on and signal semaphores.  If we
>     have a
>     >     VM_BIND engine, we'd provide syncobjs to wait and signal just like
>     we do
>     >     in execbuf().
>     >     The important thing is that we don't want one type of operation to
>     block
>     >     on the other.  If immediate binds are blocking on sparse binds,
>     it's
>     >     going to cause over-synchronization issues.
>     >     In terms of the internal implementation, I know that there's going
>     to be
>     >     a lock on the VM and that we can't actually do these things in
>     >     parallel.  That's fine.  Once the dma_fences have signaled and
>     we're
>
>     Thats correct. It is like a single VM_BIND engine with multiple queues
>     feeding to it.
>
>   Right.  As long as the queues themselves are independent and can block on
>   dma_fences without holding up other queues, I think we're fine.
>    
>
>     >     unblocked to do the bind operation, I don't care if there's a bit
>     of
>     >     synchronization due to locking.  That's expected.  What we can't
>     afford
>     >     to have is an immediate bind operation suddenly blocking on a
>     sparse
>     >     operation which is blocked on a compute job that's going to run
>     for
>     >     another 5ms.
>
>     As the VM_BIND queue is per VM, VM_BIND on one VM doesn't block the
>     VM_BIND
>     on other VMs. I am not sure about usecases here, but just wanted to
>     clarify.
>
>   Yes, that's what I would expect.
>   --Jason
>    
>
>     Niranjana
>
>     >     For reference, Windows solves this by allowing arbitrarily many
>     paging
>     >     queues (what they call a VM_BIND engine/queue).  That design works
>     >     pretty well and solves the problems in question.  Again, we could
>     just
>     >     make everything out-of-order and require using syncobjs to order
>     things
>     >     as userspace wants. That'd be fine too.
>     >     One more note while I'm here: danvet said something on IRC about
>     VM_BIND
>     >     queues waiting for syncobjs to materialize.  We don't really
>     want/need
>     >     this.  We already have all the machinery in userspace to handle
>     >     wait-before-signal and waiting for syncobj fences to materialize
>     and
>     >     that machinery is on by default.  It would actually take MORE work
>     in
>     >     Mesa to turn it off and take advantage of the kernel being able to
>     wait
>     >     for syncobjs to materialize.  Also, getting that right is
>     ridiculously
>     >     hard and I really don't want to get it wrong in kernel space. 
>     When we
>     >     do memory fences, wait-before-signal will be a thing.  We don't
>     need to
>     >     try and make it a thing for syncobj.
>     >     --Jason
>     >
>     >   Thanks Jason,
>     >
>     >   I missed the bit in the Vulkan spec that we're allowed to have a
>     sparse
>     >   queue that does not implement either graphics or compute operations
>     :
>     >
>     >     "While some implementations may include
>     VK_QUEUE_SPARSE_BINDING_BIT
>     >     support in queue families that also include
>     >
>     >      graphics and compute support, other implementations may only
>     expose a
>     >     VK_QUEUE_SPARSE_BINDING_BIT-only queue
>     >
>     >      family."
>     >
>     >   So it can all be all a vm_bind engine that just does bind/unbind
>     >   operations.
>     >
>     >   But yes we need another engine for the immediate/non-sparse
>     operations.
>     >
>     >   -Lionel
>     >
>     >     
>     >
>     >       Daniel, any thoughts?
>     >
>     >       Niranjana
>     >
>     >       >Matt
>     >       >
>     >       >>
>     >       >> Sorry I noticed this late.
>     >       >>
>     >       >>
>     >       >> -Lionel
>     >       >>
>     >       >>

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [Intel-gfx] [RFC v3 1/3] drm/doc/rfc: VM_BIND feature design document
@ 2022-06-07 18:18                   ` Niranjana Vishwanathapura
  0 siblings, 0 replies; 121+ messages in thread
From: Niranjana Vishwanathapura @ 2022-06-07 18:18 UTC (permalink / raw)
  To: Jason Ekstrand
  Cc: Chris Wilson, Intel GFX, Maling list - DRI developers,
	Thomas Hellstrom, Daniel Vetter, Christian König

On Tue, Jun 07, 2022 at 12:12:03PM -0500, Jason Ekstrand wrote:
>   On Fri, Jun 3, 2022 at 6:52 PM Niranjana Vishwanathapura
>   <niranjana.vishwanathapura@intel.com> wrote:
>
>     On Fri, Jun 03, 2022 at 10:20:25AM +0300, Lionel Landwerlin wrote:
>     >   On 02/06/2022 23:35, Jason Ekstrand wrote:
>     >
>     >     On Thu, Jun 2, 2022 at 3:11 PM Niranjana Vishwanathapura
>     >     <niranjana.vishwanathapura@intel.com> wrote:
>     >
>     >       On Wed, Jun 01, 2022 at 01:28:36PM -0700, Matthew Brost wrote:
>     >       >On Wed, Jun 01, 2022 at 05:25:49PM +0300, Lionel Landwerlin
>     wrote:
>     >       >> On 17/05/2022 21:32, Niranjana Vishwanathapura wrote:
>     >       >> > +VM_BIND/UNBIND ioctl will immediately start
>     binding/unbinding
>     >       the mapping in an
>     >       >> > +async worker. The binding and unbinding will work like a
>     special
>     >       GPU engine.
>     >       >> > +The binding and unbinding operations are serialized and
>     will
>     >       wait on specified
>     >       >> > +input fences before the operation and will signal the
>     output
>     >       fences upon the
>     >       >> > +completion of the operation. Due to serialization,
>     completion of
>     >       an operation
>     >       >> > +will also indicate that all previous operations are also
>     >       complete.
>     >       >>
>     >       >> I guess we should avoid saying "will immediately start
>     >       binding/unbinding" if
>     >       >> there are fences involved.
>     >       >>
>     >       >> And the fact that it's happening in an async worker seem to
>     imply
>     >       it's not
>     >       >> immediate.
>     >       >>
>     >
>     >       Ok, will fix.
>     >       This was added because in earlier design binding was deferred
>     until
>     >       next execbuff.
>     >       But now it is non-deferred (immediate in that sense). But yah,
>     this is
>     >       confusing
>     >       and will fix it.
>     >
>     >       >>
>     >       >> I have a question on the behavior of the bind operation when
>     no
>     >       input fence
>     >       >> is provided. Let say I do :
>     >       >>
>     >       >> VM_BIND (out_fence=fence1)
>     >       >>
>     >       >> VM_BIND (out_fence=fence2)
>     >       >>
>     >       >> VM_BIND (out_fence=fence3)
>     >       >>
>     >       >>
>     >       >> In what order are the fences going to be signaled?
>     >       >>
>     >       >> In the order of VM_BIND ioctls? Or out of order?
>     >       >>
>     >       >> Because you wrote "serialized I assume it's : in order
>     >       >>
>     >
>     >       Yes, in the order of VM_BIND/UNBIND ioctls. Note that bind and
>     unbind
>     >       will use
>     >       the same queue and hence are ordered.
>     >
>     >       >>
>     >       >> One thing I didn't realize is that because we only get one
>     >       "VM_BIND" engine,
>     >       >> there is a disconnect from the Vulkan specification.
>     >       >>
>     >       >> In Vulkan VM_BIND operations are serialized but per engine.
>     >       >>
>     >       >> So you could have something like this :
>     >       >>
>     >       >> VM_BIND (engine=rcs0, in_fence=fence1, out_fence=fence2)
>     >       >>
>     >       >> VM_BIND (engine=ccs0, in_fence=fence3, out_fence=fence4)
>     >       >>
>     >       >>
>     >       >> fence1 is not signaled
>     >       >>
>     >       >> fence3 is signaled
>     >       >>
>     >       >> So the second VM_BIND will proceed before the first VM_BIND.
>     >       >>
>     >       >>
>     >       >> I guess we can deal with that scenario in userspace by doing
>     the
>     >       wait
>     >       >> ourselves in one thread per engines.
>     >       >>
>     >       >> But then it makes the VM_BIND input fences useless.
>     >       >>
>     >       >>
>     >       >> Daniel : what do you think? Should be rework this or just
>     deal with
>     >       wait
>     >       >> fences in userspace?
>     >       >>
>     >       >
>     >       >My opinion is rework this but make the ordering via an engine
>     param
>     >       optional.
>     >       >
>     >       >e.g. A VM can be configured so all binds are ordered within the
>     VM
>     >       >
>     >       >e.g. A VM can be configured so all binds accept an engine
>     argument
>     >       (in
>     >       >the case of the i915 likely this is a gem context handle) and
>     binds
>     >       >ordered with respect to that engine.
>     >       >
>     >       >This gives UMDs options as the later likely consumes more KMD
>     >       resources
>     >       >so if a different UMD can live with binds being ordered within
>     the VM
>     >       >they can use a mode consuming less resources.
>     >       >
>     >
>     >       I think we need to be careful here if we are looking for some
>     out of
>     >       (submission) order completion of vm_bind/unbind.
>     >       In-order completion means, in a batch of binds and unbinds to be
>     >       completed in-order, user only needs to specify in-fence for the
>     >       first bind/unbind call and the our-fence for the last
>     bind/unbind
>     >       call. Also, the VA released by an unbind call can be re-used by
>     >       any subsequent bind call in that in-order batch.
>     >
>     >       These things will break if binding/unbinding were to be allowed
>     to
>     >       go out of order (of submission) and user need to be extra
>     careful
>     >       not to run into pre-mature triggereing of out-fence and bind
>     failing
>     >       as VA is still in use etc.
>     >
>     >       Also, VM_BIND binds the provided mapping on the specified
>     address
>     >       space
>     >       (VM). So, the uapi is not engine/context specific.
>     >
>     >       We can however add a 'queue' to the uapi which can be one from
>     the
>     >       pre-defined queues,
>     >       I915_VM_BIND_QUEUE_0
>     >       I915_VM_BIND_QUEUE_1
>     >       ...
>     >       I915_VM_BIND_QUEUE_(N-1)
>     >
>     >       KMD will spawn an async work queue for each queue which will
>     only
>     >       bind the mappings on that queue in the order of submission.
>     >       User can assign the queue to per engine or anything like that.
>     >
>     >       But again here, user need to be careful and not deadlock these
>     >       queues with circular dependency of fences.
>     >
>     >       I prefer adding this later an as extension based on whether it
>     >       is really helping with the implementation.
>     >
>     >     I can tell you right now that having everything on a single
>     in-order
>     >     queue will not get us the perf we want.  What vulkan really wants
>     is one
>     >     of two things:
>     >      1. No implicit ordering of VM_BIND ops.  They just happen in
>     whatever
>     >     their dependencies are resolved and we ensure ordering ourselves
>     by
>     >     having a syncobj in the VkQueue.
>     >      2. The ability to create multiple VM_BIND queues.  We need at
>     least 2
>     >     but I don't see why there needs to be a limit besides the limits
>     the
>     >     i915 API already has on the number of engines.  Vulkan could
>     expose
>     >     multiple sparse binding queues to the client if it's not
>     arbitrarily
>     >     limited.
>
>     Thanks Jason, Lionel.
>
>     Jason, what are you referring to when you say "limits the i915 API
>     already
>     has on the number of engines"? I am not sure if there is such an uapi
>     today.
>
>   There's a limit of something like 64 total engines today based on the
>   number of bits we can cram into the exec flags in execbuffer2.  I think
>   someone had an extended version that allowed more but I ripped it out
>   because no one was using it.  Of course, execbuffer3 might not have that
>   problem at all.
>

Thanks Jason.
Ok, I am not sure which exec flag is that, but yah, execbuffer3 probably
will not have this limiation. So, we need to define a VM_BIND_MAX_QUEUE
and somehow export it to user (I am thinking of embedding it in
I915_PARAM_HAS_VM_BIND. bits[0]->HAS_VM_BIND, bits[1-3]->'n' meaning 2^n
queues.

>     I am trying to see how many queues we need and don't want it to be
>     arbitrarily
>     large and unduely blow up memory usage and complexity in i915 driver.
>
>   I expect a Vulkan driver to use at most 2 in the vast majority of cases. I
>   could imagine a client wanting to create more than 1 sparse queue in which
>   case, it'll be N+1 but that's unlikely.  As far as complexity goes, once
>   you allow two, I don't think the complexity is going up by allowing N.  As
>   for memory usage, creating more queues means more memory.  That's a
>   trade-off that userspace can make.  Again, the expected number here is 1
>   or 2 in the vast majority of cases so I don't think you need to worry.
>    

Ok, will start with n=3 meaning 8 queues.
That would require us create 8 workqueues.
We can change 'n' later if required.

Niranjana

>
>     >     Why?  Because Vulkan has two basic kind of bind operations and we
>     don't
>     >     want any dependencies between them:
>     >      1. Immediate.  These happen right after BO creation or maybe as
>     part of
>     >     vkBindImageMemory() or VkBindBufferMemory().  These don't happen
>     on a
>     >     queue and we don't want them serialized with anything.  To
>     synchronize
>     >     with submit, we'll have a syncobj in the VkDevice which is
>     signaled by
>     >     all immediate bind operations and make submits wait on it.
>     >      2. Queued (sparse): These happen on a VkQueue which may be the
>     same as
>     >     a render/compute queue or may be its own queue.  It's up to us
>     what we
>     >     want to advertise.  From the Vulkan API PoV, this is like any
>     other
>     >     queue.  Operations on it wait on and signal semaphores.  If we
>     have a
>     >     VM_BIND engine, we'd provide syncobjs to wait and signal just like
>     we do
>     >     in execbuf().
>     >     The important thing is that we don't want one type of operation to
>     block
>     >     on the other.  If immediate binds are blocking on sparse binds,
>     it's
>     >     going to cause over-synchronization issues.
>     >     In terms of the internal implementation, I know that there's going
>     to be
>     >     a lock on the VM and that we can't actually do these things in
>     >     parallel.  That's fine.  Once the dma_fences have signaled and
>     we're
>
>     Thats correct. It is like a single VM_BIND engine with multiple queues
>     feeding to it.
>
>   Right.  As long as the queues themselves are independent and can block on
>   dma_fences without holding up other queues, I think we're fine.
>    
>
>     >     unblocked to do the bind operation, I don't care if there's a bit
>     of
>     >     synchronization due to locking.  That's expected.  What we can't
>     afford
>     >     to have is an immediate bind operation suddenly blocking on a
>     sparse
>     >     operation which is blocked on a compute job that's going to run
>     for
>     >     another 5ms.
>
>     As the VM_BIND queue is per VM, VM_BIND on one VM doesn't block the
>     VM_BIND
>     on other VMs. I am not sure about usecases here, but just wanted to
>     clarify.
>
>   Yes, that's what I would expect.
>   --Jason
>    
>
>     Niranjana
>
>     >     For reference, Windows solves this by allowing arbitrarily many
>     paging
>     >     queues (what they call a VM_BIND engine/queue).  That design works
>     >     pretty well and solves the problems in question.  Again, we could
>     just
>     >     make everything out-of-order and require using syncobjs to order
>     things
>     >     as userspace wants. That'd be fine too.
>     >     One more note while I'm here: danvet said something on IRC about
>     VM_BIND
>     >     queues waiting for syncobjs to materialize.  We don't really
>     want/need
>     >     this.  We already have all the machinery in userspace to handle
>     >     wait-before-signal and waiting for syncobj fences to materialize
>     and
>     >     that machinery is on by default.  It would actually take MORE work
>     in
>     >     Mesa to turn it off and take advantage of the kernel being able to
>     wait
>     >     for syncobjs to materialize.  Also, getting that right is
>     ridiculously
>     >     hard and I really don't want to get it wrong in kernel space. 
>     When we
>     >     do memory fences, wait-before-signal will be a thing.  We don't
>     need to
>     >     try and make it a thing for syncobj.
>     >     --Jason
>     >
>     >   Thanks Jason,
>     >
>     >   I missed the bit in the Vulkan spec that we're allowed to have a
>     sparse
>     >   queue that does not implement either graphics or compute operations
>     :
>     >
>     >     "While some implementations may include
>     VK_QUEUE_SPARSE_BINDING_BIT
>     >     support in queue families that also include
>     >
>     >      graphics and compute support, other implementations may only
>     expose a
>     >     VK_QUEUE_SPARSE_BINDING_BIT-only queue
>     >
>     >      family."
>     >
>     >   So it can all be all a vm_bind engine that just does bind/unbind
>     >   operations.
>     >
>     >   But yes we need another engine for the immediate/non-sparse
>     operations.
>     >
>     >   -Lionel
>     >
>     >     
>     >
>     >       Daniel, any thoughts?
>     >
>     >       Niranjana
>     >
>     >       >Matt
>     >       >
>     >       >>
>     >       >> Sorry I noticed this late.
>     >       >>
>     >       >>
>     >       >> -Lionel
>     >       >>
>     >       >>

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [Intel-gfx] [RFC v3 3/3] drm/doc/rfc: VM_BIND uapi definition
  2022-06-07 10:27   ` Tvrtko Ursulin
@ 2022-06-07 19:37     ` Niranjana Vishwanathapura
  2022-06-08  7:17       ` Tvrtko Ursulin
  0 siblings, 1 reply; 121+ messages in thread
From: Niranjana Vishwanathapura @ 2022-06-07 19:37 UTC (permalink / raw)
  To: Tvrtko Ursulin
  Cc: intel-gfx, chris.p.wilson, thomas.hellstrom, dri-devel,
	daniel.vetter, christian.koenig

On Tue, Jun 07, 2022 at 11:27:14AM +0100, Tvrtko Ursulin wrote:
>
>On 17/05/2022 19:32, Niranjana Vishwanathapura wrote:
>>VM_BIND and related uapi definitions
>>
>>v2: Ensure proper kernel-doc formatting with cross references.
>>     Also add new uapi and documentation as per review comments
>>     from Daniel.
>>
>>Signed-off-by: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>
>>---
>>  Documentation/gpu/rfc/i915_vm_bind.h | 399 +++++++++++++++++++++++++++
>>  1 file changed, 399 insertions(+)
>>  create mode 100644 Documentation/gpu/rfc/i915_vm_bind.h
>>
>>diff --git a/Documentation/gpu/rfc/i915_vm_bind.h b/Documentation/gpu/rfc/i915_vm_bind.h
>>new file mode 100644
>>index 000000000000..589c0a009107
>>--- /dev/null
>>+++ b/Documentation/gpu/rfc/i915_vm_bind.h
>>@@ -0,0 +1,399 @@
>>+/* SPDX-License-Identifier: MIT */
>>+/*
>>+ * Copyright © 2022 Intel Corporation
>>+ */
>>+
>>+/**
>>+ * DOC: I915_PARAM_HAS_VM_BIND
>>+ *
>>+ * VM_BIND feature availability.
>>+ * See typedef drm_i915_getparam_t param.
>>+ */
>>+#define I915_PARAM_HAS_VM_BIND		57
>>+
>>+/**
>>+ * DOC: I915_VM_CREATE_FLAGS_USE_VM_BIND
>>+ *
>>+ * Flag to opt-in for VM_BIND mode of binding during VM creation.
>>+ * See struct drm_i915_gem_vm_control flags.
>>+ *
>>+ * A VM in VM_BIND mode will not support the older execbuff mode of binding.
>>+ * In VM_BIND mode, execbuff ioctl will not accept any execlist (ie., the
>>+ * &drm_i915_gem_execbuffer2.buffer_count must be 0).
>>+ * Also, &drm_i915_gem_execbuffer2.batch_start_offset and
>>+ * &drm_i915_gem_execbuffer2.batch_len must be 0.
>>+ * DRM_I915_GEM_EXECBUFFER_EXT_BATCH_ADDRESSES extension must be provided
>>+ * to pass in the batch buffer addresses.
>>+ *
>>+ * Additionally, I915_EXEC_NO_RELOC, I915_EXEC_HANDLE_LUT and
>>+ * I915_EXEC_BATCH_FIRST of &drm_i915_gem_execbuffer2.flags must be 0
>>+ * (not used) in VM_BIND mode. I915_EXEC_USE_EXTENSIONS flag must always be
>>+ * set (See struct drm_i915_gem_execbuffer_ext_batch_addresses).
>>+ * The buffers_ptr, buffer_count, batch_start_offset and batch_len fields
>>+ * of struct drm_i915_gem_execbuffer2 are also not used and must be 0.
>>+ */
>>+#define I915_VM_CREATE_FLAGS_USE_VM_BIND	(1 << 0)
>>+
>>+/**
>>+ * DOC: I915_CONTEXT_CREATE_FLAGS_LONG_RUNNING
>>+ *
>>+ * Flag to declare context as long running.
>>+ * See struct drm_i915_gem_context_create_ext flags.
>>+ *
>>+ * Usage of dma-fence expects that they complete in reasonable amount of time.
>>+ * Compute on the other hand can be long running. Hence it is not appropriate
>>+ * for compute contexts to export request completion dma-fence to user.
>>+ * The dma-fence usage will be limited to in-kernel consumption only.
>>+ * Compute contexts need to use user/memory fence.
>>+ *
>>+ * So, long running contexts do not support output fences. Hence,
>>+ * I915_EXEC_FENCE_OUT (See &drm_i915_gem_execbuffer2.flags and
>>+ * I915_EXEC_FENCE_SIGNAL (See &drm_i915_gem_exec_fence.flags) are expected
>>+ * to be not used.
>>+ *
>>+ * DRM_I915_GEM_WAIT ioctl call is also not supported for objects mapped
>>+ * to long running contexts.
>>+ */
>>+#define I915_CONTEXT_CREATE_FLAGS_LONG_RUNNING   (1u << 2)
>>+
>>+/* VM_BIND related ioctls */
>>+#define DRM_I915_GEM_VM_BIND		0x3d
>>+#define DRM_I915_GEM_VM_UNBIND		0x3e
>>+#define DRM_I915_GEM_WAIT_USER_FENCE	0x3f
>>+
>>+#define DRM_IOCTL_I915_GEM_VM_BIND		DRM_IOWR(DRM_COMMAND_BASE + DRM_I915_GEM_VM_BIND, struct drm_i915_gem_vm_bind)
>>+#define DRM_IOCTL_I915_GEM_VM_UNBIND		DRM_IOWR(DRM_COMMAND_BASE + DRM_I915_GEM_VM_UNBIND, struct drm_i915_gem_vm_bind)
>>+#define DRM_IOCTL_I915_GEM_WAIT_USER_FENCE	DRM_IOWR(DRM_COMMAND_BASE + DRM_I915_GEM_WAIT_USER_FENCE, struct drm_i915_gem_wait_user_fence)
>>+
>>+/**
>>+ * struct drm_i915_gem_vm_bind - VA to object mapping to bind.
>>+ *
>>+ * This structure is passed to VM_BIND ioctl and specifies the mapping of GPU
>>+ * virtual address (VA) range to the section of an object that should be bound
>>+ * in the device page table of the specified address space (VM).
>>+ * The VA range specified must be unique (ie., not currently bound) and can
>>+ * be mapped to whole object or a section of the object (partial binding).
>>+ * Multiple VA mappings can be created to the same section of the object
>>+ * (aliasing).
>>+ */
>>+struct drm_i915_gem_vm_bind {
>>+	/** @vm_id: VM (address space) id to bind */
>>+	__u32 vm_id;
>>+
>>+	/** @handle: Object handle */
>>+	__u32 handle;
>>+
>>+	/** @start: Virtual Address start to bind */
>>+	__u64 start;
>>+
>>+	/** @offset: Offset in object to bind */
>>+	__u64 offset;
>>+
>>+	/** @length: Length of mapping to bind */
>>+	__u64 length;
>
>Does it support, or should it, equivalent of EXEC_OBJECT_PAD_TO_SIZE? 
>Or if not userspace is expected to map the remainder of the space to a 
>dummy object? In which case would there be any alignment/padding 
>issues preventing the two bind to be placed next to each other?
>
>I ask because someone from the compute side asked me about a problem 
>with their strategy of dealing with overfetch and I suggested pad to 
>size.
>

Thanks Tvrtko,
I think we shouldn't be needing it. As with VM_BIND VA assignment
is completely pushed to userspace, no padding should be necessary
once the 'start' and 'size' alignment conditions are met.

I will add some documentation on alignment requirement here.
Generally, 'start' and 'size' should be 4K aligned. But, I think
when we have 64K lmem page sizes (dg2 and xehpsdv), they need to
be 64K aligned.

Niranjana

>Regards,
>
>Tvrtko
>
>>+
>>+	/**
>>+	 * @flags: Supported flags are,
>>+	 *
>>+	 * I915_GEM_VM_BIND_READONLY:
>>+	 * Mapping is read-only.
>>+	 *
>>+	 * I915_GEM_VM_BIND_CAPTURE:
>>+	 * Capture this mapping in the dump upon GPU error.
>>+	 */
>>+	__u64 flags;
>>+#define I915_GEM_VM_BIND_READONLY    (1 << 0)
>>+#define I915_GEM_VM_BIND_CAPTURE     (1 << 1)
>>+
>>+	/** @extensions: 0-terminated chain of extensions for this mapping. */
>>+	__u64 extensions;
>>+};
>>+
>>+/**
>>+ * struct drm_i915_gem_vm_unbind - VA to object mapping to unbind.
>>+ *
>>+ * This structure is passed to VM_UNBIND ioctl and specifies the GPU virtual
>>+ * address (VA) range that should be unbound from the device page table of the
>>+ * specified address space (VM). The specified VA range must match one of the
>>+ * mappings created with the VM_BIND ioctl. TLB is flushed upon unbind
>>+ * completion.
>>+ */
>>+struct drm_i915_gem_vm_unbind {
>>+	/** @vm_id: VM (address space) id to bind */
>>+	__u32 vm_id;
>>+
>>+	/** @rsvd: Reserved for future use; must be zero. */
>>+	__u32 rsvd;
>>+
>>+	/** @start: Virtual Address start to unbind */
>>+	__u64 start;
>>+
>>+	/** @length: Length of mapping to unbind */
>>+	__u64 length;
>>+
>>+	/** @flags: reserved for future usage, currently MBZ */
>>+	__u64 flags;
>>+
>>+	/** @extensions: 0-terminated chain of extensions for this mapping. */
>>+	__u64 extensions;
>>+};
>>+
>>+/**
>>+ * struct drm_i915_vm_bind_fence - An input or output fence for the vm_bind
>>+ * or the vm_unbind work.
>>+ *
>>+ * The vm_bind or vm_unbind aync worker will wait for input fence to signal
>>+ * before starting the binding or unbinding.
>>+ *
>>+ * The vm_bind or vm_unbind async worker will signal the returned output fence
>>+ * after the completion of binding or unbinding.
>>+ */
>>+struct drm_i915_vm_bind_fence {
>>+	/** @handle: User's handle for a drm_syncobj to wait on or signal. */
>>+	__u32 handle;
>>+
>>+	/**
>>+	 * @flags: Supported flags are,
>>+	 *
>>+	 * I915_VM_BIND_FENCE_WAIT:
>>+	 * Wait for the input fence before binding/unbinding
>>+	 *
>>+	 * I915_VM_BIND_FENCE_SIGNAL:
>>+	 * Return bind/unbind completion fence as output
>>+	 */
>>+	__u32 flags;
>>+#define I915_VM_BIND_FENCE_WAIT            (1<<0)
>>+#define I915_VM_BIND_FENCE_SIGNAL          (1<<1)
>>+#define __I915_VM_BIND_FENCE_UNKNOWN_FLAGS (-(I915_VM_BIND_FENCE_SIGNAL << 1))
>>+};
>>+
>>+/**
>>+ * struct drm_i915_vm_bind_ext_timeline_fences - Timeline fences for vm_bind
>>+ * and vm_unbind.
>>+ *
>>+ * This structure describes an array of timeline drm_syncobj and associated
>>+ * points for timeline variants of drm_syncobj. These timeline 'drm_syncobj's
>>+ * can be input or output fences (See struct drm_i915_vm_bind_fence).
>>+ */
>>+struct drm_i915_vm_bind_ext_timeline_fences {
>>+#define I915_VM_BIND_EXT_timeline_FENCES	0
>>+	/** @base: Extension link. See struct i915_user_extension. */
>>+	struct i915_user_extension base;
>>+
>>+	/**
>>+	 * @fence_count: Number of elements in the @handles_ptr & @value_ptr
>>+	 * arrays.
>>+	 */
>>+	__u64 fence_count;
>>+
>>+	/**
>>+	 * @handles_ptr: Pointer to an array of struct drm_i915_vm_bind_fence
>>+	 * of length @fence_count.
>>+	 */
>>+	__u64 handles_ptr;
>>+
>>+	/**
>>+	 * @values_ptr: Pointer to an array of u64 values of length
>>+	 * @fence_count.
>>+	 * Values must be 0 for a binary drm_syncobj. A Value of 0 for a
>>+	 * timeline drm_syncobj is invalid as it turns a drm_syncobj into a
>>+	 * binary one.
>>+	 */
>>+	__u64 values_ptr;
>>+};
>>+
>>+/**
>>+ * struct drm_i915_vm_bind_user_fence - An input or output user fence for the
>>+ * vm_bind or the vm_unbind work.
>>+ *
>>+ * The vm_bind or vm_unbind aync worker will wait for the input fence (value at
>>+ * @addr to become equal to @val) before starting the binding or unbinding.
>>+ *
>>+ * The vm_bind or vm_unbind async worker will signal the output fence after
>>+ * the completion of binding or unbinding by writing @val to memory location at
>>+ * @addr
>>+ */
>>+struct drm_i915_vm_bind_user_fence {
>>+	/** @addr: User/Memory fence qword aligned process virtual address */
>>+	__u64 addr;
>>+
>>+	/** @val: User/Memory fence value to be written after bind completion */
>>+	__u64 val;
>>+
>>+	/**
>>+	 * @flags: Supported flags are,
>>+	 *
>>+	 * I915_VM_BIND_USER_FENCE_WAIT:
>>+	 * Wait for the input fence before binding/unbinding
>>+	 *
>>+	 * I915_VM_BIND_USER_FENCE_SIGNAL:
>>+	 * Return bind/unbind completion fence as output
>>+	 */
>>+	__u32 flags;
>>+#define I915_VM_BIND_USER_FENCE_WAIT            (1<<0)
>>+#define I915_VM_BIND_USER_FENCE_SIGNAL          (1<<1)
>>+#define __I915_VM_BIND_USER_FENCE_UNKNOWN_FLAGS \
>>+	(-(I915_VM_BIND_USER_FENCE_SIGNAL << 1))
>>+};
>>+
>>+/**
>>+ * struct drm_i915_vm_bind_ext_user_fence - User/memory fences for vm_bind
>>+ * and vm_unbind.
>>+ *
>>+ * These user fences can be input or output fences
>>+ * (See struct drm_i915_vm_bind_user_fence).
>>+ */
>>+struct drm_i915_vm_bind_ext_user_fence {
>>+#define I915_VM_BIND_EXT_USER_FENCES	1
>>+	/** @base: Extension link. See struct i915_user_extension. */
>>+	struct i915_user_extension base;
>>+
>>+	/** @fence_count: Number of elements in the @user_fence_ptr array. */
>>+	__u64 fence_count;
>>+
>>+	/**
>>+	 * @user_fence_ptr: Pointer to an array of
>>+	 * struct drm_i915_vm_bind_user_fence of length @fence_count.
>>+	 */
>>+	__u64 user_fence_ptr;
>>+};
>>+
>>+/**
>>+ * struct drm_i915_gem_execbuffer_ext_batch_addresses - Array of batch buffer
>>+ * gpu virtual addresses.
>>+ *
>>+ * In the execbuff ioctl (See struct drm_i915_gem_execbuffer2), this extension
>>+ * must always be appended in the VM_BIND mode and it will be an error to
>>+ * append this extension in older non-VM_BIND mode.
>>+ */
>>+struct drm_i915_gem_execbuffer_ext_batch_addresses {
>>+#define DRM_I915_GEM_EXECBUFFER_EXT_BATCH_ADDRESSES	1
>>+	/** @base: Extension link. See struct i915_user_extension. */
>>+	struct i915_user_extension base;
>>+
>>+	/** @count: Number of addresses in the addr array. */
>>+	__u32 count;
>>+
>>+	/** @addr: An array of batch gpu virtual addresses. */
>>+	__u64 addr[0];
>>+};
>>+
>>+/**
>>+ * struct drm_i915_gem_execbuffer_ext_user_fence - First level batch completion
>>+ * signaling extension.
>>+ *
>>+ * This extension allows user to attach a user fence (@addr, @value pair) to an
>>+ * execbuf to be signaled by the command streamer after the completion of first
>>+ * level batch, by writing the @value at specified @addr and triggering an
>>+ * interrupt.
>>+ * User can either poll for this user fence to signal or can also wait on it
>>+ * with i915_gem_wait_user_fence ioctl.
>>+ * This is very much usefaul for long running contexts where waiting on dma-fence
>>+ * by user (like i915_gem_wait ioctl) is not supported.
>>+ */
>>+struct drm_i915_gem_execbuffer_ext_user_fence {
>>+#define DRM_I915_GEM_EXECBUFFER_EXT_USER_FENCE		2
>>+	/** @base: Extension link. See struct i915_user_extension. */
>>+	struct i915_user_extension base;
>>+
>>+	/**
>>+	 * @addr: User/Memory fence qword aligned GPU virtual address.
>>+	 *
>>+	 * Address has to be a valid GPU virtual address at the time of
>>+	 * first level batch completion.
>>+	 */
>>+	__u64 addr;
>>+
>>+	/**
>>+	 * @value: User/Memory fence Value to be written to above address
>>+	 * after first level batch completes.
>>+	 */
>>+	__u64 value;
>>+
>>+	/** @rsvd: Reserved for future extensions, MBZ */
>>+	__u64 rsvd;
>>+};
>>+
>>+/**
>>+ * struct drm_i915_gem_create_ext_vm_private - Extension to make the object
>>+ * private to the specified VM.
>>+ *
>>+ * See struct drm_i915_gem_create_ext.
>>+ */
>>+struct drm_i915_gem_create_ext_vm_private {
>>+#define I915_GEM_CREATE_EXT_VM_PRIVATE		2
>>+	/** @base: Extension link. See struct i915_user_extension. */
>>+	struct i915_user_extension base;
>>+
>>+	/** @vm_id: Id of the VM to which the object is private */
>>+	__u32 vm_id;
>>+};
>>+
>>+/**
>>+ * struct drm_i915_gem_wait_user_fence - Wait on user/memory fence.
>>+ *
>>+ * User/Memory fence can be woken up either by:
>>+ *
>>+ * 1. GPU context indicated by @ctx_id, or,
>>+ * 2. Kerrnel driver async worker upon I915_UFENCE_WAIT_SOFT.
>>+ *    @ctx_id is ignored when this flag is set.
>>+ *
>>+ * Wakeup condition is,
>>+ * ``((*addr & mask) op (value & mask))``
>>+ *
>>+ * See :ref:`Documentation/driver-api/dma-buf.rst <indefinite_dma_fences>`
>>+ */
>>+struct drm_i915_gem_wait_user_fence {
>>+	/** @extensions: Zero-terminated chain of extensions. */
>>+	__u64 extensions;
>>+
>>+	/** @addr: User/Memory fence address */
>>+	__u64 addr;
>>+
>>+	/** @ctx_id: Id of the Context which will signal the fence. */
>>+	__u32 ctx_id;
>>+
>>+	/** @op: Wakeup condition operator */
>>+	__u16 op;
>>+#define I915_UFENCE_WAIT_EQ      0
>>+#define I915_UFENCE_WAIT_NEQ     1
>>+#define I915_UFENCE_WAIT_GT      2
>>+#define I915_UFENCE_WAIT_GTE     3
>>+#define I915_UFENCE_WAIT_LT      4
>>+#define I915_UFENCE_WAIT_LTE     5
>>+#define I915_UFENCE_WAIT_BEFORE  6
>>+#define I915_UFENCE_WAIT_AFTER   7
>>+
>>+	/**
>>+	 * @flags: Supported flags are,
>>+	 *
>>+	 * I915_UFENCE_WAIT_SOFT:
>>+	 *
>>+	 * To be woken up by i915 driver async worker (not by GPU).
>>+	 *
>>+	 * I915_UFENCE_WAIT_ABSTIME:
>>+	 *
>>+	 * Wait timeout specified as absolute time.
>>+	 */
>>+	__u16 flags;
>>+#define I915_UFENCE_WAIT_SOFT    0x1
>>+#define I915_UFENCE_WAIT_ABSTIME 0x2
>>+
>>+	/** @value: Wakeup value */
>>+	__u64 value;
>>+
>>+	/** @mask: Wakeup mask */
>>+	__u64 mask;
>>+#define I915_UFENCE_WAIT_U8     0xffu
>>+#define I915_UFENCE_WAIT_U16    0xffffu
>>+#define I915_UFENCE_WAIT_U32    0xfffffffful
>>+#define I915_UFENCE_WAIT_U64    0xffffffffffffffffull
>>+
>>+	/**
>>+	 * @timeout: Wait timeout in nanoseconds.
>>+	 *
>>+	 * If I915_UFENCE_WAIT_ABSTIME flag is set, then time timeout is the
>>+	 * absolute time in nsec.
>>+	 */
>>+	__s64 timeout;
>>+};

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [Intel-gfx] [RFC v3 3/3] drm/doc/rfc: VM_BIND uapi definition
  2022-06-07 10:42               ` Tvrtko Ursulin
@ 2022-06-07 21:25                 ` Niranjana Vishwanathapura
  2022-06-08  7:34                   ` Tvrtko Ursulin
  0 siblings, 1 reply; 121+ messages in thread
From: Niranjana Vishwanathapura @ 2022-06-07 21:25 UTC (permalink / raw)
  To: Tvrtko Ursulin
  Cc: Zanoni, Paulo R, intel-gfx, dri-devel, Hellstrom, Thomas, Wilson,
	Chris P, Vetter, Daniel, christian.koenig

On Tue, Jun 07, 2022 at 11:42:08AM +0100, Tvrtko Ursulin wrote:
>
>On 03/06/2022 07:53, Niranjana Vishwanathapura wrote:
>>On Wed, Jun 01, 2022 at 10:08:35PM -0700, Niranjana Vishwanathapura wrote:
>>>On Wed, Jun 01, 2022 at 11:27:17AM +0200, Daniel Vetter wrote:
>>>>On Wed, 1 Jun 2022 at 11:03, Dave Airlie <airlied@gmail.com> wrote:
>>>>>
>>>>>On Tue, 24 May 2022 at 05:20, Niranjana Vishwanathapura
>>>>><niranjana.vishwanathapura@intel.com> wrote:
>>>>>>
>>>>>>On Thu, May 19, 2022 at 04:07:30PM -0700, Zanoni, Paulo R wrote:
>>>>>>>On Tue, 2022-05-17 at 11:32 -0700, Niranjana Vishwanathapura wrote:
>>>>>>>> VM_BIND and related uapi definitions
>>>>>>>>
>>>>>>>> v2: Ensure proper kernel-doc formatting with cross references.
>>>>>>>>     Also add new uapi and documentation as per review comments
>>>>>>>>     from Daniel.
>>>>>>>>
>>>>>>>> Signed-off-by: Niranjana Vishwanathapura 
>>>>>><niranjana.vishwanathapura@intel.com>
>>>>>>>> ---
>>>>>>>>  Documentation/gpu/rfc/i915_vm_bind.h | 399 
>>>>>>+++++++++++++++++++++++++++
>>>>>>>>  1 file changed, 399 insertions(+)
>>>>>>>>  create mode 100644 Documentation/gpu/rfc/i915_vm_bind.h
>>>>>>>>
>>>>>>>> diff --git a/Documentation/gpu/rfc/i915_vm_bind.h 
>>>>>>b/Documentation/gpu/rfc/i915_vm_bind.h
>>>>>>>> new file mode 100644
>>>>>>>> index 000000000000..589c0a009107
>>>>>>>> --- /dev/null
>>>>>>>> +++ b/Documentation/gpu/rfc/i915_vm_bind.h
>>>>>>>> @@ -0,0 +1,399 @@
>>>>>>>> +/* SPDX-License-Identifier: MIT */
>>>>>>>> +/*
>>>>>>>> + * Copyright © 2022 Intel Corporation
>>>>>>>> + */
>>>>>>>> +
>>>>>>>> +/**
>>>>>>>> + * DOC: I915_PARAM_HAS_VM_BIND
>>>>>>>> + *
>>>>>>>> + * VM_BIND feature availability.
>>>>>>>> + * See typedef drm_i915_getparam_t param.
>>>>>>>> + */
>>>>>>>> +#define I915_PARAM_HAS_VM_BIND               57
>>>>>>>> +
>>>>>>>> +/**
>>>>>>>> + * DOC: I915_VM_CREATE_FLAGS_USE_VM_BIND
>>>>>>>> + *
>>>>>>>> + * Flag to opt-in for VM_BIND mode of binding during VM creation.
>>>>>>>> + * See struct drm_i915_gem_vm_control flags.
>>>>>>>> + *
>>>>>>>> + * A VM in VM_BIND mode will not support the older 
>>>>>>execbuff mode of binding.
>>>>>>>> + * In VM_BIND mode, execbuff ioctl will not accept any 
>>>>>>execlist (ie., the
>>>>>>>> + * &drm_i915_gem_execbuffer2.buffer_count must be 0).
>>>>>>>> + * Also, &drm_i915_gem_execbuffer2.batch_start_offset and
>>>>>>>> + * &drm_i915_gem_execbuffer2.batch_len must be 0.
>>>>>>>> + * DRM_I915_GEM_EXECBUFFER_EXT_BATCH_ADDRESSES extension 
>>>>>>must be provided
>>>>>>>> + * to pass in the batch buffer addresses.
>>>>>>>> + *
>>>>>>>> + * Additionally, I915_EXEC_NO_RELOC, I915_EXEC_HANDLE_LUT and
>>>>>>>> + * I915_EXEC_BATCH_FIRST of 
>>>>>>&drm_i915_gem_execbuffer2.flags must be 0
>>>>>>>> + * (not used) in VM_BIND mode. I915_EXEC_USE_EXTENSIONS 
>>>>>>flag must always be
>>>>>>>> + * set (See struct drm_i915_gem_execbuffer_ext_batch_addresses).
>>>>>>>> + * The buffers_ptr, buffer_count, batch_start_offset and 
>>>>>>batch_len fields
>>>>>>>> + * of struct drm_i915_gem_execbuffer2 are also not used 
>>>>>>and must be 0.
>>>>>>>> + */
>>>>>>>
>>>>>>>From that description, it seems we have:
>>>>>>>
>>>>>>>struct drm_i915_gem_execbuffer2 {
>>>>>>>        __u64 buffers_ptr;              -> must be 0 (new)
>>>>>>>        __u32 buffer_count;             -> must be 0 (new)
>>>>>>>        __u32 batch_start_offset;       -> must be 0 (new)
>>>>>>>        __u32 batch_len;                -> must be 0 (new)
>>>>>>>        __u32 DR1;                      -> must be 0 (old)
>>>>>>>        __u32 DR4;                      -> must be 0 (old)
>>>>>>>        __u32 num_cliprects; (fences)   -> must be 0 since 
>>>>>>using extensions
>>>>>>>        __u64 cliprects_ptr; (fences, extensions) -> 
>>>>>>contains an actual pointer!
>>>>>>>        __u64 flags;                    -> some flags must be 0 (new)
>>>>>>>        __u64 rsvd1; (context info)     -> repurposed field (old)
>>>>>>>        __u64 rsvd2;                    -> unused
>>>>>>>};
>>>>>>>
>>>>>>>Based on that, why can't we just get drm_i915_gem_execbuffer3 instead
>>>>>>>of adding even more complexity to an already abused interface? While
>>>>>>>the Vulkan-like extension thing is really nice, I don't think what
>>>>>>>we're doing here is extending the ioctl usage, we're completely
>>>>>>>changing how the base struct should be interpreted based on 
>>>>>>how the VM
>>>>>>>was created (which is an entirely different ioctl).
>>>>>>>
>>>>>>>From Rusty Russel's API Design grading, drm_i915_gem_execbuffer2 is
>>>>>>>already at -6 without these changes. I think after vm_bind we'll need
>>>>>>>to create a -11 entry just to deal with this ioctl.
>>>>>>>
>>>>>>
>>>>>>The only change here is removing the execlist support for VM_BIND
>>>>>>mode (other than natual extensions).
>>>>>>Adding a new execbuffer3 was considered, but I think we need 
>>>>>>to be careful
>>>>>>with that as that goes beyond the VM_BIND support, including 
>>>>>>any future
>>>>>>requirements (as we don't want an execbuffer4 after VM_BIND).
>>>>>
>>>>>Why not? it's not like adding extensions here is really that different
>>>>>than adding new ioctls.
>>>>>
>>>>>I definitely think this deserves an execbuffer3 without even
>>>>>considering future requirements. Just  to burn down the old
>>>>>requirements and pointless fields.
>>>>>
>>>>>Make execbuffer3 be vm bind only, no relocs, no legacy bits, leave the
>>>>>older sw on execbuf2 for ever.
>>>>
>>>>I guess another point in favour of execbuf3 would be that it's less
>>>>midlayer. If we share the entry point then there's quite a few vfuncs
>>>>needed to cleanly split out the vm_bind paths from the legacy
>>>>reloc/softping paths.
>>>>
>>>>If we invert this and do execbuf3, then there's the existing ioctl
>>>>vfunc, and then we share code (where it even makes sense, probably
>>>>request setup/submit need to be shared, anything else is probably
>>>>cleaner to just copypaste) with the usual helper approach.
>>>>
>>>>Also that would guarantee that really none of the old concepts like
>>>>i915_active on the vma or vma open counts and all that stuff leaks
>>>>into the new vm_bind execbuf.
>>>>
>>>>Finally I also think that copypasting would make backporting easier,
>>>>or at least more flexible, since it should make it easier to have the
>>>>upstream vm_bind co-exist with all the other things we have. Without
>>>>huge amounts of conflicts (or at least much less) that pushing a pile
>>>>of vfuncs into the existing code would cause.
>>>>
>>>>So maybe we should do this?
>>>
>>>Thanks Dave, Daniel.
>>>There are a few things that will be common between execbuf2 and
>>>execbuf3, like request setup/submit (as you said), fence handling 
>>>(timeline fences, fence array, composite fences), engine 
>>>selection,
>>>etc. Also, many of the 'flags' will be there in execbuf3 also (but
>>>bit position will differ).
>>>But I guess these should be fine as the suggestion here is to
>>>copy-paste the execbuff code and having a shared code where possible.
>>>Besides, we can stop supporting some older feature in execbuff3
>>>(like fence array in favor of newer timeline fences), which will
>>>further reduce common code.
>>>
>>>Ok, I will update this series by adding execbuf3 and send out soon.
>>>
>>
>>Does this sound reasonable?
>>
>>struct drm_i915_gem_execbuffer3 {
>>        __u32 ctx_id;        /* previously execbuffer2.rsvd1 */
>>
>>        __u32 batch_count;
>>        __u64 batch_addr_ptr;    /* Pointer to an array of batch gpu 
>>virtual addresses */
>
>Casual stumble upon..
>
>Alternatively you could embed N pointers to make life a bit easier for 
>both userspace and kernel side. Yes, but then "N batch buffers should 
>be enough for everyone" problem.. :)
>

Thanks Tvrtko,
Yes, hence the batch_addr_ptr.

>>
>>        __u64 flags;
>>#define I915_EXEC3_RING_MASK              (0x3f)
>>#define I915_EXEC3_DEFAULT                (0<<0)
>>#define I915_EXEC3_RENDER                 (1<<0)
>>#define I915_EXEC3_BSD                    (2<<0)
>>#define I915_EXEC3_BLT                    (3<<0)
>>#define I915_EXEC3_VEBOX                  (4<<0)
>>
>>#define I915_EXEC3_SECURE               (1<<6)
>>#define I915_EXEC3_IS_PINNED            (1<<7)
>>
>>#define I915_EXEC3_BSD_SHIFT     (8)
>>#define I915_EXEC3_BSD_MASK      (3 << I915_EXEC3_BSD_SHIFT)
>>#define I915_EXEC3_BSD_DEFAULT   (0 << I915_EXEC3_BSD_SHIFT)
>>#define I915_EXEC3_BSD_RING1     (1 << I915_EXEC3_BSD_SHIFT)
>>#define I915_EXEC3_BSD_RING2     (2 << I915_EXEC3_BSD_SHIFT)
>
>I'd suggest legacy engine selection is unwanted, especially not with 
>the convoluted BSD1/2 flags. Can we just require context with engine 
>map and index? Or if default context has to be supported then I'd 
>suggest ...class_instance for that mode.
>

Ok, I will be happy to remove it and only support contexts with
engine map, if UMDs agree on that.

>>#define I915_EXEC3_FENCE_IN             (1<<10)
>>#define I915_EXEC3_FENCE_OUT            (1<<11)
>>#define I915_EXEC3_FENCE_SUBMIT         (1<<12)
>
>People are likely to object to submit fence since generic mechanism to 
>align submissions was rejected.
>

Ok, again, I can remove it if UMDs are ok with it.

>>
>>        __u64 in_out_fence;        /* previously execbuffer2.rsvd2 */
>
>New ioctl you can afford dedicated fields.
>

Yes, but as I asked below, I am not sure if we need this or the
timeline fence arry extension we have is good enough.

>In any case I suggest you involve UMD folks in designing it.
>

Yah.
Paulo, Lionel, Jason, Daniel, can you comment on these regarding
what will UMD need in execbuf3 and what can be removed?

Thanks,
Niranjana

>Regards,
>
>Tvrtko
>
>>
>>        __u64 extensions;        /* currently only for 
>>DRM_I915_GEM_EXECBUFFER_EXT_TIMELINE_FENCES */
>>};
>>
>>With this, user can pass in batch addresses and count directly,
>>instead of as an extension (as this rfc series was proposing).
>>
>>I have removed many of the flags which were either legacy or not
>>applicable to BM_BIND mode.
>>I have also removed fence array support (execbuffer2.cliprects_ptr)
>>as we have timeline fence array support. Is that fine?
>>Do we still need FENCE_IN/FENCE_OUT/FENCE_SUBMIT support?
>>
>>Any thing else needs to be added or removed?
>>
>>Niranjana
>>
>>>Niranjana
>>>
>>>>-Daniel
>>>>-- 
>>>>Daniel Vetter
>>>>Software Engineer, Intel Corporation
>>>>http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [Intel-gfx] [RFC v3 1/3] drm/doc/rfc: VM_BIND feature design document
  2022-06-07 18:18                   ` Niranjana Vishwanathapura
  (?)
@ 2022-06-07 21:32                   ` Niranjana Vishwanathapura
  2022-06-08  7:33                     ` Tvrtko Ursulin
  -1 siblings, 1 reply; 121+ messages in thread
From: Niranjana Vishwanathapura @ 2022-06-07 21:32 UTC (permalink / raw)
  To: Jason Ekstrand
  Cc: Intel GFX, Maling list - DRI developers, Thomas Hellstrom,
	Chris Wilson, Daniel Vetter, Christian König

On Tue, Jun 07, 2022 at 11:18:11AM -0700, Niranjana Vishwanathapura wrote:
>On Tue, Jun 07, 2022 at 12:12:03PM -0500, Jason Ekstrand wrote:
>>  On Fri, Jun 3, 2022 at 6:52 PM Niranjana Vishwanathapura
>>  <niranjana.vishwanathapura@intel.com> wrote:
>>
>>    On Fri, Jun 03, 2022 at 10:20:25AM +0300, Lionel Landwerlin wrote:
>>    >   On 02/06/2022 23:35, Jason Ekstrand wrote:
>>    >
>>    >     On Thu, Jun 2, 2022 at 3:11 PM Niranjana Vishwanathapura
>>    >     <niranjana.vishwanathapura@intel.com> wrote:
>>    >
>>    >       On Wed, Jun 01, 2022 at 01:28:36PM -0700, Matthew Brost wrote:
>>    >       >On Wed, Jun 01, 2022 at 05:25:49PM +0300, Lionel Landwerlin
>>    wrote:
>>    >       >> On 17/05/2022 21:32, Niranjana Vishwanathapura wrote:
>>    >       >> > +VM_BIND/UNBIND ioctl will immediately start
>>    binding/unbinding
>>    >       the mapping in an
>>    >       >> > +async worker. The binding and unbinding will work like a
>>    special
>>    >       GPU engine.
>>    >       >> > +The binding and unbinding operations are serialized and
>>    will
>>    >       wait on specified
>>    >       >> > +input fences before the operation and will signal the
>>    output
>>    >       fences upon the
>>    >       >> > +completion of the operation. Due to serialization,
>>    completion of
>>    >       an operation
>>    >       >> > +will also indicate that all previous operations are also
>>    >       complete.
>>    >       >>
>>    >       >> I guess we should avoid saying "will immediately start
>>    >       binding/unbinding" if
>>    >       >> there are fences involved.
>>    >       >>
>>    >       >> And the fact that it's happening in an async worker seem to
>>    imply
>>    >       it's not
>>    >       >> immediate.
>>    >       >>
>>    >
>>    >       Ok, will fix.
>>    >       This was added because in earlier design binding was deferred
>>    until
>>    >       next execbuff.
>>    >       But now it is non-deferred (immediate in that sense). But yah,
>>    this is
>>    >       confusing
>>    >       and will fix it.
>>    >
>>    >       >>
>>    >       >> I have a question on the behavior of the bind operation when
>>    no
>>    >       input fence
>>    >       >> is provided. Let say I do :
>>    >       >>
>>    >       >> VM_BIND (out_fence=fence1)
>>    >       >>
>>    >       >> VM_BIND (out_fence=fence2)
>>    >       >>
>>    >       >> VM_BIND (out_fence=fence3)
>>    >       >>
>>    >       >>
>>    >       >> In what order are the fences going to be signaled?
>>    >       >>
>>    >       >> In the order of VM_BIND ioctls? Or out of order?
>>    >       >>
>>    >       >> Because you wrote "serialized I assume it's : in order
>>    >       >>
>>    >
>>    >       Yes, in the order of VM_BIND/UNBIND ioctls. Note that bind and
>>    unbind
>>    >       will use
>>    >       the same queue and hence are ordered.
>>    >
>>    >       >>
>>    >       >> One thing I didn't realize is that because we only get one
>>    >       "VM_BIND" engine,
>>    >       >> there is a disconnect from the Vulkan specification.
>>    >       >>
>>    >       >> In Vulkan VM_BIND operations are serialized but per engine.
>>    >       >>
>>    >       >> So you could have something like this :
>>    >       >>
>>    >       >> VM_BIND (engine=rcs0, in_fence=fence1, out_fence=fence2)
>>    >       >>
>>    >       >> VM_BIND (engine=ccs0, in_fence=fence3, out_fence=fence4)
>>    >       >>
>>    >       >>
>>    >       >> fence1 is not signaled
>>    >       >>
>>    >       >> fence3 is signaled
>>    >       >>
>>    >       >> So the second VM_BIND will proceed before the first VM_BIND.
>>    >       >>
>>    >       >>
>>    >       >> I guess we can deal with that scenario in userspace by doing
>>    the
>>    >       wait
>>    >       >> ourselves in one thread per engines.
>>    >       >>
>>    >       >> But then it makes the VM_BIND input fences useless.
>>    >       >>
>>    >       >>
>>    >       >> Daniel : what do you think? Should be rework this or just
>>    deal with
>>    >       wait
>>    >       >> fences in userspace?
>>    >       >>
>>    >       >
>>    >       >My opinion is rework this but make the ordering via an engine
>>    param
>>    >       optional.
>>    >       >
>>    >       >e.g. A VM can be configured so all binds are ordered within the
>>    VM
>>    >       >
>>    >       >e.g. A VM can be configured so all binds accept an engine
>>    argument
>>    >       (in
>>    >       >the case of the i915 likely this is a gem context handle) and
>>    binds
>>    >       >ordered with respect to that engine.
>>    >       >
>>    >       >This gives UMDs options as the later likely consumes more KMD
>>    >       resources
>>    >       >so if a different UMD can live with binds being ordered within
>>    the VM
>>    >       >they can use a mode consuming less resources.
>>    >       >
>>    >
>>    >       I think we need to be careful here if we are looking for some
>>    out of
>>    >       (submission) order completion of vm_bind/unbind.
>>    >       In-order completion means, in a batch of binds and unbinds to be
>>    >       completed in-order, user only needs to specify in-fence for the
>>    >       first bind/unbind call and the our-fence for the last
>>    bind/unbind
>>    >       call. Also, the VA released by an unbind call can be re-used by
>>    >       any subsequent bind call in that in-order batch.
>>    >
>>    >       These things will break if binding/unbinding were to be allowed
>>    to
>>    >       go out of order (of submission) and user need to be extra
>>    careful
>>    >       not to run into pre-mature triggereing of out-fence and bind
>>    failing
>>    >       as VA is still in use etc.
>>    >
>>    >       Also, VM_BIND binds the provided mapping on the specified
>>    address
>>    >       space
>>    >       (VM). So, the uapi is not engine/context specific.
>>    >
>>    >       We can however add a 'queue' to the uapi which can be one from
>>    the
>>    >       pre-defined queues,
>>    >       I915_VM_BIND_QUEUE_0
>>    >       I915_VM_BIND_QUEUE_1
>>    >       ...
>>    >       I915_VM_BIND_QUEUE_(N-1)
>>    >
>>    >       KMD will spawn an async work queue for each queue which will
>>    only
>>    >       bind the mappings on that queue in the order of submission.
>>    >       User can assign the queue to per engine or anything like that.
>>    >
>>    >       But again here, user need to be careful and not deadlock these
>>    >       queues with circular dependency of fences.
>>    >
>>    >       I prefer adding this later an as extension based on whether it
>>    >       is really helping with the implementation.
>>    >
>>    >     I can tell you right now that having everything on a single
>>    in-order
>>    >     queue will not get us the perf we want.  What vulkan really wants
>>    is one
>>    >     of two things:
>>    >      1. No implicit ordering of VM_BIND ops.  They just happen in
>>    whatever
>>    >     their dependencies are resolved and we ensure ordering ourselves
>>    by
>>    >     having a syncobj in the VkQueue.
>>    >      2. The ability to create multiple VM_BIND queues.  We need at
>>    least 2
>>    >     but I don't see why there needs to be a limit besides the limits
>>    the
>>    >     i915 API already has on the number of engines.  Vulkan could
>>    expose
>>    >     multiple sparse binding queues to the client if it's not
>>    arbitrarily
>>    >     limited.
>>
>>    Thanks Jason, Lionel.
>>
>>    Jason, what are you referring to when you say "limits the i915 API
>>    already
>>    has on the number of engines"? I am not sure if there is such an uapi
>>    today.
>>
>>  There's a limit of something like 64 total engines today based on the
>>  number of bits we can cram into the exec flags in execbuffer2.  I think
>>  someone had an extended version that allowed more but I ripped it out
>>  because no one was using it.  Of course, execbuffer3 might not have that
>>  problem at all.
>>
>
>Thanks Jason.
>Ok, I am not sure which exec flag is that, but yah, execbuffer3 probably
>will not have this limiation. So, we need to define a VM_BIND_MAX_QUEUE
>and somehow export it to user (I am thinking of embedding it in
>I915_PARAM_HAS_VM_BIND. bits[0]->HAS_VM_BIND, bits[1-3]->'n' meaning 2^n
>queues.

Ah, I think you are waking about I915_EXEC_RING_MASK (0x3f) which execbuf3
will also have. So, we can simply define in vm_bind/unbind structures,

#define I915_VM_BIND_MAX_QUEUE   64
        __u32 queue;

I think that will keep things simple.

Niranjana

>
>>    I am trying to see how many queues we need and don't want it to be
>>    arbitrarily
>>    large and unduely blow up memory usage and complexity in i915 driver.
>>
>>  I expect a Vulkan driver to use at most 2 in the vast majority of cases. I
>>  could imagine a client wanting to create more than 1 sparse queue in which
>>  case, it'll be N+1 but that's unlikely.  As far as complexity goes, once
>>  you allow two, I don't think the complexity is going up by allowing N.  As
>>  for memory usage, creating more queues means more memory.  That's a
>>  trade-off that userspace can make.  Again, the expected number here is 1
>>  or 2 in the vast majority of cases so I don't think you need to worry.
>
>Ok, will start with n=3 meaning 8 queues.
>That would require us create 8 workqueues.
>We can change 'n' later if required.
>
>Niranjana
>
>>
>>    >     Why?  Because Vulkan has two basic kind of bind operations and we
>>    don't
>>    >     want any dependencies between them:
>>    >      1. Immediate.  These happen right after BO creation or maybe as
>>    part of
>>    >     vkBindImageMemory() or VkBindBufferMemory().  These don't happen
>>    on a
>>    >     queue and we don't want them serialized with anything.  To
>>    synchronize
>>    >     with submit, we'll have a syncobj in the VkDevice which is
>>    signaled by
>>    >     all immediate bind operations and make submits wait on it.
>>    >      2. Queued (sparse): These happen on a VkQueue which may be the
>>    same as
>>    >     a render/compute queue or may be its own queue.  It's up to us
>>    what we
>>    >     want to advertise.  From the Vulkan API PoV, this is like any
>>    other
>>    >     queue.  Operations on it wait on and signal semaphores.  If we
>>    have a
>>    >     VM_BIND engine, we'd provide syncobjs to wait and signal just like
>>    we do
>>    >     in execbuf().
>>    >     The important thing is that we don't want one type of operation to
>>    block
>>    >     on the other.  If immediate binds are blocking on sparse binds,
>>    it's
>>    >     going to cause over-synchronization issues.
>>    >     In terms of the internal implementation, I know that there's going
>>    to be
>>    >     a lock on the VM and that we can't actually do these things in
>>    >     parallel.  That's fine.  Once the dma_fences have signaled and
>>    we're
>>
>>    Thats correct. It is like a single VM_BIND engine with multiple queues
>>    feeding to it.
>>
>>  Right.  As long as the queues themselves are independent and can block on
>>  dma_fences without holding up other queues, I think we're fine.
>>
>>    >     unblocked to do the bind operation, I don't care if there's a bit
>>    of
>>    >     synchronization due to locking.  That's expected.  What we can't
>>    afford
>>    >     to have is an immediate bind operation suddenly blocking on a
>>    sparse
>>    >     operation which is blocked on a compute job that's going to run
>>    for
>>    >     another 5ms.
>>
>>    As the VM_BIND queue is per VM, VM_BIND on one VM doesn't block the
>>    VM_BIND
>>    on other VMs. I am not sure about usecases here, but just wanted to
>>    clarify.
>>
>>  Yes, that's what I would expect.
>>  --Jason
>>
>>    Niranjana
>>
>>    >     For reference, Windows solves this by allowing arbitrarily many
>>    paging
>>    >     queues (what they call a VM_BIND engine/queue).  That design works
>>    >     pretty well and solves the problems in question.  Again, we could
>>    just
>>    >     make everything out-of-order and require using syncobjs to order
>>    things
>>    >     as userspace wants. That'd be fine too.
>>    >     One more note while I'm here: danvet said something on IRC about
>>    VM_BIND
>>    >     queues waiting for syncobjs to materialize.  We don't really
>>    want/need
>>    >     this.  We already have all the machinery in userspace to handle
>>    >     wait-before-signal and waiting for syncobj fences to materialize
>>    and
>>    >     that machinery is on by default.  It would actually take MORE work
>>    in
>>    >     Mesa to turn it off and take advantage of the kernel being able to
>>    wait
>>    >     for syncobjs to materialize.  Also, getting that right is
>>    ridiculously
>>    >     hard and I really don't want to get it wrong in kernel 
>>space.     When we
>>    >     do memory fences, wait-before-signal will be a thing.  We don't
>>    need to
>>    >     try and make it a thing for syncobj.
>>    >     --Jason
>>    >
>>    >   Thanks Jason,
>>    >
>>    >   I missed the bit in the Vulkan spec that we're allowed to have a
>>    sparse
>>    >   queue that does not implement either graphics or compute operations
>>    :
>>    >
>>    >     "While some implementations may include
>>    VK_QUEUE_SPARSE_BINDING_BIT
>>    >     support in queue families that also include
>>    >
>>    >      graphics and compute support, other implementations may only
>>    expose a
>>    >     VK_QUEUE_SPARSE_BINDING_BIT-only queue
>>    >
>>    >      family."
>>    >
>>    >   So it can all be all a vm_bind engine that just does bind/unbind
>>    >   operations.
>>    >
>>    >   But yes we need another engine for the immediate/non-sparse
>>    operations.
>>    >
>>    >   -Lionel
>>    >
>>    >         >
>>    >       Daniel, any thoughts?
>>    >
>>    >       Niranjana
>>    >
>>    >       >Matt
>>    >       >
>>    >       >>
>>    >       >> Sorry I noticed this late.
>>    >       >>
>>    >       >>
>>    >       >> -Lionel
>>    >       >>
>>    >       >>

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [Intel-gfx] [RFC v3 3/3] drm/doc/rfc: VM_BIND uapi definition
  2022-06-03  6:53             ` Niranjana Vishwanathapura
  2022-06-07 10:42               ` Tvrtko Ursulin
@ 2022-06-08  6:40               ` Lionel Landwerlin
  2022-06-08  6:43                 ` Lionel Landwerlin
  2022-06-08  8:36                 ` Tvrtko Ursulin
  2022-06-08  7:12               ` Lionel Landwerlin
  2 siblings, 2 replies; 121+ messages in thread
From: Lionel Landwerlin @ 2022-06-08  6:40 UTC (permalink / raw)
  To: Niranjana Vishwanathapura, Daniel Vetter
  Cc: Zanoni, Paulo R, intel-gfx, dri-devel, Hellstrom, Thomas, Wilson,
	Chris P, Vetter, Daniel, christian.koenig

On 03/06/2022 09:53, Niranjana Vishwanathapura wrote:
> On Wed, Jun 01, 2022 at 10:08:35PM -0700, Niranjana Vishwanathapura 
> wrote:
>> On Wed, Jun 01, 2022 at 11:27:17AM +0200, Daniel Vetter wrote:
>>> On Wed, 1 Jun 2022 at 11:03, Dave Airlie <airlied@gmail.com> wrote:
>>>>
>>>> On Tue, 24 May 2022 at 05:20, Niranjana Vishwanathapura
>>>> <niranjana.vishwanathapura@intel.com> wrote:
>>>>>
>>>>> On Thu, May 19, 2022 at 04:07:30PM -0700, Zanoni, Paulo R wrote:
>>>>> >On Tue, 2022-05-17 at 11:32 -0700, Niranjana Vishwanathapura wrote:
>>>>> >> VM_BIND and related uapi definitions
>>>>> >>
>>>>> >> v2: Ensure proper kernel-doc formatting with cross references.
>>>>> >>     Also add new uapi and documentation as per review comments
>>>>> >>     from Daniel.
>>>>> >>
>>>>> >> Signed-off-by: Niranjana Vishwanathapura 
>>>>> <niranjana.vishwanathapura@intel.com>
>>>>> >> ---
>>>>> >>  Documentation/gpu/rfc/i915_vm_bind.h | 399 
>>>>> +++++++++++++++++++++++++++
>>>>> >>  1 file changed, 399 insertions(+)
>>>>> >>  create mode 100644 Documentation/gpu/rfc/i915_vm_bind.h
>>>>> >>
>>>>> >> diff --git a/Documentation/gpu/rfc/i915_vm_bind.h 
>>>>> b/Documentation/gpu/rfc/i915_vm_bind.h
>>>>> >> new file mode 100644
>>>>> >> index 000000000000..589c0a009107
>>>>> >> --- /dev/null
>>>>> >> +++ b/Documentation/gpu/rfc/i915_vm_bind.h
>>>>> >> @@ -0,0 +1,399 @@
>>>>> >> +/* SPDX-License-Identifier: MIT */
>>>>> >> +/*
>>>>> >> + * Copyright © 2022 Intel Corporation
>>>>> >> + */
>>>>> >> +
>>>>> >> +/**
>>>>> >> + * DOC: I915_PARAM_HAS_VM_BIND
>>>>> >> + *
>>>>> >> + * VM_BIND feature availability.
>>>>> >> + * See typedef drm_i915_getparam_t param.
>>>>> >> + */
>>>>> >> +#define I915_PARAM_HAS_VM_BIND               57
>>>>> >> +
>>>>> >> +/**
>>>>> >> + * DOC: I915_VM_CREATE_FLAGS_USE_VM_BIND
>>>>> >> + *
>>>>> >> + * Flag to opt-in for VM_BIND mode of binding during VM creation.
>>>>> >> + * See struct drm_i915_gem_vm_control flags.
>>>>> >> + *
>>>>> >> + * A VM in VM_BIND mode will not support the older execbuff 
>>>>> mode of binding.
>>>>> >> + * In VM_BIND mode, execbuff ioctl will not accept any 
>>>>> execlist (ie., the
>>>>> >> + * &drm_i915_gem_execbuffer2.buffer_count must be 0).
>>>>> >> + * Also, &drm_i915_gem_execbuffer2.batch_start_offset and
>>>>> >> + * &drm_i915_gem_execbuffer2.batch_len must be 0.
>>>>> >> + * DRM_I915_GEM_EXECBUFFER_EXT_BATCH_ADDRESSES extension must 
>>>>> be provided
>>>>> >> + * to pass in the batch buffer addresses.
>>>>> >> + *
>>>>> >> + * Additionally, I915_EXEC_NO_RELOC, I915_EXEC_HANDLE_LUT and
>>>>> >> + * I915_EXEC_BATCH_FIRST of &drm_i915_gem_execbuffer2.flags 
>>>>> must be 0
>>>>> >> + * (not used) in VM_BIND mode. I915_EXEC_USE_EXTENSIONS flag 
>>>>> must always be
>>>>> >> + * set (See struct drm_i915_gem_execbuffer_ext_batch_addresses).
>>>>> >> + * The buffers_ptr, buffer_count, batch_start_offset and 
>>>>> batch_len fields
>>>>> >> + * of struct drm_i915_gem_execbuffer2 are also not used and 
>>>>> must be 0.
>>>>> >> + */
>>>>> >
>>>>> >From that description, it seems we have:
>>>>> >
>>>>> >struct drm_i915_gem_execbuffer2 {
>>>>> >        __u64 buffers_ptr;              -> must be 0 (new)
>>>>> >        __u32 buffer_count;             -> must be 0 (new)
>>>>> >        __u32 batch_start_offset;       -> must be 0 (new)
>>>>> >        __u32 batch_len;                -> must be 0 (new)
>>>>> >        __u32 DR1;                      -> must be 0 (old)
>>>>> >        __u32 DR4;                      -> must be 0 (old)
>>>>> >        __u32 num_cliprects; (fences)   -> must be 0 since using 
>>>>> extensions
>>>>> >        __u64 cliprects_ptr; (fences, extensions) -> contains an 
>>>>> actual pointer!
>>>>> >        __u64 flags;                    -> some flags must be 0 
>>>>> (new)
>>>>> >        __u64 rsvd1; (context info)     -> repurposed field (old)
>>>>> >        __u64 rsvd2;                    -> unused
>>>>> >};
>>>>> >
>>>>> >Based on that, why can't we just get drm_i915_gem_execbuffer3 
>>>>> instead
>>>>> >of adding even more complexity to an already abused interface? While
>>>>> >the Vulkan-like extension thing is really nice, I don't think what
>>>>> >we're doing here is extending the ioctl usage, we're completely
>>>>> >changing how the base struct should be interpreted based on how 
>>>>> the VM
>>>>> >was created (which is an entirely different ioctl).
>>>>> >
>>>>> >From Rusty Russel's API Design grading, drm_i915_gem_execbuffer2 is
>>>>> >already at -6 without these changes. I think after vm_bind we'll 
>>>>> need
>>>>> >to create a -11 entry just to deal with this ioctl.
>>>>> >
>>>>>
>>>>> The only change here is removing the execlist support for VM_BIND
>>>>> mode (other than natual extensions).
>>>>> Adding a new execbuffer3 was considered, but I think we need to be 
>>>>> careful
>>>>> with that as that goes beyond the VM_BIND support, including any 
>>>>> future
>>>>> requirements (as we don't want an execbuffer4 after VM_BIND).
>>>>
>>>> Why not? it's not like adding extensions here is really that different
>>>> than adding new ioctls.
>>>>
>>>> I definitely think this deserves an execbuffer3 without even
>>>> considering future requirements. Just  to burn down the old
>>>> requirements and pointless fields.
>>>>
>>>> Make execbuffer3 be vm bind only, no relocs, no legacy bits, leave the
>>>> older sw on execbuf2 for ever.
>>>
>>> I guess another point in favour of execbuf3 would be that it's less
>>> midlayer. If we share the entry point then there's quite a few vfuncs
>>> needed to cleanly split out the vm_bind paths from the legacy
>>> reloc/softping paths.
>>>
>>> If we invert this and do execbuf3, then there's the existing ioctl
>>> vfunc, and then we share code (where it even makes sense, probably
>>> request setup/submit need to be shared, anything else is probably
>>> cleaner to just copypaste) with the usual helper approach.
>>>
>>> Also that would guarantee that really none of the old concepts like
>>> i915_active on the vma or vma open counts and all that stuff leaks
>>> into the new vm_bind execbuf.
>>>
>>> Finally I also think that copypasting would make backporting easier,
>>> or at least more flexible, since it should make it easier to have the
>>> upstream vm_bind co-exist with all the other things we have. Without
>>> huge amounts of conflicts (or at least much less) that pushing a pile
>>> of vfuncs into the existing code would cause.
>>>
>>> So maybe we should do this?
>>
>> Thanks Dave, Daniel.
>> There are a few things that will be common between execbuf2 and
>> execbuf3, like request setup/submit (as you said), fence handling 
>> (timeline fences, fence array, composite fences), engine selection,
>> etc. Also, many of the 'flags' will be there in execbuf3 also (but
>> bit position will differ).
>> But I guess these should be fine as the suggestion here is to
>> copy-paste the execbuff code and having a shared code where possible.
>> Besides, we can stop supporting some older feature in execbuff3
>> (like fence array in favor of newer timeline fences), which will
>> further reduce common code.
>>
>> Ok, I will update this series by adding execbuf3 and send out soon.
>>
>
> Does this sound reasonable?


Thanks for proposing this. Some comments below.


>
> struct drm_i915_gem_execbuffer3 {
>        __u32 ctx_id;        /* previously execbuffer2.rsvd1 */
>
>        __u32 batch_count;
>        __u64 batch_addr_ptr;    /* Pointer to an array of batch gpu 
> virtual addresses */
>
>        __u64 flags;
> #define I915_EXEC3_RING_MASK              (0x3f)
> #define I915_EXEC3_DEFAULT                (0<<0)
> #define I915_EXEC3_RENDER                 (1<<0)
> #define I915_EXEC3_BSD                    (2<<0)
> #define I915_EXEC3_BLT                    (3<<0)
> #define I915_EXEC3_VEBOX                  (4<<0)


Shouldn't we use the new engine selection uAPI instead?

We can already create an engine map with I915_CONTEXT_PARAM_ENGINES in 
drm_i915_gem_context_create_ext_setparam.

And you can also create virtual engines with the same extension.

It feels like this could be a single u32 with the engine index (in the 
context engine map).


>
> #define I915_EXEC3_SECURE               (1<<6)
> #define I915_EXEC3_IS_PINNED            (1<<7)


What's the meaning of PINNED?


>
> #define I915_EXEC3_BSD_SHIFT     (8)
> #define I915_EXEC3_BSD_MASK      (3 << I915_EXEC3_BSD_SHIFT)
> #define I915_EXEC3_BSD_DEFAULT   (0 << I915_EXEC3_BSD_SHIFT)
> #define I915_EXEC3_BSD_RING1     (1 << I915_EXEC3_BSD_SHIFT)
> #define I915_EXEC3_BSD_RING2     (2 << I915_EXEC3_BSD_SHIFT)
>
> #define I915_EXEC3_FENCE_IN             (1<<10)
> #define I915_EXEC3_FENCE_OUT            (1<<11)


For Mesa, as soon as we have DRM_I915_GEM_EXECBUFFER_EXT_TIMELINE_FENCES 
support, we only use that.

So there isn't much point for FENCE_IN/OUT.

Maybe check with other UMDs?


> #define I915_EXEC3_FENCE_SUBMIT         (1<<12)


What's FENCE_SUBMIT?


>
>        __u64 in_out_fence;        /* previously execbuffer2.rsvd2 */
>
>        __u64 extensions;        /* currently only for 
> DRM_I915_GEM_EXECBUFFER_EXT_TIMELINE_FENCES */
> };
>
> With this, user can pass in batch addresses and count directly,
> instead of as an extension (as this rfc series was proposing).
>
> I have removed many of the flags which were either legacy or not
> applicable to BM_BIND mode.
> I have also removed fence array support (execbuffer2.cliprects_ptr)
> as we have timeline fence array support. Is that fine?
> Do we still need FENCE_IN/FENCE_OUT/FENCE_SUBMIT support?
>
> Any thing else needs to be added or removed?
>
> Niranjana
>
>> Niranjana
>>
>>> -Daniel
>>> -- 
>>> Daniel Vetter
>>> Software Engineer, Intel Corporation
>>> http://blog.ffwll.ch



^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [Intel-gfx] [RFC v3 3/3] drm/doc/rfc: VM_BIND uapi definition
  2022-06-08  6:40               ` Lionel Landwerlin
@ 2022-06-08  6:43                 ` Lionel Landwerlin
  2022-06-08  8:36                 ` Tvrtko Ursulin
  1 sibling, 0 replies; 121+ messages in thread
From: Lionel Landwerlin @ 2022-06-08  6:43 UTC (permalink / raw)
  To: Niranjana Vishwanathapura, Daniel Vetter
  Cc: Zanoni, Paulo R, intel-gfx, dri-devel, Hellstrom, Thomas, Wilson,
	Chris P, Vetter, Daniel, christian.koenig

On 08/06/2022 09:40, Lionel Landwerlin wrote:
> On 03/06/2022 09:53, Niranjana Vishwanathapura wrote:
>> On Wed, Jun 01, 2022 at 10:08:35PM -0700, Niranjana Vishwanathapura 
>> wrote:
>>> On Wed, Jun 01, 2022 at 11:27:17AM +0200, Daniel Vetter wrote:
>>>> On Wed, 1 Jun 2022 at 11:03, Dave Airlie <airlied@gmail.com> wrote:
>>>>>
>>>>> On Tue, 24 May 2022 at 05:20, Niranjana Vishwanathapura
>>>>> <niranjana.vishwanathapura@intel.com> wrote:
>>>>>>
>>>>>> On Thu, May 19, 2022 at 04:07:30PM -0700, Zanoni, Paulo R wrote:
>>>>>> >On Tue, 2022-05-17 at 11:32 -0700, Niranjana Vishwanathapura wrote:
>>>>>> >> VM_BIND and related uapi definitions
>>>>>> >>
>>>>>> >> v2: Ensure proper kernel-doc formatting with cross references.
>>>>>> >>     Also add new uapi and documentation as per review comments
>>>>>> >>     from Daniel.
>>>>>> >>
>>>>>> >> Signed-off-by: Niranjana Vishwanathapura 
>>>>>> <niranjana.vishwanathapura@intel.com>
>>>>>> >> ---
>>>>>> >>  Documentation/gpu/rfc/i915_vm_bind.h | 399 
>>>>>> +++++++++++++++++++++++++++
>>>>>> >>  1 file changed, 399 insertions(+)
>>>>>> >>  create mode 100644 Documentation/gpu/rfc/i915_vm_bind.h
>>>>>> >>
>>>>>> >> diff --git a/Documentation/gpu/rfc/i915_vm_bind.h 
>>>>>> b/Documentation/gpu/rfc/i915_vm_bind.h
>>>>>> >> new file mode 100644
>>>>>> >> index 000000000000..589c0a009107
>>>>>> >> --- /dev/null
>>>>>> >> +++ b/Documentation/gpu/rfc/i915_vm_bind.h
>>>>>> >> @@ -0,0 +1,399 @@
>>>>>> >> +/* SPDX-License-Identifier: MIT */
>>>>>> >> +/*
>>>>>> >> + * Copyright © 2022 Intel Corporation
>>>>>> >> + */
>>>>>> >> +
>>>>>> >> +/**
>>>>>> >> + * DOC: I915_PARAM_HAS_VM_BIND
>>>>>> >> + *
>>>>>> >> + * VM_BIND feature availability.
>>>>>> >> + * See typedef drm_i915_getparam_t param.
>>>>>> >> + */
>>>>>> >> +#define I915_PARAM_HAS_VM_BIND 57
>>>>>> >> +
>>>>>> >> +/**
>>>>>> >> + * DOC: I915_VM_CREATE_FLAGS_USE_VM_BIND
>>>>>> >> + *
>>>>>> >> + * Flag to opt-in for VM_BIND mode of binding during VM 
>>>>>> creation.
>>>>>> >> + * See struct drm_i915_gem_vm_control flags.
>>>>>> >> + *
>>>>>> >> + * A VM in VM_BIND mode will not support the older execbuff 
>>>>>> mode of binding.
>>>>>> >> + * In VM_BIND mode, execbuff ioctl will not accept any 
>>>>>> execlist (ie., the
>>>>>> >> + * &drm_i915_gem_execbuffer2.buffer_count must be 0).
>>>>>> >> + * Also, &drm_i915_gem_execbuffer2.batch_start_offset and
>>>>>> >> + * &drm_i915_gem_execbuffer2.batch_len must be 0.
>>>>>> >> + * DRM_I915_GEM_EXECBUFFER_EXT_BATCH_ADDRESSES extension must 
>>>>>> be provided
>>>>>> >> + * to pass in the batch buffer addresses.
>>>>>> >> + *
>>>>>> >> + * Additionally, I915_EXEC_NO_RELOC, I915_EXEC_HANDLE_LUT and
>>>>>> >> + * I915_EXEC_BATCH_FIRST of &drm_i915_gem_execbuffer2.flags 
>>>>>> must be 0
>>>>>> >> + * (not used) in VM_BIND mode. I915_EXEC_USE_EXTENSIONS flag 
>>>>>> must always be
>>>>>> >> + * set (See struct drm_i915_gem_execbuffer_ext_batch_addresses).
>>>>>> >> + * The buffers_ptr, buffer_count, batch_start_offset and 
>>>>>> batch_len fields
>>>>>> >> + * of struct drm_i915_gem_execbuffer2 are also not used and 
>>>>>> must be 0.
>>>>>> >> + */
>>>>>> >
>>>>>> >From that description, it seems we have:
>>>>>> >
>>>>>> >struct drm_i915_gem_execbuffer2 {
>>>>>> >        __u64 buffers_ptr;              -> must be 0 (new)
>>>>>> >        __u32 buffer_count;             -> must be 0 (new)
>>>>>> >        __u32 batch_start_offset;       -> must be 0 (new)
>>>>>> >        __u32 batch_len;                -> must be 0 (new)
>>>>>> >        __u32 DR1;                      -> must be 0 (old)
>>>>>> >        __u32 DR4;                      -> must be 0 (old)
>>>>>> >        __u32 num_cliprects; (fences)   -> must be 0 since using 
>>>>>> extensions
>>>>>> >        __u64 cliprects_ptr; (fences, extensions) -> contains an 
>>>>>> actual pointer!
>>>>>> >        __u64 flags;                    -> some flags must be 0 
>>>>>> (new)
>>>>>> >        __u64 rsvd1; (context info)     -> repurposed field (old)
>>>>>> >        __u64 rsvd2;                    -> unused
>>>>>> >};
>>>>>> >
>>>>>> >Based on that, why can't we just get drm_i915_gem_execbuffer3 
>>>>>> instead
>>>>>> >of adding even more complexity to an already abused interface? 
>>>>>> While
>>>>>> >the Vulkan-like extension thing is really nice, I don't think what
>>>>>> >we're doing here is extending the ioctl usage, we're completely
>>>>>> >changing how the base struct should be interpreted based on how 
>>>>>> the VM
>>>>>> >was created (which is an entirely different ioctl).
>>>>>> >
>>>>>> >From Rusty Russel's API Design grading, drm_i915_gem_execbuffer2 is
>>>>>> >already at -6 without these changes. I think after vm_bind we'll 
>>>>>> need
>>>>>> >to create a -11 entry just to deal with this ioctl.
>>>>>> >
>>>>>>
>>>>>> The only change here is removing the execlist support for VM_BIND
>>>>>> mode (other than natual extensions).
>>>>>> Adding a new execbuffer3 was considered, but I think we need to 
>>>>>> be careful
>>>>>> with that as that goes beyond the VM_BIND support, including any 
>>>>>> future
>>>>>> requirements (as we don't want an execbuffer4 after VM_BIND).
>>>>>
>>>>> Why not? it's not like adding extensions here is really that 
>>>>> different
>>>>> than adding new ioctls.
>>>>>
>>>>> I definitely think this deserves an execbuffer3 without even
>>>>> considering future requirements. Just  to burn down the old
>>>>> requirements and pointless fields.
>>>>>
>>>>> Make execbuffer3 be vm bind only, no relocs, no legacy bits, leave 
>>>>> the
>>>>> older sw on execbuf2 for ever.
>>>>
>>>> I guess another point in favour of execbuf3 would be that it's less
>>>> midlayer. If we share the entry point then there's quite a few vfuncs
>>>> needed to cleanly split out the vm_bind paths from the legacy
>>>> reloc/softping paths.
>>>>
>>>> If we invert this and do execbuf3, then there's the existing ioctl
>>>> vfunc, and then we share code (where it even makes sense, probably
>>>> request setup/submit need to be shared, anything else is probably
>>>> cleaner to just copypaste) with the usual helper approach.
>>>>
>>>> Also that would guarantee that really none of the old concepts like
>>>> i915_active on the vma or vma open counts and all that stuff leaks
>>>> into the new vm_bind execbuf.
>>>>
>>>> Finally I also think that copypasting would make backporting easier,
>>>> or at least more flexible, since it should make it easier to have the
>>>> upstream vm_bind co-exist with all the other things we have. Without
>>>> huge amounts of conflicts (or at least much less) that pushing a pile
>>>> of vfuncs into the existing code would cause.
>>>>
>>>> So maybe we should do this?
>>>
>>> Thanks Dave, Daniel.
>>> There are a few things that will be common between execbuf2 and
>>> execbuf3, like request setup/submit (as you said), fence handling 
>>> (timeline fences, fence array, composite fences), engine selection,
>>> etc. Also, many of the 'flags' will be there in execbuf3 also (but
>>> bit position will differ).
>>> But I guess these should be fine as the suggestion here is to
>>> copy-paste the execbuff code and having a shared code where possible.
>>> Besides, we can stop supporting some older feature in execbuff3
>>> (like fence array in favor of newer timeline fences), which will
>>> further reduce common code.
>>>
>>> Ok, I will update this series by adding execbuf3 and send out soon.
>>>
>>
>> Does this sound reasonable?
>
>
> Thanks for proposing this. Some comments below.
>
>
>>
>> struct drm_i915_gem_execbuffer3 {
>>        __u32 ctx_id;        /* previously execbuffer2.rsvd1 */
>>
>>        __u32 batch_count;
>>        __u64 batch_addr_ptr;    /* Pointer to an array of batch gpu 
>> virtual addresses */
>>
>>        __u64 flags;
>> #define I915_EXEC3_RING_MASK              (0x3f)
>> #define I915_EXEC3_DEFAULT                (0<<0)
>> #define I915_EXEC3_RENDER                 (1<<0)
>> #define I915_EXEC3_BSD                    (2<<0)
>> #define I915_EXEC3_BLT                    (3<<0)
>> #define I915_EXEC3_VEBOX                  (4<<0)
>
>
> Shouldn't we use the new engine selection uAPI instead?
>
> We can already create an engine map with I915_CONTEXT_PARAM_ENGINES in 
> drm_i915_gem_context_create_ext_setparam.
>
> And you can also create virtual engines with the same extension.
>
> It feels like this could be a single u32 with the engine index (in the 
> context engine map).
>
>
>>
>> #define I915_EXEC3_SECURE               (1<<6)
>> #define I915_EXEC3_IS_PINNED            (1<<7)
>
>
> What's the meaning of PINNED?
>
>
>>
>> #define I915_EXEC3_BSD_SHIFT     (8)
>> #define I915_EXEC3_BSD_MASK      (3 << I915_EXEC3_BSD_SHIFT)
>> #define I915_EXEC3_BSD_DEFAULT   (0 << I915_EXEC3_BSD_SHIFT)
>> #define I915_EXEC3_BSD_RING1     (1 << I915_EXEC3_BSD_SHIFT)
>> #define I915_EXEC3_BSD_RING2     (2 << I915_EXEC3_BSD_SHIFT)
>>
>> #define I915_EXEC3_FENCE_IN             (1<<10)
>> #define I915_EXEC3_FENCE_OUT            (1<<11)
>
>
> For Mesa, as soon as we have 
> DRM_I915_GEM_EXECBUFFER_EXT_TIMELINE_FENCES support, we only use that.
>
> So there isn't much point for FENCE_IN/OUT.
>
> Maybe check with other UMDs?


Correcting myself a bit here :

     - iris uses I915_EXEC_FENCE_ARRAY

     - anv uses I915_EXEC_FENCE_ARRAY or 
DRM_I915_GEM_EXECBUFFER_EXT_TIMELINE_FENCES


In either case we could easily switch to 
DRM_I915_GEM_EXECBUFFER_EXT_TIMELINE_FENCES all the time.


>
>
>> #define I915_EXEC3_FENCE_SUBMIT (1<<12)
>
>
> What's FENCE_SUBMIT?
>
>
>>
>>        __u64 in_out_fence;        /* previously execbuffer2.rsvd2 */
>>
>>        __u64 extensions;        /* currently only for 
>> DRM_I915_GEM_EXECBUFFER_EXT_TIMELINE_FENCES */
>> };
>>
>> With this, user can pass in batch addresses and count directly,
>> instead of as an extension (as this rfc series was proposing).
>>
>> I have removed many of the flags which were either legacy or not
>> applicable to BM_BIND mode.
>> I have also removed fence array support (execbuffer2.cliprects_ptr)
>> as we have timeline fence array support. Is that fine?
>> Do we still need FENCE_IN/FENCE_OUT/FENCE_SUBMIT support?
>>
>> Any thing else needs to be added or removed?
>>
>> Niranjana
>>
>>> Niranjana
>>>
>>>> -Daniel
>>>> -- 
>>>> Daniel Vetter
>>>> Software Engineer, Intel Corporation
>>>> http://blog.ffwll.ch
>
>


^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [Intel-gfx] [RFC v3 3/3] drm/doc/rfc: VM_BIND uapi definition
  2022-06-03  6:53             ` Niranjana Vishwanathapura
  2022-06-07 10:42               ` Tvrtko Ursulin
  2022-06-08  6:40               ` Lionel Landwerlin
@ 2022-06-08  7:12               ` Lionel Landwerlin
  2022-06-08 21:24                   ` Matthew Brost
  2 siblings, 1 reply; 121+ messages in thread
From: Lionel Landwerlin @ 2022-06-08  7:12 UTC (permalink / raw)
  To: Niranjana Vishwanathapura, Daniel Vetter
  Cc: Zanoni, Paulo R, intel-gfx, dri-devel, Hellstrom, Thomas, Wilson,
	Chris P, Vetter, Daniel, christian.koenig

On 03/06/2022 09:53, Niranjana Vishwanathapura wrote:
> On Wed, Jun 01, 2022 at 10:08:35PM -0700, Niranjana Vishwanathapura 
> wrote:
>> On Wed, Jun 01, 2022 at 11:27:17AM +0200, Daniel Vetter wrote:
>>> On Wed, 1 Jun 2022 at 11:03, Dave Airlie <airlied@gmail.com> wrote:
>>>>
>>>> On Tue, 24 May 2022 at 05:20, Niranjana Vishwanathapura
>>>> <niranjana.vishwanathapura@intel.com> wrote:
>>>>>
>>>>> On Thu, May 19, 2022 at 04:07:30PM -0700, Zanoni, Paulo R wrote:
>>>>> >On Tue, 2022-05-17 at 11:32 -0700, Niranjana Vishwanathapura wrote:
>>>>> >> VM_BIND and related uapi definitions
>>>>> >>
>>>>> >> v2: Ensure proper kernel-doc formatting with cross references.
>>>>> >>     Also add new uapi and documentation as per review comments
>>>>> >>     from Daniel.
>>>>> >>
>>>>> >> Signed-off-by: Niranjana Vishwanathapura 
>>>>> <niranjana.vishwanathapura@intel.com>
>>>>> >> ---
>>>>> >>  Documentation/gpu/rfc/i915_vm_bind.h | 399 
>>>>> +++++++++++++++++++++++++++
>>>>> >>  1 file changed, 399 insertions(+)
>>>>> >>  create mode 100644 Documentation/gpu/rfc/i915_vm_bind.h
>>>>> >>
>>>>> >> diff --git a/Documentation/gpu/rfc/i915_vm_bind.h 
>>>>> b/Documentation/gpu/rfc/i915_vm_bind.h
>>>>> >> new file mode 100644
>>>>> >> index 000000000000..589c0a009107
>>>>> >> --- /dev/null
>>>>> >> +++ b/Documentation/gpu/rfc/i915_vm_bind.h
>>>>> >> @@ -0,0 +1,399 @@
>>>>> >> +/* SPDX-License-Identifier: MIT */
>>>>> >> +/*
>>>>> >> + * Copyright © 2022 Intel Corporation
>>>>> >> + */
>>>>> >> +
>>>>> >> +/**
>>>>> >> + * DOC: I915_PARAM_HAS_VM_BIND
>>>>> >> + *
>>>>> >> + * VM_BIND feature availability.
>>>>> >> + * See typedef drm_i915_getparam_t param.
>>>>> >> + */
>>>>> >> +#define I915_PARAM_HAS_VM_BIND               57
>>>>> >> +
>>>>> >> +/**
>>>>> >> + * DOC: I915_VM_CREATE_FLAGS_USE_VM_BIND
>>>>> >> + *
>>>>> >> + * Flag to opt-in for VM_BIND mode of binding during VM creation.
>>>>> >> + * See struct drm_i915_gem_vm_control flags.
>>>>> >> + *
>>>>> >> + * A VM in VM_BIND mode will not support the older execbuff 
>>>>> mode of binding.
>>>>> >> + * In VM_BIND mode, execbuff ioctl will not accept any 
>>>>> execlist (ie., the
>>>>> >> + * &drm_i915_gem_execbuffer2.buffer_count must be 0).
>>>>> >> + * Also, &drm_i915_gem_execbuffer2.batch_start_offset and
>>>>> >> + * &drm_i915_gem_execbuffer2.batch_len must be 0.
>>>>> >> + * DRM_I915_GEM_EXECBUFFER_EXT_BATCH_ADDRESSES extension must 
>>>>> be provided
>>>>> >> + * to pass in the batch buffer addresses.
>>>>> >> + *
>>>>> >> + * Additionally, I915_EXEC_NO_RELOC, I915_EXEC_HANDLE_LUT and
>>>>> >> + * I915_EXEC_BATCH_FIRST of &drm_i915_gem_execbuffer2.flags 
>>>>> must be 0
>>>>> >> + * (not used) in VM_BIND mode. I915_EXEC_USE_EXTENSIONS flag 
>>>>> must always be
>>>>> >> + * set (See struct drm_i915_gem_execbuffer_ext_batch_addresses).
>>>>> >> + * The buffers_ptr, buffer_count, batch_start_offset and 
>>>>> batch_len fields
>>>>> >> + * of struct drm_i915_gem_execbuffer2 are also not used and 
>>>>> must be 0.
>>>>> >> + */
>>>>> >
>>>>> >From that description, it seems we have:
>>>>> >
>>>>> >struct drm_i915_gem_execbuffer2 {
>>>>> >        __u64 buffers_ptr;              -> must be 0 (new)
>>>>> >        __u32 buffer_count;             -> must be 0 (new)
>>>>> >        __u32 batch_start_offset;       -> must be 0 (new)
>>>>> >        __u32 batch_len;                -> must be 0 (new)
>>>>> >        __u32 DR1;                      -> must be 0 (old)
>>>>> >        __u32 DR4;                      -> must be 0 (old)
>>>>> >        __u32 num_cliprects; (fences)   -> must be 0 since using 
>>>>> extensions
>>>>> >        __u64 cliprects_ptr; (fences, extensions) -> contains an 
>>>>> actual pointer!
>>>>> >        __u64 flags;                    -> some flags must be 0 
>>>>> (new)
>>>>> >        __u64 rsvd1; (context info)     -> repurposed field (old)
>>>>> >        __u64 rsvd2;                    -> unused
>>>>> >};
>>>>> >
>>>>> >Based on that, why can't we just get drm_i915_gem_execbuffer3 
>>>>> instead
>>>>> >of adding even more complexity to an already abused interface? While
>>>>> >the Vulkan-like extension thing is really nice, I don't think what
>>>>> >we're doing here is extending the ioctl usage, we're completely
>>>>> >changing how the base struct should be interpreted based on how 
>>>>> the VM
>>>>> >was created (which is an entirely different ioctl).
>>>>> >
>>>>> >From Rusty Russel's API Design grading, drm_i915_gem_execbuffer2 is
>>>>> >already at -6 without these changes. I think after vm_bind we'll 
>>>>> need
>>>>> >to create a -11 entry just to deal with this ioctl.
>>>>> >
>>>>>
>>>>> The only change here is removing the execlist support for VM_BIND
>>>>> mode (other than natual extensions).
>>>>> Adding a new execbuffer3 was considered, but I think we need to be 
>>>>> careful
>>>>> with that as that goes beyond the VM_BIND support, including any 
>>>>> future
>>>>> requirements (as we don't want an execbuffer4 after VM_BIND).
>>>>
>>>> Why not? it's not like adding extensions here is really that different
>>>> than adding new ioctls.
>>>>
>>>> I definitely think this deserves an execbuffer3 without even
>>>> considering future requirements. Just  to burn down the old
>>>> requirements and pointless fields.
>>>>
>>>> Make execbuffer3 be vm bind only, no relocs, no legacy bits, leave the
>>>> older sw on execbuf2 for ever.
>>>
>>> I guess another point in favour of execbuf3 would be that it's less
>>> midlayer. If we share the entry point then there's quite a few vfuncs
>>> needed to cleanly split out the vm_bind paths from the legacy
>>> reloc/softping paths.
>>>
>>> If we invert this and do execbuf3, then there's the existing ioctl
>>> vfunc, and then we share code (where it even makes sense, probably
>>> request setup/submit need to be shared, anything else is probably
>>> cleaner to just copypaste) with the usual helper approach.
>>>
>>> Also that would guarantee that really none of the old concepts like
>>> i915_active on the vma or vma open counts and all that stuff leaks
>>> into the new vm_bind execbuf.
>>>
>>> Finally I also think that copypasting would make backporting easier,
>>> or at least more flexible, since it should make it easier to have the
>>> upstream vm_bind co-exist with all the other things we have. Without
>>> huge amounts of conflicts (or at least much less) that pushing a pile
>>> of vfuncs into the existing code would cause.
>>>
>>> So maybe we should do this?
>>
>> Thanks Dave, Daniel.
>> There are a few things that will be common between execbuf2 and
>> execbuf3, like request setup/submit (as you said), fence handling 
>> (timeline fences, fence array, composite fences), engine selection,
>> etc. Also, many of the 'flags' will be there in execbuf3 also (but
>> bit position will differ).
>> But I guess these should be fine as the suggestion here is to
>> copy-paste the execbuff code and having a shared code where possible.
>> Besides, we can stop supporting some older feature in execbuff3
>> (like fence array in favor of newer timeline fences), which will
>> further reduce common code.
>>
>> Ok, I will update this series by adding execbuf3 and send out soon.
>>
>
> Does this sound reasonable?
>
> struct drm_i915_gem_execbuffer3 {
>        __u32 ctx_id;        /* previously execbuffer2.rsvd1 */
>
>        __u32 batch_count;
>        __u64 batch_addr_ptr;    /* Pointer to an array of batch gpu 
> virtual addresses */


Quick question raised on IRC about the batches : Are multiple batches 
limited to virtual engines?


Thanks,


-Lionel


>
>        __u64 flags;
> #define I915_EXEC3_RING_MASK              (0x3f)
> #define I915_EXEC3_DEFAULT                (0<<0)
> #define I915_EXEC3_RENDER                 (1<<0)
> #define I915_EXEC3_BSD                    (2<<0)
> #define I915_EXEC3_BLT                    (3<<0)
> #define I915_EXEC3_VEBOX                  (4<<0)
>
> #define I915_EXEC3_SECURE               (1<<6)
> #define I915_EXEC3_IS_PINNED            (1<<7)
>
> #define I915_EXEC3_BSD_SHIFT     (8)
> #define I915_EXEC3_BSD_MASK      (3 << I915_EXEC3_BSD_SHIFT)
> #define I915_EXEC3_BSD_DEFAULT   (0 << I915_EXEC3_BSD_SHIFT)
> #define I915_EXEC3_BSD_RING1     (1 << I915_EXEC3_BSD_SHIFT)
> #define I915_EXEC3_BSD_RING2     (2 << I915_EXEC3_BSD_SHIFT)
>
> #define I915_EXEC3_FENCE_IN             (1<<10)
> #define I915_EXEC3_FENCE_OUT            (1<<11)
> #define I915_EXEC3_FENCE_SUBMIT         (1<<12)
>
>        __u64 in_out_fence;        /* previously execbuffer2.rsvd2 */
>
>        __u64 extensions;        /* currently only for 
> DRM_I915_GEM_EXECBUFFER_EXT_TIMELINE_FENCES */
> };
>
> With this, user can pass in batch addresses and count directly,
> instead of as an extension (as this rfc series was proposing).
>
> I have removed many of the flags which were either legacy or not
> applicable to BM_BIND mode.
> I have also removed fence array support (execbuffer2.cliprects_ptr)
> as we have timeline fence array support. Is that fine?
> Do we still need FENCE_IN/FENCE_OUT/FENCE_SUBMIT support?
>
> Any thing else needs to be added or removed?
>
> Niranjana
>
>> Niranjana
>>
>>> -Daniel
>>> -- 
>>> Daniel Vetter
>>> Software Engineer, Intel Corporation
>>> http://blog.ffwll.ch



^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [Intel-gfx] [RFC v3 3/3] drm/doc/rfc: VM_BIND uapi definition
  2022-06-07 19:37     ` Niranjana Vishwanathapura
@ 2022-06-08  7:17       ` Tvrtko Ursulin
  2022-06-08  9:12         ` Matthew Auld
  0 siblings, 1 reply; 121+ messages in thread
From: Tvrtko Ursulin @ 2022-06-08  7:17 UTC (permalink / raw)
  To: Niranjana Vishwanathapura
  Cc: intel-gfx, chris.p.wilson, thomas.hellstrom, dri-devel,
	daniel.vetter, christian.koenig, Matthew Auld


On 07/06/2022 20:37, Niranjana Vishwanathapura wrote:
> On Tue, Jun 07, 2022 at 11:27:14AM +0100, Tvrtko Ursulin wrote:
>>
>> On 17/05/2022 19:32, Niranjana Vishwanathapura wrote:
>>> VM_BIND and related uapi definitions
>>>
>>> v2: Ensure proper kernel-doc formatting with cross references.
>>>     Also add new uapi and documentation as per review comments
>>>     from Daniel.
>>>
>>> Signed-off-by: Niranjana Vishwanathapura 
>>> <niranjana.vishwanathapura@intel.com>
>>> ---
>>>  Documentation/gpu/rfc/i915_vm_bind.h | 399 +++++++++++++++++++++++++++
>>>  1 file changed, 399 insertions(+)
>>>  create mode 100644 Documentation/gpu/rfc/i915_vm_bind.h
>>>
>>> diff --git a/Documentation/gpu/rfc/i915_vm_bind.h 
>>> b/Documentation/gpu/rfc/i915_vm_bind.h
>>> new file mode 100644
>>> index 000000000000..589c0a009107
>>> --- /dev/null
>>> +++ b/Documentation/gpu/rfc/i915_vm_bind.h
>>> @@ -0,0 +1,399 @@
>>> +/* SPDX-License-Identifier: MIT */
>>> +/*
>>> + * Copyright © 2022 Intel Corporation
>>> + */
>>> +
>>> +/**
>>> + * DOC: I915_PARAM_HAS_VM_BIND
>>> + *
>>> + * VM_BIND feature availability.
>>> + * See typedef drm_i915_getparam_t param.
>>> + */
>>> +#define I915_PARAM_HAS_VM_BIND        57
>>> +
>>> +/**
>>> + * DOC: I915_VM_CREATE_FLAGS_USE_VM_BIND
>>> + *
>>> + * Flag to opt-in for VM_BIND mode of binding during VM creation.
>>> + * See struct drm_i915_gem_vm_control flags.
>>> + *
>>> + * A VM in VM_BIND mode will not support the older execbuff mode of 
>>> binding.
>>> + * In VM_BIND mode, execbuff ioctl will not accept any execlist 
>>> (ie., the
>>> + * &drm_i915_gem_execbuffer2.buffer_count must be 0).
>>> + * Also, &drm_i915_gem_execbuffer2.batch_start_offset and
>>> + * &drm_i915_gem_execbuffer2.batch_len must be 0.
>>> + * DRM_I915_GEM_EXECBUFFER_EXT_BATCH_ADDRESSES extension must be 
>>> provided
>>> + * to pass in the batch buffer addresses.
>>> + *
>>> + * Additionally, I915_EXEC_NO_RELOC, I915_EXEC_HANDLE_LUT and
>>> + * I915_EXEC_BATCH_FIRST of &drm_i915_gem_execbuffer2.flags must be 0
>>> + * (not used) in VM_BIND mode. I915_EXEC_USE_EXTENSIONS flag must 
>>> always be
>>> + * set (See struct drm_i915_gem_execbuffer_ext_batch_addresses).
>>> + * The buffers_ptr, buffer_count, batch_start_offset and batch_len 
>>> fields
>>> + * of struct drm_i915_gem_execbuffer2 are also not used and must be 0.
>>> + */
>>> +#define I915_VM_CREATE_FLAGS_USE_VM_BIND    (1 << 0)
>>> +
>>> +/**
>>> + * DOC: I915_CONTEXT_CREATE_FLAGS_LONG_RUNNING
>>> + *
>>> + * Flag to declare context as long running.
>>> + * See struct drm_i915_gem_context_create_ext flags.
>>> + *
>>> + * Usage of dma-fence expects that they complete in reasonable 
>>> amount of time.
>>> + * Compute on the other hand can be long running. Hence it is not 
>>> appropriate
>>> + * for compute contexts to export request completion dma-fence to user.
>>> + * The dma-fence usage will be limited to in-kernel consumption only.
>>> + * Compute contexts need to use user/memory fence.
>>> + *
>>> + * So, long running contexts do not support output fences. Hence,
>>> + * I915_EXEC_FENCE_OUT (See &drm_i915_gem_execbuffer2.flags and
>>> + * I915_EXEC_FENCE_SIGNAL (See &drm_i915_gem_exec_fence.flags) are 
>>> expected
>>> + * to be not used.
>>> + *
>>> + * DRM_I915_GEM_WAIT ioctl call is also not supported for objects 
>>> mapped
>>> + * to long running contexts.
>>> + */
>>> +#define I915_CONTEXT_CREATE_FLAGS_LONG_RUNNING   (1u << 2)
>>> +
>>> +/* VM_BIND related ioctls */
>>> +#define DRM_I915_GEM_VM_BIND        0x3d
>>> +#define DRM_I915_GEM_VM_UNBIND        0x3e
>>> +#define DRM_I915_GEM_WAIT_USER_FENCE    0x3f
>>> +
>>> +#define DRM_IOCTL_I915_GEM_VM_BIND        DRM_IOWR(DRM_COMMAND_BASE 
>>> + DRM_I915_GEM_VM_BIND, struct drm_i915_gem_vm_bind)
>>> +#define DRM_IOCTL_I915_GEM_VM_UNBIND        
>>> DRM_IOWR(DRM_COMMAND_BASE + DRM_I915_GEM_VM_UNBIND, struct 
>>> drm_i915_gem_vm_bind)
>>> +#define DRM_IOCTL_I915_GEM_WAIT_USER_FENCE    
>>> DRM_IOWR(DRM_COMMAND_BASE + DRM_I915_GEM_WAIT_USER_FENCE, struct 
>>> drm_i915_gem_wait_user_fence)
>>> +
>>> +/**
>>> + * struct drm_i915_gem_vm_bind - VA to object mapping to bind.
>>> + *
>>> + * This structure is passed to VM_BIND ioctl and specifies the 
>>> mapping of GPU
>>> + * virtual address (VA) range to the section of an object that 
>>> should be bound
>>> + * in the device page table of the specified address space (VM).
>>> + * The VA range specified must be unique (ie., not currently bound) 
>>> and can
>>> + * be mapped to whole object or a section of the object (partial 
>>> binding).
>>> + * Multiple VA mappings can be created to the same section of the 
>>> object
>>> + * (aliasing).
>>> + */
>>> +struct drm_i915_gem_vm_bind {
>>> +    /** @vm_id: VM (address space) id to bind */
>>> +    __u32 vm_id;
>>> +
>>> +    /** @handle: Object handle */
>>> +    __u32 handle;
>>> +
>>> +    /** @start: Virtual Address start to bind */
>>> +    __u64 start;
>>> +
>>> +    /** @offset: Offset in object to bind */
>>> +    __u64 offset;
>>> +
>>> +    /** @length: Length of mapping to bind */
>>> +    __u64 length;
>>
>> Does it support, or should it, equivalent of EXEC_OBJECT_PAD_TO_SIZE? 
>> Or if not userspace is expected to map the remainder of the space to a 
>> dummy object? In which case would there be any alignment/padding 
>> issues preventing the two bind to be placed next to each other?
>>
>> I ask because someone from the compute side asked me about a problem 
>> with their strategy of dealing with overfetch and I suggested pad to 
>> size.
>>
> 
> Thanks Tvrtko,
> I think we shouldn't be needing it. As with VM_BIND VA assignment
> is completely pushed to userspace, no padding should be necessary
> once the 'start' and 'size' alignment conditions are met.
> 
> I will add some documentation on alignment requirement here.
> Generally, 'start' and 'size' should be 4K aligned. But, I think
> when we have 64K lmem page sizes (dg2 and xehpsdv), they need to
> be 64K aligned.

+ Matt

Align to 64k is enough for all overfetch issues?

Apparently compute has a situation where a buffer is received by one 
component and another has to apply more alignment to it, to deal with 
overfetch. Since they cannot grow the actual BO if they wanted to 
VM_BIND a scratch area on top? Or perhaps none of this is a problem on 
discrete and original BO should be correctly allocated to start with.

Side question - what about the align to 2MiB mentioned in 
i915_vma_insert to avoid mixing 4k and 64k PTEs? That does not apply to 
discrete?

Regards,

Tvrtko

> 
> Niranjana
> 
>> Regards,
>>
>> Tvrtko
>>
>>> +
>>> +    /**
>>> +     * @flags: Supported flags are,
>>> +     *
>>> +     * I915_GEM_VM_BIND_READONLY:
>>> +     * Mapping is read-only.
>>> +     *
>>> +     * I915_GEM_VM_BIND_CAPTURE:
>>> +     * Capture this mapping in the dump upon GPU error.
>>> +     */
>>> +    __u64 flags;
>>> +#define I915_GEM_VM_BIND_READONLY    (1 << 0)
>>> +#define I915_GEM_VM_BIND_CAPTURE     (1 << 1)
>>> +
>>> +    /** @extensions: 0-terminated chain of extensions for this 
>>> mapping. */
>>> +    __u64 extensions;
>>> +};
>>> +
>>> +/**
>>> + * struct drm_i915_gem_vm_unbind - VA to object mapping to unbind.
>>> + *
>>> + * This structure is passed to VM_UNBIND ioctl and specifies the GPU 
>>> virtual
>>> + * address (VA) range that should be unbound from the device page 
>>> table of the
>>> + * specified address space (VM). The specified VA range must match 
>>> one of the
>>> + * mappings created with the VM_BIND ioctl. TLB is flushed upon unbind
>>> + * completion.
>>> + */
>>> +struct drm_i915_gem_vm_unbind {
>>> +    /** @vm_id: VM (address space) id to bind */
>>> +    __u32 vm_id;
>>> +
>>> +    /** @rsvd: Reserved for future use; must be zero. */
>>> +    __u32 rsvd;
>>> +
>>> +    /** @start: Virtual Address start to unbind */
>>> +    __u64 start;
>>> +
>>> +    /** @length: Length of mapping to unbind */
>>> +    __u64 length;
>>> +
>>> +    /** @flags: reserved for future usage, currently MBZ */
>>> +    __u64 flags;
>>> +
>>> +    /** @extensions: 0-terminated chain of extensions for this 
>>> mapping. */
>>> +    __u64 extensions;
>>> +};
>>> +
>>> +/**
>>> + * struct drm_i915_vm_bind_fence - An input or output fence for the 
>>> vm_bind
>>> + * or the vm_unbind work.
>>> + *
>>> + * The vm_bind or vm_unbind aync worker will wait for input fence to 
>>> signal
>>> + * before starting the binding or unbinding.
>>> + *
>>> + * The vm_bind or vm_unbind async worker will signal the returned 
>>> output fence
>>> + * after the completion of binding or unbinding.
>>> + */
>>> +struct drm_i915_vm_bind_fence {
>>> +    /** @handle: User's handle for a drm_syncobj to wait on or 
>>> signal. */
>>> +    __u32 handle;
>>> +
>>> +    /**
>>> +     * @flags: Supported flags are,
>>> +     *
>>> +     * I915_VM_BIND_FENCE_WAIT:
>>> +     * Wait for the input fence before binding/unbinding
>>> +     *
>>> +     * I915_VM_BIND_FENCE_SIGNAL:
>>> +     * Return bind/unbind completion fence as output
>>> +     */
>>> +    __u32 flags;
>>> +#define I915_VM_BIND_FENCE_WAIT            (1<<0)
>>> +#define I915_VM_BIND_FENCE_SIGNAL          (1<<1)
>>> +#define __I915_VM_BIND_FENCE_UNKNOWN_FLAGS 
>>> (-(I915_VM_BIND_FENCE_SIGNAL << 1))
>>> +};
>>> +
>>> +/**
>>> + * struct drm_i915_vm_bind_ext_timeline_fences - Timeline fences for 
>>> vm_bind
>>> + * and vm_unbind.
>>> + *
>>> + * This structure describes an array of timeline drm_syncobj and 
>>> associated
>>> + * points for timeline variants of drm_syncobj. These timeline 
>>> 'drm_syncobj's
>>> + * can be input or output fences (See struct drm_i915_vm_bind_fence).
>>> + */
>>> +struct drm_i915_vm_bind_ext_timeline_fences {
>>> +#define I915_VM_BIND_EXT_timeline_FENCES    0
>>> +    /** @base: Extension link. See struct i915_user_extension. */
>>> +    struct i915_user_extension base;
>>> +
>>> +    /**
>>> +     * @fence_count: Number of elements in the @handles_ptr & 
>>> @value_ptr
>>> +     * arrays.
>>> +     */
>>> +    __u64 fence_count;
>>> +
>>> +    /**
>>> +     * @handles_ptr: Pointer to an array of struct 
>>> drm_i915_vm_bind_fence
>>> +     * of length @fence_count.
>>> +     */
>>> +    __u64 handles_ptr;
>>> +
>>> +    /**
>>> +     * @values_ptr: Pointer to an array of u64 values of length
>>> +     * @fence_count.
>>> +     * Values must be 0 for a binary drm_syncobj. A Value of 0 for a
>>> +     * timeline drm_syncobj is invalid as it turns a drm_syncobj into a
>>> +     * binary one.
>>> +     */
>>> +    __u64 values_ptr;
>>> +};
>>> +
>>> +/**
>>> + * struct drm_i915_vm_bind_user_fence - An input or output user 
>>> fence for the
>>> + * vm_bind or the vm_unbind work.
>>> + *
>>> + * The vm_bind or vm_unbind aync worker will wait for the input 
>>> fence (value at
>>> + * @addr to become equal to @val) before starting the binding or 
>>> unbinding.
>>> + *
>>> + * The vm_bind or vm_unbind async worker will signal the output 
>>> fence after
>>> + * the completion of binding or unbinding by writing @val to memory 
>>> location at
>>> + * @addr
>>> + */
>>> +struct drm_i915_vm_bind_user_fence {
>>> +    /** @addr: User/Memory fence qword aligned process virtual 
>>> address */
>>> +    __u64 addr;
>>> +
>>> +    /** @val: User/Memory fence value to be written after bind 
>>> completion */
>>> +    __u64 val;
>>> +
>>> +    /**
>>> +     * @flags: Supported flags are,
>>> +     *
>>> +     * I915_VM_BIND_USER_FENCE_WAIT:
>>> +     * Wait for the input fence before binding/unbinding
>>> +     *
>>> +     * I915_VM_BIND_USER_FENCE_SIGNAL:
>>> +     * Return bind/unbind completion fence as output
>>> +     */
>>> +    __u32 flags;
>>> +#define I915_VM_BIND_USER_FENCE_WAIT            (1<<0)
>>> +#define I915_VM_BIND_USER_FENCE_SIGNAL          (1<<1)
>>> +#define __I915_VM_BIND_USER_FENCE_UNKNOWN_FLAGS \
>>> +    (-(I915_VM_BIND_USER_FENCE_SIGNAL << 1))
>>> +};
>>> +
>>> +/**
>>> + * struct drm_i915_vm_bind_ext_user_fence - User/memory fences for 
>>> vm_bind
>>> + * and vm_unbind.
>>> + *
>>> + * These user fences can be input or output fences
>>> + * (See struct drm_i915_vm_bind_user_fence).
>>> + */
>>> +struct drm_i915_vm_bind_ext_user_fence {
>>> +#define I915_VM_BIND_EXT_USER_FENCES    1
>>> +    /** @base: Extension link. See struct i915_user_extension. */
>>> +    struct i915_user_extension base;
>>> +
>>> +    /** @fence_count: Number of elements in the @user_fence_ptr 
>>> array. */
>>> +    __u64 fence_count;
>>> +
>>> +    /**
>>> +     * @user_fence_ptr: Pointer to an array of
>>> +     * struct drm_i915_vm_bind_user_fence of length @fence_count.
>>> +     */
>>> +    __u64 user_fence_ptr;
>>> +};
>>> +
>>> +/**
>>> + * struct drm_i915_gem_execbuffer_ext_batch_addresses - Array of 
>>> batch buffer
>>> + * gpu virtual addresses.
>>> + *
>>> + * In the execbuff ioctl (See struct drm_i915_gem_execbuffer2), this 
>>> extension
>>> + * must always be appended in the VM_BIND mode and it will be an 
>>> error to
>>> + * append this extension in older non-VM_BIND mode.
>>> + */
>>> +struct drm_i915_gem_execbuffer_ext_batch_addresses {
>>> +#define DRM_I915_GEM_EXECBUFFER_EXT_BATCH_ADDRESSES    1
>>> +    /** @base: Extension link. See struct i915_user_extension. */
>>> +    struct i915_user_extension base;
>>> +
>>> +    /** @count: Number of addresses in the addr array. */
>>> +    __u32 count;
>>> +
>>> +    /** @addr: An array of batch gpu virtual addresses. */
>>> +    __u64 addr[0];
>>> +};
>>> +
>>> +/**
>>> + * struct drm_i915_gem_execbuffer_ext_user_fence - First level batch 
>>> completion
>>> + * signaling extension.
>>> + *
>>> + * This extension allows user to attach a user fence (@addr, @value 
>>> pair) to an
>>> + * execbuf to be signaled by the command streamer after the 
>>> completion of first
>>> + * level batch, by writing the @value at specified @addr and 
>>> triggering an
>>> + * interrupt.
>>> + * User can either poll for this user fence to signal or can also 
>>> wait on it
>>> + * with i915_gem_wait_user_fence ioctl.
>>> + * This is very much usefaul for long running contexts where waiting 
>>> on dma-fence
>>> + * by user (like i915_gem_wait ioctl) is not supported.
>>> + */
>>> +struct drm_i915_gem_execbuffer_ext_user_fence {
>>> +#define DRM_I915_GEM_EXECBUFFER_EXT_USER_FENCE        2
>>> +    /** @base: Extension link. See struct i915_user_extension. */
>>> +    struct i915_user_extension base;
>>> +
>>> +    /**
>>> +     * @addr: User/Memory fence qword aligned GPU virtual address.
>>> +     *
>>> +     * Address has to be a valid GPU virtual address at the time of
>>> +     * first level batch completion.
>>> +     */
>>> +    __u64 addr;
>>> +
>>> +    /**
>>> +     * @value: User/Memory fence Value to be written to above address
>>> +     * after first level batch completes.
>>> +     */
>>> +    __u64 value;
>>> +
>>> +    /** @rsvd: Reserved for future extensions, MBZ */
>>> +    __u64 rsvd;
>>> +};
>>> +
>>> +/**
>>> + * struct drm_i915_gem_create_ext_vm_private - Extension to make the 
>>> object
>>> + * private to the specified VM.
>>> + *
>>> + * See struct drm_i915_gem_create_ext.
>>> + */
>>> +struct drm_i915_gem_create_ext_vm_private {
>>> +#define I915_GEM_CREATE_EXT_VM_PRIVATE        2
>>> +    /** @base: Extension link. See struct i915_user_extension. */
>>> +    struct i915_user_extension base;
>>> +
>>> +    /** @vm_id: Id of the VM to which the object is private */
>>> +    __u32 vm_id;
>>> +};
>>> +
>>> +/**
>>> + * struct drm_i915_gem_wait_user_fence - Wait on user/memory fence.
>>> + *
>>> + * User/Memory fence can be woken up either by:
>>> + *
>>> + * 1. GPU context indicated by @ctx_id, or,
>>> + * 2. Kerrnel driver async worker upon I915_UFENCE_WAIT_SOFT.
>>> + *    @ctx_id is ignored when this flag is set.
>>> + *
>>> + * Wakeup condition is,
>>> + * ``((*addr & mask) op (value & mask))``
>>> + *
>>> + * See :ref:`Documentation/driver-api/dma-buf.rst 
>>> <indefinite_dma_fences>`
>>> + */
>>> +struct drm_i915_gem_wait_user_fence {
>>> +    /** @extensions: Zero-terminated chain of extensions. */
>>> +    __u64 extensions;
>>> +
>>> +    /** @addr: User/Memory fence address */
>>> +    __u64 addr;
>>> +
>>> +    /** @ctx_id: Id of the Context which will signal the fence. */
>>> +    __u32 ctx_id;
>>> +
>>> +    /** @op: Wakeup condition operator */
>>> +    __u16 op;
>>> +#define I915_UFENCE_WAIT_EQ      0
>>> +#define I915_UFENCE_WAIT_NEQ     1
>>> +#define I915_UFENCE_WAIT_GT      2
>>> +#define I915_UFENCE_WAIT_GTE     3
>>> +#define I915_UFENCE_WAIT_LT      4
>>> +#define I915_UFENCE_WAIT_LTE     5
>>> +#define I915_UFENCE_WAIT_BEFORE  6
>>> +#define I915_UFENCE_WAIT_AFTER   7
>>> +
>>> +    /**
>>> +     * @flags: Supported flags are,
>>> +     *
>>> +     * I915_UFENCE_WAIT_SOFT:
>>> +     *
>>> +     * To be woken up by i915 driver async worker (not by GPU).
>>> +     *
>>> +     * I915_UFENCE_WAIT_ABSTIME:
>>> +     *
>>> +     * Wait timeout specified as absolute time.
>>> +     */
>>> +    __u16 flags;
>>> +#define I915_UFENCE_WAIT_SOFT    0x1
>>> +#define I915_UFENCE_WAIT_ABSTIME 0x2
>>> +
>>> +    /** @value: Wakeup value */
>>> +    __u64 value;
>>> +
>>> +    /** @mask: Wakeup mask */
>>> +    __u64 mask;
>>> +#define I915_UFENCE_WAIT_U8     0xffu
>>> +#define I915_UFENCE_WAIT_U16    0xffffu
>>> +#define I915_UFENCE_WAIT_U32    0xfffffffful
>>> +#define I915_UFENCE_WAIT_U64    0xffffffffffffffffull
>>> +
>>> +    /**
>>> +     * @timeout: Wait timeout in nanoseconds.
>>> +     *
>>> +     * If I915_UFENCE_WAIT_ABSTIME flag is set, then time timeout is 
>>> the
>>> +     * absolute time in nsec.
>>> +     */
>>> +    __s64 timeout;
>>> +};

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [Intel-gfx] [RFC v3 1/3] drm/doc/rfc: VM_BIND feature design document
  2022-06-07 21:32                   ` Niranjana Vishwanathapura
@ 2022-06-08  7:33                     ` Tvrtko Ursulin
  2022-06-08 21:44                         ` Niranjana Vishwanathapura
  0 siblings, 1 reply; 121+ messages in thread
From: Tvrtko Ursulin @ 2022-06-08  7:33 UTC (permalink / raw)
  To: Niranjana Vishwanathapura, Jason Ekstrand
  Cc: Intel GFX, Chris Wilson, Thomas Hellstrom,
	Maling list - DRI developers, Daniel Vetter,
	Christian König



On 07/06/2022 22:32, Niranjana Vishwanathapura wrote:
> On Tue, Jun 07, 2022 at 11:18:11AM -0700, Niranjana Vishwanathapura wrote:
>> On Tue, Jun 07, 2022 at 12:12:03PM -0500, Jason Ekstrand wrote:
>>>  On Fri, Jun 3, 2022 at 6:52 PM Niranjana Vishwanathapura
>>>  <niranjana.vishwanathapura@intel.com> wrote:
>>>
>>>    On Fri, Jun 03, 2022 at 10:20:25AM +0300, Lionel Landwerlin wrote:
>>>    >   On 02/06/2022 23:35, Jason Ekstrand wrote:
>>>    >
>>>    >     On Thu, Jun 2, 2022 at 3:11 PM Niranjana Vishwanathapura
>>>    >     <niranjana.vishwanathapura@intel.com> wrote:
>>>    >
>>>    >       On Wed, Jun 01, 2022 at 01:28:36PM -0700, Matthew Brost 
>>> wrote:
>>>    >       >On Wed, Jun 01, 2022 at 05:25:49PM +0300, Lionel Landwerlin
>>>    wrote:
>>>    >       >> On 17/05/2022 21:32, Niranjana Vishwanathapura wrote:
>>>    >       >> > +VM_BIND/UNBIND ioctl will immediately start
>>>    binding/unbinding
>>>    >       the mapping in an
>>>    >       >> > +async worker. The binding and unbinding will work 
>>> like a
>>>    special
>>>    >       GPU engine.
>>>    >       >> > +The binding and unbinding operations are serialized and
>>>    will
>>>    >       wait on specified
>>>    >       >> > +input fences before the operation and will signal the
>>>    output
>>>    >       fences upon the
>>>    >       >> > +completion of the operation. Due to serialization,
>>>    completion of
>>>    >       an operation
>>>    >       >> > +will also indicate that all previous operations are 
>>> also
>>>    >       complete.
>>>    >       >>
>>>    >       >> I guess we should avoid saying "will immediately start
>>>    >       binding/unbinding" if
>>>    >       >> there are fences involved.
>>>    >       >>
>>>    >       >> And the fact that it's happening in an async worker 
>>> seem to
>>>    imply
>>>    >       it's not
>>>    >       >> immediate.
>>>    >       >>
>>>    >
>>>    >       Ok, will fix.
>>>    >       This was added because in earlier design binding was deferred
>>>    until
>>>    >       next execbuff.
>>>    >       But now it is non-deferred (immediate in that sense). But 
>>> yah,
>>>    this is
>>>    >       confusing
>>>    >       and will fix it.
>>>    >
>>>    >       >>
>>>    >       >> I have a question on the behavior of the bind operation 
>>> when
>>>    no
>>>    >       input fence
>>>    >       >> is provided. Let say I do :
>>>    >       >>
>>>    >       >> VM_BIND (out_fence=fence1)
>>>    >       >>
>>>    >       >> VM_BIND (out_fence=fence2)
>>>    >       >>
>>>    >       >> VM_BIND (out_fence=fence3)
>>>    >       >>
>>>    >       >>
>>>    >       >> In what order are the fences going to be signaled?
>>>    >       >>
>>>    >       >> In the order of VM_BIND ioctls? Or out of order?
>>>    >       >>
>>>    >       >> Because you wrote "serialized I assume it's : in order
>>>    >       >>
>>>    >
>>>    >       Yes, in the order of VM_BIND/UNBIND ioctls. Note that bind 
>>> and
>>>    unbind
>>>    >       will use
>>>    >       the same queue and hence are ordered.
>>>    >
>>>    >       >>
>>>    >       >> One thing I didn't realize is that because we only get one
>>>    >       "VM_BIND" engine,
>>>    >       >> there is a disconnect from the Vulkan specification.
>>>    >       >>
>>>    >       >> In Vulkan VM_BIND operations are serialized but per 
>>> engine.
>>>    >       >>
>>>    >       >> So you could have something like this :
>>>    >       >>
>>>    >       >> VM_BIND (engine=rcs0, in_fence=fence1, out_fence=fence2)
>>>    >       >>
>>>    >       >> VM_BIND (engine=ccs0, in_fence=fence3, out_fence=fence4)
>>>    >       >>
>>>    >       >>
>>>    >       >> fence1 is not signaled
>>>    >       >>
>>>    >       >> fence3 is signaled
>>>    >       >>
>>>    >       >> So the second VM_BIND will proceed before the first 
>>> VM_BIND.
>>>    >       >>
>>>    >       >>
>>>    >       >> I guess we can deal with that scenario in userspace by 
>>> doing
>>>    the
>>>    >       wait
>>>    >       >> ourselves in one thread per engines.
>>>    >       >>
>>>    >       >> But then it makes the VM_BIND input fences useless.
>>>    >       >>
>>>    >       >>
>>>    >       >> Daniel : what do you think? Should be rework this or just
>>>    deal with
>>>    >       wait
>>>    >       >> fences in userspace?
>>>    >       >>
>>>    >       >
>>>    >       >My opinion is rework this but make the ordering via an 
>>> engine
>>>    param
>>>    >       optional.
>>>    >       >
>>>    >       >e.g. A VM can be configured so all binds are ordered 
>>> within the
>>>    VM
>>>    >       >
>>>    >       >e.g. A VM can be configured so all binds accept an engine
>>>    argument
>>>    >       (in
>>>    >       >the case of the i915 likely this is a gem context handle) 
>>> and
>>>    binds
>>>    >       >ordered with respect to that engine.
>>>    >       >
>>>    >       >This gives UMDs options as the later likely consumes more 
>>> KMD
>>>    >       resources
>>>    >       >so if a different UMD can live with binds being ordered 
>>> within
>>>    the VM
>>>    >       >they can use a mode consuming less resources.
>>>    >       >
>>>    >
>>>    >       I think we need to be careful here if we are looking for some
>>>    out of
>>>    >       (submission) order completion of vm_bind/unbind.
>>>    >       In-order completion means, in a batch of binds and unbinds 
>>> to be
>>>    >       completed in-order, user only needs to specify in-fence 
>>> for the
>>>    >       first bind/unbind call and the our-fence for the last
>>>    bind/unbind
>>>    >       call. Also, the VA released by an unbind call can be 
>>> re-used by
>>>    >       any subsequent bind call in that in-order batch.
>>>    >
>>>    >       These things will break if binding/unbinding were to be 
>>> allowed
>>>    to
>>>    >       go out of order (of submission) and user need to be extra
>>>    careful
>>>    >       not to run into pre-mature triggereing of out-fence and bind
>>>    failing
>>>    >       as VA is still in use etc.
>>>    >
>>>    >       Also, VM_BIND binds the provided mapping on the specified
>>>    address
>>>    >       space
>>>    >       (VM). So, the uapi is not engine/context specific.
>>>    >
>>>    >       We can however add a 'queue' to the uapi which can be one 
>>> from
>>>    the
>>>    >       pre-defined queues,
>>>    >       I915_VM_BIND_QUEUE_0
>>>    >       I915_VM_BIND_QUEUE_1
>>>    >       ...
>>>    >       I915_VM_BIND_QUEUE_(N-1)
>>>    >
>>>    >       KMD will spawn an async work queue for each queue which will
>>>    only
>>>    >       bind the mappings on that queue in the order of submission.
>>>    >       User can assign the queue to per engine or anything like 
>>> that.
>>>    >
>>>    >       But again here, user need to be careful and not deadlock 
>>> these
>>>    >       queues with circular dependency of fences.
>>>    >
>>>    >       I prefer adding this later an as extension based on 
>>> whether it
>>>    >       is really helping with the implementation.
>>>    >
>>>    >     I can tell you right now that having everything on a single
>>>    in-order
>>>    >     queue will not get us the perf we want.  What vulkan really 
>>> wants
>>>    is one
>>>    >     of two things:
>>>    >      1. No implicit ordering of VM_BIND ops.  They just happen in
>>>    whatever
>>>    >     their dependencies are resolved and we ensure ordering 
>>> ourselves
>>>    by
>>>    >     having a syncobj in the VkQueue.
>>>    >      2. The ability to create multiple VM_BIND queues.  We need at
>>>    least 2
>>>    >     but I don't see why there needs to be a limit besides the 
>>> limits
>>>    the
>>>    >     i915 API already has on the number of engines.  Vulkan could
>>>    expose
>>>    >     multiple sparse binding queues to the client if it's not
>>>    arbitrarily
>>>    >     limited.
>>>
>>>    Thanks Jason, Lionel.
>>>
>>>    Jason, what are you referring to when you say "limits the i915 API
>>>    already
>>>    has on the number of engines"? I am not sure if there is such an uapi
>>>    today.
>>>
>>>  There's a limit of something like 64 total engines today based on the
>>>  number of bits we can cram into the exec flags in execbuffer2.  I think
>>>  someone had an extended version that allowed more but I ripped it out
>>>  because no one was using it.  Of course, execbuffer3 might not have 
>>> that
>>>  problem at all.
>>>
>>
>> Thanks Jason.
>> Ok, I am not sure which exec flag is that, but yah, execbuffer3 probably
>> will not have this limiation. So, we need to define a VM_BIND_MAX_QUEUE
>> and somehow export it to user (I am thinking of embedding it in
>> I915_PARAM_HAS_VM_BIND. bits[0]->HAS_VM_BIND, bits[1-3]->'n' meaning 2^n
>> queues.
> 
> Ah, I think you are waking about I915_EXEC_RING_MASK (0x3f) which execbuf3
> will also have. So, we can simply define in vm_bind/unbind structures,
> 
> #define I915_VM_BIND_MAX_QUEUE   64
>         __u32 queue;
> 
> I think that will keep things simple.

Hmmm? What does execbuf2 limit has to do with how many engines hardware 
can have? I suggest not to do that.

Change with added this:

	if (set.num_engines > I915_EXEC_RING_MASK + 1)
		return -EINVAL;

To context creation needs to be undone and so let users create engine 
maps with all hardware engines, and let execbuf3 access them all.

Regards,

Tvrtko

> 
> Niranjana
> 
>>
>>>    I am trying to see how many queues we need and don't want it to be
>>>    arbitrarily
>>>    large and unduely blow up memory usage and complexity in i915 driver.
>>>
>>>  I expect a Vulkan driver to use at most 2 in the vast majority of 
>>> cases. I
>>>  could imagine a client wanting to create more than 1 sparse queue in 
>>> which
>>>  case, it'll be N+1 but that's unlikely.  As far as complexity goes, 
>>> once
>>>  you allow two, I don't think the complexity is going up by allowing 
>>> N.  As
>>>  for memory usage, creating more queues means more memory.  That's a
>>>  trade-off that userspace can make.  Again, the expected number here 
>>> is 1
>>>  or 2 in the vast majority of cases so I don't think you need to worry.
>>
>> Ok, will start with n=3 meaning 8 queues.
>> That would require us create 8 workqueues.
>> We can change 'n' later if required.
>>
>> Niranjana
>>
>>>
>>>    >     Why?  Because Vulkan has two basic kind of bind operations 
>>> and we
>>>    don't
>>>    >     want any dependencies between them:
>>>    >      1. Immediate.  These happen right after BO creation or 
>>> maybe as
>>>    part of
>>>    >     vkBindImageMemory() or VkBindBufferMemory().  These don't 
>>> happen
>>>    on a
>>>    >     queue and we don't want them serialized with anything.  To
>>>    synchronize
>>>    >     with submit, we'll have a syncobj in the VkDevice which is
>>>    signaled by
>>>    >     all immediate bind operations and make submits wait on it.
>>>    >      2. Queued (sparse): These happen on a VkQueue which may be the
>>>    same as
>>>    >     a render/compute queue or may be its own queue.  It's up to us
>>>    what we
>>>    >     want to advertise.  From the Vulkan API PoV, this is like any
>>>    other
>>>    >     queue.  Operations on it wait on and signal semaphores.  If we
>>>    have a
>>>    >     VM_BIND engine, we'd provide syncobjs to wait and signal 
>>> just like
>>>    we do
>>>    >     in execbuf().
>>>    >     The important thing is that we don't want one type of 
>>> operation to
>>>    block
>>>    >     on the other.  If immediate binds are blocking on sparse binds,
>>>    it's
>>>    >     going to cause over-synchronization issues.
>>>    >     In terms of the internal implementation, I know that there's 
>>> going
>>>    to be
>>>    >     a lock on the VM and that we can't actually do these things in
>>>    >     parallel.  That's fine.  Once the dma_fences have signaled and
>>>    we're
>>>
>>>    Thats correct. It is like a single VM_BIND engine with multiple 
>>> queues
>>>    feeding to it.
>>>
>>>  Right.  As long as the queues themselves are independent and can 
>>> block on
>>>  dma_fences without holding up other queues, I think we're fine.
>>>
>>>    >     unblocked to do the bind operation, I don't care if there's 
>>> a bit
>>>    of
>>>    >     synchronization due to locking.  That's expected.  What we 
>>> can't
>>>    afford
>>>    >     to have is an immediate bind operation suddenly blocking on a
>>>    sparse
>>>    >     operation which is blocked on a compute job that's going to run
>>>    for
>>>    >     another 5ms.
>>>
>>>    As the VM_BIND queue is per VM, VM_BIND on one VM doesn't block the
>>>    VM_BIND
>>>    on other VMs. I am not sure about usecases here, but just wanted to
>>>    clarify.
>>>
>>>  Yes, that's what I would expect.
>>>  --Jason
>>>
>>>    Niranjana
>>>
>>>    >     For reference, Windows solves this by allowing arbitrarily many
>>>    paging
>>>    >     queues (what they call a VM_BIND engine/queue).  That design 
>>> works
>>>    >     pretty well and solves the problems in question.  Again, we 
>>> could
>>>    just
>>>    >     make everything out-of-order and require using syncobjs to 
>>> order
>>>    things
>>>    >     as userspace wants. That'd be fine too.
>>>    >     One more note while I'm here: danvet said something on IRC 
>>> about
>>>    VM_BIND
>>>    >     queues waiting for syncobjs to materialize.  We don't really
>>>    want/need
>>>    >     this.  We already have all the machinery in userspace to handle
>>>    >     wait-before-signal and waiting for syncobj fences to 
>>> materialize
>>>    and
>>>    >     that machinery is on by default.  It would actually take 
>>> MORE work
>>>    in
>>>    >     Mesa to turn it off and take advantage of the kernel being 
>>> able to
>>>    wait
>>>    >     for syncobjs to materialize.  Also, getting that right is
>>>    ridiculously
>>>    >     hard and I really don't want to get it wrong in kernel 
>>> space.     When we
>>>    >     do memory fences, wait-before-signal will be a thing.  We don't
>>>    need to
>>>    >     try and make it a thing for syncobj.
>>>    >     --Jason
>>>    >
>>>    >   Thanks Jason,
>>>    >
>>>    >   I missed the bit in the Vulkan spec that we're allowed to have a
>>>    sparse
>>>    >   queue that does not implement either graphics or compute 
>>> operations
>>>    :
>>>    >
>>>    >     "While some implementations may include
>>>    VK_QUEUE_SPARSE_BINDING_BIT
>>>    >     support in queue families that also include
>>>    >
>>>    >      graphics and compute support, other implementations may only
>>>    expose a
>>>    >     VK_QUEUE_SPARSE_BINDING_BIT-only queue
>>>    >
>>>    >      family."
>>>    >
>>>    >   So it can all be all a vm_bind engine that just does bind/unbind
>>>    >   operations.
>>>    >
>>>    >   But yes we need another engine for the immediate/non-sparse
>>>    operations.
>>>    >
>>>    >   -Lionel
>>>    >
>>>    >         >
>>>    >       Daniel, any thoughts?
>>>    >
>>>    >       Niranjana
>>>    >
>>>    >       >Matt
>>>    >       >
>>>    >       >>
>>>    >       >> Sorry I noticed this late.
>>>    >       >>
>>>    >       >>
>>>    >       >> -Lionel
>>>    >       >>
>>>    >       >>

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [Intel-gfx] [RFC v3 3/3] drm/doc/rfc: VM_BIND uapi definition
  2022-06-07 21:25                 ` Niranjana Vishwanathapura
@ 2022-06-08  7:34                   ` Tvrtko Ursulin
  2022-06-08 19:52                     ` Niranjana Vishwanathapura
  0 siblings, 1 reply; 121+ messages in thread
From: Tvrtko Ursulin @ 2022-06-08  7:34 UTC (permalink / raw)
  To: Niranjana Vishwanathapura
  Cc: Zanoni, Paulo R, intel-gfx, dri-devel, Hellstrom, Thomas, Wilson,
	Chris P, Vetter, Daniel, christian.koenig


On 07/06/2022 22:25, Niranjana Vishwanathapura wrote:
> On Tue, Jun 07, 2022 at 11:42:08AM +0100, Tvrtko Ursulin wrote:
>>
>> On 03/06/2022 07:53, Niranjana Vishwanathapura wrote:
>>> On Wed, Jun 01, 2022 at 10:08:35PM -0700, Niranjana Vishwanathapura 
>>> wrote:
>>>> On Wed, Jun 01, 2022 at 11:27:17AM +0200, Daniel Vetter wrote:
>>>>> On Wed, 1 Jun 2022 at 11:03, Dave Airlie <airlied@gmail.com> wrote:
>>>>>>
>>>>>> On Tue, 24 May 2022 at 05:20, Niranjana Vishwanathapura
>>>>>> <niranjana.vishwanathapura@intel.com> wrote:
>>>>>>>
>>>>>>> On Thu, May 19, 2022 at 04:07:30PM -0700, Zanoni, Paulo R wrote:
>>>>>>>> On Tue, 2022-05-17 at 11:32 -0700, Niranjana Vishwanathapura wrote:
>>>>>>>>> VM_BIND and related uapi definitions
>>>>>>>>>
>>>>>>>>> v2: Ensure proper kernel-doc formatting with cross references.
>>>>>>>>>      Also add new uapi and documentation as per review comments
>>>>>>>>>      from Daniel.
>>>>>>>>>
>>>>>>>>> Signed-off-by: Niranjana Vishwanathapura 
>>>>>>> <niranjana.vishwanathapura@intel.com>
>>>>>>>>> ---
>>>>>>>>>   Documentation/gpu/rfc/i915_vm_bind.h | 399 
>>>>>>> +++++++++++++++++++++++++++
>>>>>>>>>   1 file changed, 399 insertions(+)
>>>>>>>>>   create mode 100644 Documentation/gpu/rfc/i915_vm_bind.h
>>>>>>>>>
>>>>>>>>> diff --git a/Documentation/gpu/rfc/i915_vm_bind.h 
>>>>>>> b/Documentation/gpu/rfc/i915_vm_bind.h
>>>>>>>>> new file mode 100644
>>>>>>>>> index 000000000000..589c0a009107
>>>>>>>>> --- /dev/null
>>>>>>>>> +++ b/Documentation/gpu/rfc/i915_vm_bind.h
>>>>>>>>> @@ -0,0 +1,399 @@
>>>>>>>>> +/* SPDX-License-Identifier: MIT */
>>>>>>>>> +/*
>>>>>>>>> + * Copyright © 2022 Intel Corporation
>>>>>>>>> + */
>>>>>>>>> +
>>>>>>>>> +/**
>>>>>>>>> + * DOC: I915_PARAM_HAS_VM_BIND
>>>>>>>>> + *
>>>>>>>>> + * VM_BIND feature availability.
>>>>>>>>> + * See typedef drm_i915_getparam_t param.
>>>>>>>>> + */
>>>>>>>>> +#define I915_PARAM_HAS_VM_BIND               57
>>>>>>>>> +
>>>>>>>>> +/**
>>>>>>>>> + * DOC: I915_VM_CREATE_FLAGS_USE_VM_BIND
>>>>>>>>> + *
>>>>>>>>> + * Flag to opt-in for VM_BIND mode of binding during VM creation.
>>>>>>>>> + * See struct drm_i915_gem_vm_control flags.
>>>>>>>>> + *
>>>>>>>>> + * A VM in VM_BIND mode will not support the older 
>>>>>>> execbuff mode of binding.
>>>>>>>>> + * In VM_BIND mode, execbuff ioctl will not accept any 
>>>>>>> execlist (ie., the
>>>>>>>>> + * &drm_i915_gem_execbuffer2.buffer_count must be 0).
>>>>>>>>> + * Also, &drm_i915_gem_execbuffer2.batch_start_offset and
>>>>>>>>> + * &drm_i915_gem_execbuffer2.batch_len must be 0.
>>>>>>>>> + * DRM_I915_GEM_EXECBUFFER_EXT_BATCH_ADDRESSES extension 
>>>>>>> must be provided
>>>>>>>>> + * to pass in the batch buffer addresses.
>>>>>>>>> + *
>>>>>>>>> + * Additionally, I915_EXEC_NO_RELOC, I915_EXEC_HANDLE_LUT and
>>>>>>>>> + * I915_EXEC_BATCH_FIRST of 
>>>>>>> &drm_i915_gem_execbuffer2.flags must be 0
>>>>>>>>> + * (not used) in VM_BIND mode. I915_EXEC_USE_EXTENSIONS 
>>>>>>> flag must always be
>>>>>>>>> + * set (See struct drm_i915_gem_execbuffer_ext_batch_addresses).
>>>>>>>>> + * The buffers_ptr, buffer_count, batch_start_offset and 
>>>>>>> batch_len fields
>>>>>>>>> + * of struct drm_i915_gem_execbuffer2 are also not used 
>>>>>>> and must be 0.
>>>>>>>>> + */
>>>>>>>>
>>>>>>>> From that description, it seems we have:
>>>>>>>>
>>>>>>>> struct drm_i915_gem_execbuffer2 {
>>>>>>>>         __u64 buffers_ptr;              -> must be 0 (new)
>>>>>>>>         __u32 buffer_count;             -> must be 0 (new)
>>>>>>>>         __u32 batch_start_offset;       -> must be 0 (new)
>>>>>>>>         __u32 batch_len;                -> must be 0 (new)
>>>>>>>>         __u32 DR1;                      -> must be 0 (old)
>>>>>>>>         __u32 DR4;                      -> must be 0 (old)
>>>>>>>>         __u32 num_cliprects; (fences)   -> must be 0 since 
>>>>>>> using extensions
>>>>>>>>         __u64 cliprects_ptr; (fences, extensions) -> 
>>>>>>> contains an actual pointer!
>>>>>>>>         __u64 flags;                    -> some flags must be 0 
>>>>>>>> (new)
>>>>>>>>         __u64 rsvd1; (context info)     -> repurposed field (old)
>>>>>>>>         __u64 rsvd2;                    -> unused
>>>>>>>> };
>>>>>>>>
>>>>>>>> Based on that, why can't we just get drm_i915_gem_execbuffer3 
>>>>>>>> instead
>>>>>>>> of adding even more complexity to an already abused interface? 
>>>>>>>> While
>>>>>>>>