* [RFC v2 0/2] drm/doc/rfc: i915 VM_BIND feature design + uapi
@ 2022-03-07 20:31 ` Niranjana Vishwanathapura
  0 siblings, 0 replies; 31+ messages in thread
From: Niranjana Vishwanathapura @ 2022-03-07 20:31 UTC (permalink / raw)
  To: intel-gfx, dri-devel; +Cc: daniel.vetter, thomas.hellstrom, chris.p.wilson

This is the i915 driver VM_BIND feature design RFC patch series along
with the required uapi definition and description of intended use cases.

v2: Updated design and uapi, more documentation.

Signed-off-by: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>

Niranjana Vishwanathapura (2):
  drm/doc/rfc: VM_BIND feature design document
  drm/doc/rfc: VM_BIND uapi definition

 Documentation/gpu/rfc/i915_vm_bind.h   | 176 +++++++++++++++++++++
 Documentation/gpu/rfc/i915_vm_bind.rst | 210 +++++++++++++++++++++++++
 Documentation/gpu/rfc/index.rst        |   4 +
 3 files changed, 390 insertions(+)
 create mode 100644 Documentation/gpu/rfc/i915_vm_bind.h
 create mode 100644 Documentation/gpu/rfc/i915_vm_bind.rst

-- 
2.21.0.rc0.32.g243a4c7e27


* [RFC v2 1/2] drm/doc/rfc: VM_BIND feature design document
  2022-03-07 20:31 ` [Intel-gfx] " Niranjana Vishwanathapura
@ 2022-03-07 20:31   ` Niranjana Vishwanathapura
  -1 siblings, 0 replies; 31+ messages in thread
From: Niranjana Vishwanathapura @ 2022-03-07 20:31 UTC (permalink / raw)
  To: intel-gfx, dri-devel; +Cc: daniel.vetter, thomas.hellstrom, chris.p.wilson

VM_BIND design document with description of intended use cases.

Signed-off-by: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>
---
 Documentation/gpu/rfc/i915_vm_bind.rst | 210 +++++++++++++++++++++++++
 Documentation/gpu/rfc/index.rst        |   4 +
 2 files changed, 214 insertions(+)
 create mode 100644 Documentation/gpu/rfc/i915_vm_bind.rst

diff --git a/Documentation/gpu/rfc/i915_vm_bind.rst b/Documentation/gpu/rfc/i915_vm_bind.rst
new file mode 100644
index 000000000000..cdc6bb25b942
--- /dev/null
+++ b/Documentation/gpu/rfc/i915_vm_bind.rst
@@ -0,0 +1,210 @@
+==========================================
+I915 VM_BIND feature design and use cases
+==========================================
+
+VM_BIND feature
+================
+The DRM_I915_GEM_VM_BIND/UNBIND ioctls allow UMDs to bind/unbind GEM buffer
+objects (BOs), or sections of a BO, at specified GPU virtual addresses in
+a specified address space (VM).
+
+These mappings (also referred to as persistent mappings) will be persistent
+across multiple GPU submissions (execbuff) issued by the UMD, without the
+user having to provide a list of all required mappings during each submission
+(as required by the older execbuff mode).
+
+The VM_BIND ioctl defers binding a mapping until the next execbuff submission
+that requires it, or binds it immediately if the I915_GEM_VM_BIND_IMMEDIATE
+flag is set (useful if the mapping is required for an already active context).
+
+The VM_BIND feature is advertised to users via I915_PARAM_HAS_VM_BIND.
+A user has to opt in to the VM_BIND mode of binding for an address space (VM)
+at VM creation time via the I915_VM_CREATE_FLAGS_USE_VM_BIND extension.
+A VM in VM_BIND mode will not support the older execbuff mode of binding.
+
+UMDs can still send BOs of these persistent mappings in the execlist of
+execbuff to specify BO dependencies (implicit fencing) and to use a BO as a
+batch, but those BOs should be mapped ahead of time via the vm_bind ioctl.
+
+VM_BIND features include:
+
+- Multiple Virtual Address (VA) mappings can map to the same physical pages
+  of an object (aliasing).
+- VA mapping can map to a partial section of the BO (partial binding).
+- Support for capturing persistent mappings in the dump upon GPU error.
+- The TLB is flushed upon unbind completion. Batching of TLB flushes in some
+  use cases will be helpful.
+- Asynchronous vm_bind and vm_unbind support.
+- VM_BIND uses a user/memory fence mechanism for signaling bind completion
+  and for signaling batch completion in long running contexts (explained
+  below).
+
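The partial-binding arithmetic implied by the list above can be sketched in a few lines of C. This is only an illustrative userspace-side model; the struct and helper here are hypothetical, not part of the proposed uapi:

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative only: a GPU VA range [start, start + length) backed by the
 * BO pages [offset, offset + length), mirroring the vm_bind fields. */
struct va_mapping {
	uint64_t start;  /* GPU VA where the mapping begins */
	uint64_t offset; /* offset into the BO (partial binding) */
	uint64_t length; /* length of the mapping */
};

/* Translate a GPU VA to the backing BO offset; returns -1 on a miss. */
static int64_t va_to_bo_offset(const struct va_mapping *m, uint64_t va)
{
	if (va < m->start || va >= m->start + m->length)
		return -1; /* VA not covered by this (partial) binding */
	return (int64_t)(m->offset + (va - m->start));
}
```

Aliasing in this model is simply two `va_mapping` entries with different `start` but the same `offset`.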
+VM_PRIVATE objects
+------------------
+By default, BOs can be mapped on multiple VMs and can also be dma-buf
+exported. Hence these BOs are referred to as Shared BOs.
+During each execbuff submission, the request fence must be added to the
+dma-resv fence list of all shared BOs mapped on the VM.
+
+The VM_BIND feature introduces an optimization where the user can create a BO
+which is private to a specified VM via the I915_GEM_CREATE_EXT_VM_PRIVATE flag
+during BO creation. Unlike Shared BOs, these VM private BOs can only be mapped
+on the VM they are private to and can't be dma-buf exported.
+All private BOs of a VM share the dma-resv object. Hence during each execbuff
+submission, only one dma-resv fence list needs to be updated. Thus the fast
+path (where required mappings are already bound) submission latency is O(1)
+w.r.t. the number of VM private BOs.
+
+VM_BIND locking hierarchy
+-------------------------
+The VM_BIND locking order is as follows.
+
+1) A vm_bind mutex will protect vm_bind lists. This lock is taken in vm_bind/
+   vm_unbind ioctl calls, in the execbuff path and while releasing the mapping.
+
+   In the future, when GPU page faults are supported, we can potentially use a
+   rwsem instead, so that multiple page fault handlers can take the read side
+   lock to look up the mapping and hence can run in parallel.
+
+2) The BO's dma-resv lock will protect i915_vma state and needs to be held
+   while binding a vma and while updating dma-resv fence list of a BO.
+   The private BOs of a VM will all share a dma-resv object.
+
+   This lock is held in the vm_bind call for immediate binding, in the
+   vm_unbind call for unbinding, and in the execbuff path for binding the
+   mapping and updating the dma-resv fence list of the BO.
+
+3) Spinlock/s to protect some of the VM's lists.
+
+We will also need support for bulk LRU movement of persistent mappings to
+avoid additional latencies in the execbuff path.
+
+GPU page faults
+----------------
+Both the older execbuff mode and the newer VM_BIND mode of binding will
+require using dma-fence to ensure residency.
+In the future, when GPU page faults are supported, no dma-fence usage will be
+required, as residency is purely managed by installing and removing or
+invalidating PTEs.
+
+
+User/Memory Fence
+==================
+The idea is to take a user-specified virtual address and install an interrupt
+handler to wake up the current task when the memory location passes the
+user-supplied filter.
+
+A user/memory fence is an <address, value> pair. To signal the user fence,
+the specified value is written at the specified virtual address, waking up
+the waiting process. A user can wait on a user fence with the
+gem_wait_user_fence ioctl.
+
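The <address, value> semantics can be modeled in plain C as follows. This is an illustrative sketch only; the helper names are hypothetical, and in the real design the store and wakeup are performed by the GPU or the KMD, not by userspace:

```c
#include <assert.h>
#include <stdint.h>

/* Signaler side of a <address, value> user fence: write the agreed value
 * at the agreed address. (The kernel additionally wakes up any waiters;
 * here we model only the memory side.) */
static void ufence_signal(uint64_t *addr, uint64_t value)
{
	*(volatile uint64_t *)addr = value;
}

/* Waiter side: compare the current memory contents against the expected
 * value, as the gem_wait_user_fence ioctl would before sleeping. */
static int ufence_is_signaled(const uint64_t *addr, uint64_t value)
{
	return *(volatile const uint64_t *)addr == value;
}
```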
+It also allows the user to emit their own MI_FLUSH/PIPE_CONTROL notify
+interrupt within their batches after updating the value, to have sub-batch
+precision on the wakeup. Each batch can signal a user fence to indicate
+the completion of the next-level batch. The completion of the very first
+level batch needs to be signaled by the command streamer. The user must
+provide the user/memory fence for this via the
+DRM_I915_GEM_EXECBUFFER_EXT_USER_FENCE extension of the execbuff ioctl, so
+that the KMD can set up the command streamer to signal it.
+
+A user/memory fence can also be supplied to the kernel driver to signal/wake
+up the user process after completion of an asynchronous operation.
+
+When the VM_BIND ioctl is provided with a user/memory fence via the
+I915_VM_BIND_EXT_USER_FENCE extension, it will be signaled upon the completion
+of the binding of that mapping. All async binds/unbinds are serialized, hence
+the signaling of a user/memory fence also indicates the completion of all
+previous binds/unbinds.
+
+This feature will be derived from the original work below:
+https://patchwork.freedesktop.org/patch/349417/
+
+
+VM_BIND use cases
+==================
+
+Long running Compute contexts
+------------------------------
+Usage of dma-fences expects that they complete in a reasonable amount of time.
+Compute, on the other hand, can be long running. Hence it is appropriate for
+compute to use user/memory fences, and dma-fence usage will be limited to
+in-kernel consumption only. This requires an execbuff uapi extension to pass
+in a user fence. Compute must opt in to this mechanism with the
+I915_CONTEXT_CREATE_FLAGS_LONG_RUNNING flag during context creation.
+
+The dma-fence based user interfaces, like the gem_wait ioctl, execbuff out
+fences and implicit dependency setting, are not allowed on long running
+contexts.
+
+Where GPU page faults are not available, the kernel driver, upon buffer
+invalidation, will initiate a suspend (preemption) of the long running context
+with a dma-fence attached to it. Upon completion of that suspend fence, it
+finishes the invalidation, revalidates the BO and then resumes the compute
+context. This is done by having a per-context fence (called the suspend fence)
+proxying as the i915_request fence. This suspend fence is enabled when there
+is a wait on it, which triggers the context preemption.
+
+This is much easier to support with VM_BIND compared to the current heavier
+execbuff path resource attachment.
+
+Low Latency Submission
+-----------------------
+Allows the compute UMD to directly submit GPU jobs instead of going through
+the execbuff ioctl. VM_BIND allows the map/unmap of the BOs required for
+directly submitted jobs.
+
+Debugger
+---------
+With the debug event interface, a user space process (the debugger) is able to
+keep track of and act upon resources created by another process (the debuggee)
+and attached to the GPU via the vm_bind interface.
+
+Mesa/Vulkan
+------------
+VM_BIND can potentially reduce the CPU overhead in Mesa, thus improving
+performance. For Vulkan, it should be straightforward to use VM_BIND.
+For Iris, implicit buffer tracking must be implemented before we can harness
+the benefits of VM_BIND. With increasing GPU hardware performance, reducing
+CPU overhead becomes more important.
+
+Page level hints settings
+--------------------------
+VM_BIND allows hints to be set per mapping instead of per BO.
+Possible hints include read-only, placement and atomicity.
+Sub-BO level placement hints will be even more relevant with
+upcoming GPU on-demand page fault support.
+
+Page level Cache/CLOS settings
+-------------------------------
+VM_BIND allows cache/CLOS settings per mapping instead of per BO.
+
+Shared Virtual Memory (SVM) support
+------------------------------------
+VM_BIND interface can be used to map system memory directly (without gem BO
+abstraction) using the HMM interface.
+
+
+Broader i915 cleanups
+=====================
+Supporting this whole new vm_bind mode of binding, which comes with its own
+use cases and locking requirements, requires proper integration with the
+existing i915 driver. This calls for some broader i915 driver
+cleanups/simplifications for maintainability of the driver going forward.
+Here are a few things that have been identified and are being looked into.
+
+- Make pagetable allocations evictable and manage them similar to VM_BIND
+  mapped objects. Page table pages are similar to persistent mappings of a
+  VM (the differences here are that the page table pages will not
+  have an i915_vma structure, and after swapping pages back in, the parent
+  page link needs to be updated).
+- Remove the vma lookup cache (eb->gem_context->handles_vma). The VM_BIND
+  feature does not use it, and the complexity it brings in is probably more
+  than the performance advantage we get in the legacy execbuff case.
+- Remove vma->open_count counting.
+- Remove i915_vma active reference tracking. Instead, use the underlying BO's
+  dma-resv fence list to determine if an i915_vma is active or not.
+
+These can be worked upon after the initial vm_bind support is added.
+
+
+UAPI
+=====
+The uapi definition can be found here:
+
+.. kernel-doc:: Documentation/gpu/rfc/i915_vm_bind.h
diff --git a/Documentation/gpu/rfc/index.rst b/Documentation/gpu/rfc/index.rst
index 91e93a705230..7d10c36b268d 100644
--- a/Documentation/gpu/rfc/index.rst
+++ b/Documentation/gpu/rfc/index.rst
@@ -23,3 +23,7 @@ host such documentation:
 .. toctree::
 
     i915_scheduler.rst
+
+.. toctree::
+
+    i915_vm_bind.rst
-- 
2.21.0.rc0.32.g243a4c7e27


* [RFC v2 2/2] drm/doc/rfc: VM_BIND uapi definition
  2022-03-07 20:31 ` [Intel-gfx] " Niranjana Vishwanathapura
@ 2022-03-07 20:31   ` Niranjana Vishwanathapura
  -1 siblings, 0 replies; 31+ messages in thread
From: Niranjana Vishwanathapura @ 2022-03-07 20:31 UTC (permalink / raw)
  To: intel-gfx, dri-devel; +Cc: daniel.vetter, thomas.hellstrom, chris.p.wilson

VM_BIND and related uapi definitions

Signed-off-by: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>
---
 Documentation/gpu/rfc/i915_vm_bind.h | 176 +++++++++++++++++++++++++++
 1 file changed, 176 insertions(+)
 create mode 100644 Documentation/gpu/rfc/i915_vm_bind.h

diff --git a/Documentation/gpu/rfc/i915_vm_bind.h b/Documentation/gpu/rfc/i915_vm_bind.h
new file mode 100644
index 000000000000..80f00ee6c8a1
--- /dev/null
+++ b/Documentation/gpu/rfc/i915_vm_bind.h
@@ -0,0 +1,176 @@
+/* SPDX-License-Identifier: MIT */
+/*
+ * Copyright © 2022 Intel Corporation
+ */
+
+/* VM_BIND feature availability through drm_i915_getparam */
+#define I915_PARAM_HAS_VM_BIND		57
+
+/* VM_BIND related ioctls */
+#define DRM_I915_GEM_VM_BIND		0x3d
+#define DRM_I915_GEM_VM_UNBIND		0x3e
+#define DRM_I915_GEM_WAIT_USER_FENCE	0x3f
+
+#define DRM_IOCTL_I915_GEM_VM_BIND		DRM_IOWR(DRM_COMMAND_BASE + DRM_I915_GEM_VM_BIND, struct drm_i915_gem_vm_bind)
+#define DRM_IOCTL_I915_GEM_VM_UNBIND		DRM_IOWR(DRM_COMMAND_BASE + DRM_I915_GEM_VM_UNBIND, struct drm_i915_gem_vm_bind)
+#define DRM_IOCTL_I915_GEM_WAIT_USER_FENCE	DRM_IOWR(DRM_COMMAND_BASE + DRM_I915_GEM_WAIT_USER_FENCE, struct drm_i915_gem_wait_user_fence)
+
+/**
+ * struct drm_i915_gem_vm_bind - VA to object/buffer mapping to [un]bind.
+ */
+struct drm_i915_gem_vm_bind {
+	/** vm to [un]bind */
+	__u32 vm_id;
+
+	/**
+	 * BO handle or file descriptor.
+	 * 'fd' value of -1 is reserved for system pages (SVM)
+	 */
+	union {
+		__u32 handle; /* For unbind, it is reserved and must be 0 */
+		__s32 fd;
+	};
+
+	/** VA start to [un]bind */
+	__u64 start;
+
+	/** Offset in object to [un]bind */
+	__u64 offset;
+
+	/** VA length to [un]bind */
+	__u64 length;
+
+	/** Flags */
+	__u64 flags;
+	/** Bind the mapping immediately instead of during next submission */
+#define I915_GEM_VM_BIND_IMMEDIATE   (1 << 0)
+	/** Read-only mapping */
+#define I915_GEM_VM_BIND_READONLY    (1 << 1)
+	/** Capture this mapping in the dump upon GPU error */
+#define I915_GEM_VM_BIND_CAPTURE     (1 << 2)
+
+	/** Zero-terminated chain of extensions */
+	__u64 extensions;
+};
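A hedged sketch of how a UMD might fill this struct for a read-only, immediate, single-page bind. The struct is mirrored locally with stdint types so the fragment stays self-contained; the actual DRM_IOCTL_I915_GEM_VM_BIND ioctl call (and the real uapi header) is omitted, and the helper name is illustrative:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Local mirror of the proposed struct drm_i915_gem_vm_bind, for
 * illustration only; real users include the i915 uapi header. */
struct vm_bind_args {
	uint32_t vm_id;
	union {
		uint32_t handle; /* BO handle (0 and reserved for unbind) */
		int32_t fd;      /* -1 reserved for system pages (SVM) */
	};
	uint64_t start;      /* VA start to bind */
	uint64_t offset;     /* offset in object */
	uint64_t length;     /* VA length to bind */
	uint64_t flags;
	uint64_t extensions; /* zero-terminated extension chain */
};

#define BIND_IMMEDIATE (1 << 0) /* mirrors I915_GEM_VM_BIND_IMMEDIATE */
#define BIND_READONLY  (1 << 1) /* mirrors I915_GEM_VM_BIND_READONLY */

/* Bind one page of BO 'handle' at GPU VA 'va', read-only, immediately. */
static struct vm_bind_args make_ro_bind(uint32_t vm_id, uint32_t handle,
					uint64_t va)
{
	struct vm_bind_args a;

	memset(&a, 0, sizeof(a)); /* unused/extension fields must be zero */
	a.vm_id = vm_id;
	a.handle = handle;
	a.start = va;
	a.offset = 0;
	a.length = 4096;
	a.flags = BIND_IMMEDIATE | BIND_READONLY;
	return a;
}
```

The filled struct would then be passed to ioctl(fd, DRM_IOCTL_I915_GEM_VM_BIND, &a) on a VM created with I915_VM_CREATE_FLAGS_USE_VM_BIND.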
+
+/**
+ * struct drm_i915_vm_bind_ext_user_fence - Bind completion signaling extension.
+ */
+struct drm_i915_vm_bind_ext_user_fence {
+#define I915_VM_BIND_EXT_USER_FENCE	0
+	/** @base: Extension link. See struct i915_user_extension. */
+	struct i915_user_extension base;
+
+	/** User/Memory fence qword alinged process virtual address */
+	__u64 addr;
+
+	/** User/Memory fence value to be written after bind completion */
+	__u64 val;
+
+	/** Reserved for future extensions */
+	__u64 rsvd;
+};
+
+/**
+ * struct drm_i915_gem_execbuffer_ext_user_fence - First level batch completion
+ * signaling extension.
+ *
+ * This extension allows the user to attach a user fence (an <addr, value>
+ * pair) to an execbuf, to be signaled by the command streamer after the
+ * completion of the first level batch, by writing the <value> at the
+ * specified <addr> and triggering an interrupt.
+ * The user can either poll for this user fence to signal, or wait on it
+ * with the i915_gem_wait_user_fence ioctl.
+ * This is very useful for long running contexts, where waiting on a dma-fence
+ * by the user (like the i915_gem_wait ioctl) is not supported.
+ */
+struct drm_i915_gem_execbuffer_ext_user_fence {
+#define DRM_I915_GEM_EXECBUFFER_EXT_USER_FENCE		0
+	/** @base: Extension link. See struct i915_user_extension. */
+	struct i915_user_extension base;
+
+	/**
+	 * User/Memory fence qword aligned GPU virtual address.
+	 * Address has to be a valid GPU virtual address at the time of
+	 * 1st level batch completion.
+	 */
+	__u64 addr;
+
+	/**
+	 * User/Memory fence Value to be written to above address
+	 * after 1st level batch completes.
+	 */
+	__u64 value;
+
+	/** Reserved for future extensions */
+	__u64 rsvd;
+};
+
+struct drm_i915_gem_vm_control {
+/** Flag to opt-in for VM_BIND mode of binding during VM creation */
+#define I915_VM_CREATE_FLAGS_USE_VM_BIND	(1 << 0)
+};
+
+
+struct drm_i915_gem_create_ext {
+/** Extension to make the object private to a specified VM */
+#define I915_GEM_CREATE_EXT_VM_PRIVATE		2
+};
+
+
+struct prelim_drm_i915_gem_context_create_ext {
+/** Flag to declare context as long running */
+#define I915_CONTEXT_CREATE_FLAGS_LONG_RUNNING   (1u << 2)
+};
+
+/**
+ * struct drm_i915_gem_wait_user_fence
+ *
+ * Wait on a user/memory fence. A user/memory fence can be woken up either by,
+ *    1. The GPU context indicated by 'ctx_id', or,
+ *    2. The kernel driver async worker upon I915_UFENCE_WAIT_SOFT.
+ *       'ctx_id' is ignored when this flag is set.
+ *
+ * Wakeup when the below condition is true:
+ * (*addr & MASK) OP (VALUE & MASK)
+ *
+ */
+struct drm_i915_gem_wait_user_fence {
+	/** Zero-terminated chain of extensions */
+	__u64 extensions;
+
+	/** User/Memory fence address */
+	__u64 addr;
+
+	/** Id of the Context which will signal the fence. */
+	__u32 ctx_id;
+
+	/** Wakeup condition operator */
+	__u16 op;
+#define I915_UFENCE_WAIT_EQ      0
+#define I915_UFENCE_WAIT_NEQ     1
+#define I915_UFENCE_WAIT_GT      2
+#define I915_UFENCE_WAIT_GTE     3
+#define I915_UFENCE_WAIT_LT      4
+#define I915_UFENCE_WAIT_LTE     5
+#define I915_UFENCE_WAIT_BEFORE  6
+#define I915_UFENCE_WAIT_AFTER   7
+
+	/** Flags */
+	__u16 flags;
+#define I915_UFENCE_WAIT_SOFT    0x1
+#define I915_UFENCE_WAIT_ABSTIME 0x2
+
+	/** Wakeup value */
+	__u64 value;
+
+	/** Wakeup mask */
+	__u64 mask;
+#define I915_UFENCE_WAIT_U8     0xffu
+#define I915_UFENCE_WAIT_U16    0xffffu
+#define I915_UFENCE_WAIT_U32    0xfffffffful
+#define I915_UFENCE_WAIT_U64    0xffffffffffffffffull
+
+	/** Timeout */
+	__s64 timeout;
+};
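The wakeup condition documented above, (*addr & MASK) OP (VALUE & MASK), can be modeled directly. This sketch mirrors the EQ..LTE op values from the header; the BEFORE/AFTER wraparound comparisons are omitted, and the function name is illustrative:

```c
#include <assert.h>
#include <stdint.h>

/* Mirror of the wait op values above: EQ=0, NEQ=1, GT=2, GTE=3, LT=4, LTE=5. */
enum { UF_EQ, UF_NEQ, UF_GT, UF_GTE, UF_LT, UF_LTE };
#define UF_MASK_U32 0xffffffffull /* mirrors I915_UFENCE_WAIT_U32 */

/* Evaluate the documented wakeup condition on a snapshot of *addr. */
static int ufence_condition(uint64_t mem, uint16_t op, uint64_t value,
			    uint64_t mask)
{
	uint64_t a = mem & mask, b = value & mask;

	switch (op) {
	case UF_EQ:  return a == b;
	case UF_NEQ: return a != b;
	case UF_GT:  return a > b;
	case UF_GTE: return a >= b;
	case UF_LT:  return a < b;
	case UF_LTE: return a <= b;
	default:     return 0; /* BEFORE/AFTER wraparound compares omitted */
	}
}
```

The mask selects the operand width, so a U32 wait ignores the upper qword bits of both the memory contents and the reference value.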
-- 
2.21.0.rc0.32.g243a4c7e27


^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [Intel-gfx] [RFC v2 2/2] drm/doc/rfc: VM_BIND uapi definition
@ 2022-03-07 20:31   ` Niranjana Vishwanathapura
  0 siblings, 0 replies; 31+ messages in thread
From: Niranjana Vishwanathapura @ 2022-03-07 20:31 UTC (permalink / raw)
  To: intel-gfx, dri-devel; +Cc: daniel.vetter, thomas.hellstrom, chris.p.wilson

VM_BIND und related uapi definitions

Signed-off-by: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>
---
 Documentation/gpu/rfc/i915_vm_bind.h | 176 +++++++++++++++++++++++++++
 1 file changed, 176 insertions(+)
 create mode 100644 Documentation/gpu/rfc/i915_vm_bind.h

diff --git a/Documentation/gpu/rfc/i915_vm_bind.h b/Documentation/gpu/rfc/i915_vm_bind.h
new file mode 100644
index 000000000000..80f00ee6c8a1
--- /dev/null
+++ b/Documentation/gpu/rfc/i915_vm_bind.h
@@ -0,0 +1,176 @@
+/* SPDX-License-Identifier: MIT */
+/*
+ * Copyright © 2022 Intel Corporation
+ */
+
+/* VM_BIND feature availability through drm_i915_getparam */
+#define I915_PARAM_HAS_VM_BIND		57
+
+/* VM_BIND related ioctls */
+#define DRM_I915_GEM_VM_BIND		0x3d
+#define DRM_I915_GEM_VM_UNBIND		0x3e
+#define DRM_I915_GEM_WAIT_USER_FENCE	0x3f
+
+#define DRM_IOCTL_I915_GEM_VM_BIND		DRM_IOWR(DRM_COMMAND_BASE + DRM_I915_GEM_VM_BIND, struct drm_i915_gem_vm_bind)
+#define DRM_IOCTL_I915_GEM_VM_UNBIND		DRM_IOWR(DRM_COMMAND_BASE + DRM_I915_GEM_VM_UNBIND, struct drm_i915_gem_vm_bind)
+#define DRM_IOCTL_I915_GEM_WAIT_USER_FENCE	DRM_IOWR(DRM_COMMAND_BASE + DRM_I915_GEM_WAIT_USER_FENCE, struct drm_i915_gem_wait_user_fence)
+
+/**
+ * struct drm_i915_gem_vm_bind - VA to object/buffer mapping to [un]bind.
+ */
+struct drm_i915_gem_vm_bind {
+	/** vm to [un]bind */
+	__u32 vm_id;
+
+	/**
+	 * BO handle or file descriptor.
+	 * 'fd' value of -1 is reserved for system pages (SVM)
+	 */
+	union {
+		__u32 handle; /* For unbind, it is reserved and must be 0 */
+		__s32 fd;
+	};
+
+	/** VA start to [un]bind */
+	__u64 start;
+
+	/** Offset in object to [un]bind */
+	__u64 offset;
+
+	/** VA length to [un]bind */
+	__u64 length;
+
+	/** Flags */
+	__u64 flags;
+	/** Bind the mapping immediately instead of during next submission */
+#define I915_GEM_VM_BIND_IMMEDIATE   (1 << 0)
+	/** Read-only mapping */
+#define I915_GEM_VM_BIND_READONLY    (1 << 1)
+	/** Capture this mapping in the dump upon GPU error */
+#define I915_GEM_VM_BIND_CAPTURE     (1 << 2)
+
+	/** Zero-terminated chain of extensions */
+	__u64 extensions;
+};
+
+/**
+ * struct drm_i915_vm_bind_ext_user_fence - Bind completion signaling extension.
+ */
+struct drm_i915_vm_bind_ext_user_fence {
+#define I915_VM_BIND_EXT_USER_FENCE	0
+	/** @base: Extension link. See struct i915_user_extension. */
+	struct i915_user_extension base;
+
+	/** User/Memory fence qword aligned process virtual address */
+	__u64 addr;
+
+	/** User/Memory fence value to be written after bind completion */
+	__u64 val;
+
+	/** Reserved for future extensions */
+	__u64 rsvd;
+};
+
+/**
+ * struct drm_i915_gem_execbuffer_ext_user_fence - First level batch completion
+ * signaling extension.
+ *
+ * This extension allows the user to attach a user fence (<addr, value> pair)
+ * to an execbuf, to be signaled by the command streamer after the completion
+ * of the 1st level batch, by writing the <value> at the specified <addr> and
+ * triggering an interrupt.
+ * The user can either poll for this user fence to signal or wait on it with
+ * the i915_gem_wait_user_fence ioctl.
+ * This is very useful for long running contexts, where waiting on a dma-fence
+ * by the user (like the i915_gem_wait ioctl) is not supported.
+ */
+struct drm_i915_gem_execbuffer_ext_user_fence {
+#define DRM_I915_GEM_EXECBUFFER_EXT_USER_FENCE		0
+	/** @base: Extension link. See struct i915_user_extension. */
+	struct i915_user_extension base;
+
+	/**
+	 * User/Memory fence qword aligned GPU virtual address.
+	 * Address has to be a valid GPU virtual address at the time of
+	 * 1st level batch completion.
+	 */
+	__u64 addr;
+
+	/**
+	 * User/Memory fence Value to be written to above address
+	 * after 1st level batch completes.
+	 */
+	__u64 value;
+
+	/** Reserved for future extensions */
+	__u64 rsvd;
+};
+
+struct drm_i915_gem_vm_control {
+/** Flag to opt-in for VM_BIND mode of binding during VM creation */
+#define I915_VM_CREATE_FLAGS_USE_VM_BIND	(1 << 0)
+};
+
+
+struct drm_i915_gem_create_ext {
+/** Extension to make the object private to a specified VM */
+#define I915_GEM_CREATE_EXT_VM_PRIVATE		2
+};
+
+
+struct prelim_drm_i915_gem_context_create_ext {
+/** Flag to declare context as long running */
+#define I915_CONTEXT_CREATE_FLAGS_LONG_RUNNING   (1u << 2)
+};
+
+/**
+ * struct drm_i915_gem_wait_user_fence
+ *
+ * Wait on a user/memory fence. The user/memory fence can be woken up either by:
+ *    1. GPU context indicated by 'ctx_id', or,
+ *    2. Kernel driver async worker upon I915_UFENCE_WAIT_SOFT.
+ *       'ctx_id' is ignored when this flag is set.
+ *
+ * Wakeup when the below condition is true:
+ * (*addr & MASK) OP (VALUE & MASK)
+ *
+ */
+struct drm_i915_gem_wait_user_fence {
+	/** Zero-terminated chain of extensions */
+	__u64 extensions;
+
+	/** User/Memory fence address */
+	__u64 addr;
+
+	/** Id of the Context which will signal the fence. */
+	__u32 ctx_id;
+
+	/** Wakeup condition operator */
+	__u16 op;
+#define I915_UFENCE_WAIT_EQ      0
+#define I915_UFENCE_WAIT_NEQ     1
+#define I915_UFENCE_WAIT_GT      2
+#define I915_UFENCE_WAIT_GTE     3
+#define I915_UFENCE_WAIT_LT      4
+#define I915_UFENCE_WAIT_LTE     5
+#define I915_UFENCE_WAIT_BEFORE  6
+#define I915_UFENCE_WAIT_AFTER   7
+
+	/** Flags */
+	__u16 flags;
+#define I915_UFENCE_WAIT_SOFT    0x1
+#define I915_UFENCE_WAIT_ABSTIME 0x2
+
+	/** Wakeup value */
+	__u64 value;
+
+	/** Wakeup mask */
+	__u64 mask;
+#define I915_UFENCE_WAIT_U8     0xffu
+#define I915_UFENCE_WAIT_U16    0xffffu
+#define I915_UFENCE_WAIT_U32    0xfffffffful
+#define I915_UFENCE_WAIT_U64    0xffffffffffffffffull
+
+	/** Timeout */
+	__s64 timeout;
+};
-- 
2.21.0.rc0.32.g243a4c7e27
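[Editorial aside: a hedged userspace sketch of the sanity checks a UMD might
perform before issuing DRM_IOCTL_I915_GEM_VM_BIND. The struct is a local
stand-in mirroring the draft uapi above; the 4KiB page-alignment rule and the
helper itself are illustrative assumptions, not part of the proposal.]

```c
#include <assert.h>
#include <stdint.h>

/* Local mirror of the RFC's bind flags, for illustration only. */
#define I915_GEM_VM_BIND_IMMEDIATE (1 << 0)
#define I915_GEM_VM_BIND_READONLY  (1 << 1)
#define I915_GEM_VM_BIND_CAPTURE   (1 << 2)

/* Hypothetical stand-in for struct drm_i915_gem_vm_bind. */
struct vm_bind_req {
	uint32_t vm_id;
	uint32_t handle;  /* BO handle; 0 (reserved) on unbind */
	uint64_t start;   /* GPU VA to bind at */
	uint64_t offset;  /* offset into the BO */
	uint64_t length;  /* length of the mapping */
	uint64_t flags;
};

/* Page-granularity validation a UMD could run before the ioctl
 * (assumes 4KiB pages; the real driver may impose other rules). */
static int vm_bind_req_valid(const struct vm_bind_req *r)
{
	const uint64_t page = 4096;

	if (r->length == 0)
		return 0;
	if ((r->start | r->offset | r->length) & (page - 1))
		return 0;  /* misaligned VA, offset or length */
	if (r->start + r->length < r->start)
		return 0;  /* VA range wraps around */
	return 1;
}
```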



* [Intel-gfx] ✗ Fi.CI.CHECKPATCH: warning for drm/doc/rfc: i915 VM_BIND feature design + uapi (rev2)
  2022-03-07 20:31 ` [Intel-gfx] " Niranjana Vishwanathapura
                   ` (2 preceding siblings ...)
  (?)
@ 2022-03-07 20:38 ` Patchwork
  -1 siblings, 0 replies; 31+ messages in thread
From: Patchwork @ 2022-03-07 20:38 UTC (permalink / raw)
  To: Niranjana Vishwanathapura; +Cc: intel-gfx

== Series Details ==

Series: drm/doc/rfc: i915 VM_BIND feature design + uapi (rev2)
URL   : https://patchwork.freedesktop.org/series/93447/
State : warning

== Summary ==

$ dim checkpatch origin/drm-tip
a40e87e2a2f3 drm/doc/rfc: VM_BIND feature design document
-:11: WARNING:FILE_PATH_CHANGES: added, moved or deleted file(s), does MAINTAINERS need updating?
#11: 
new file mode 100644

-:16: WARNING:SPDX_LICENSE_TAG: Missing or malformed SPDX-License-Identifier tag in line 1
#16: FILE: Documentation/gpu/rfc/i915_vm_bind.rst:1:
+==========================================

-:112: WARNING:TYPO_SPELLING: 'an user' may be misspelled - perhaps 'a user'?
#112: FILE: Documentation/gpu/rfc/i915_vm_bind.rst:97:
+wakeup the waiting process. User can wait on an user fence with the
                                              ^^^^^^^

-:117: WARNING:TYPO_SPELLING: 'an user' may be misspelled - perhaps 'a user'?
#117: FILE: Documentation/gpu/rfc/i915_vm_bind.rst:102:
+precision on the wakeup. Each batch can signal an user fence to indicate
                                                ^^^^^^^

total: 0 errors, 4 warnings, 0 checks, 217 lines checked
95969abb7e7c drm/doc/rfc: VM_BIND uapi definition
-:11: WARNING:FILE_PATH_CHANGES: added, moved or deleted file(s), does MAINTAINERS need updating?
#11: 
new file mode 100644

-:29: WARNING:LONG_LINE: line length of 126 exceeds 100 columns
#29: FILE: Documentation/gpu/rfc/i915_vm_bind.h:14:
+#define DRM_IOCTL_I915_GEM_VM_BIND		DRM_IOWR(DRM_COMMAND_BASE + DRM_I915_GEM_VM_BIND, struct drm_i915_gem_vm_bind)

-:30: WARNING:LONG_LINE: line length of 128 exceeds 100 columns
#30: FILE: Documentation/gpu/rfc/i915_vm_bind.h:15:
+#define DRM_IOCTL_I915_GEM_VM_UNBIND		DRM_IOWR(DRM_COMMAND_BASE + DRM_I915_GEM_VM_UNBIND, struct drm_i915_gem_vm_bind)

-:31: WARNING:LONG_LINE: line length of 142 exceeds 100 columns
#31: FILE: Documentation/gpu/rfc/i915_vm_bind.h:16:
+#define DRM_IOCTL_I915_GEM_WAIT_USER_FENCE	DRM_IOWR(DRM_COMMAND_BASE + DRM_I915_GEM_WAIT_USER_FENCE, struct drm_i915_gem_wait_user_fence)

-:129: CHECK:LINE_SPACING: Please don't use multiple blank lines
#129: FILE: Documentation/gpu/rfc/i915_vm_bind.h:114:
+
+

-:135: CHECK:LINE_SPACING: Please don't use multiple blank lines
#135: FILE: Documentation/gpu/rfc/i915_vm_bind.h:120:
+
+

total: 0 errors, 4 warnings, 2 checks, 176 lines checked




* [Intel-gfx] ✗ Fi.CI.DOCS: warning for drm/doc/rfc: i915 VM_BIND feature design + uapi (rev2)
  2022-03-07 20:31 ` [Intel-gfx] " Niranjana Vishwanathapura
                   ` (3 preceding siblings ...)
  (?)
@ 2022-03-07 20:43 ` Patchwork
  -1 siblings, 0 replies; 31+ messages in thread
From: Patchwork @ 2022-03-07 20:43 UTC (permalink / raw)
  To: Niranjana Vishwanathapura; +Cc: intel-gfx

== Series Details ==

Series: drm/doc/rfc: i915 VM_BIND feature design + uapi (rev2)
URL   : https://patchwork.freedesktop.org/series/93447/
State : warning

== Summary ==

$ make htmldocs 2>&1 > /dev/null | grep i915
/home/cidrm/kernel/Documentation/gpu/rfc/i915_vm_bind.rst:31: WARNING: Unexpected indentation.
/home/cidrm/kernel/Documentation/gpu/rfc/i915_vm_bind.rst:32: WARNING: Block quote ends without a blank line; unexpected unindent.




* [Intel-gfx] ✓ Fi.CI.BAT: success for drm/doc/rfc: i915 VM_BIND feature design + uapi (rev2)
  2022-03-07 20:31 ` [Intel-gfx] " Niranjana Vishwanathapura
                   ` (4 preceding siblings ...)
  (?)
@ 2022-03-08 12:13 ` Patchwork
  -1 siblings, 0 replies; 31+ messages in thread
From: Patchwork @ 2022-03-08 12:13 UTC (permalink / raw)
  To: Niranjana Vishwanathapura; +Cc: intel-gfx


== Series Details ==

Series: drm/doc/rfc: i915 VM_BIND feature design + uapi (rev2)
URL   : https://patchwork.freedesktop.org/series/93447/
State : success

== Summary ==

CI Bug Log - changes from CI_DRM_11334 -> Patchwork_22505
====================================================

Summary
-------

  **SUCCESS**

  No regressions found.

  External URL: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22505/index.html

Participating hosts (44 -> 41)
------------------------------

  Additional (3): fi-skl-guc bat-jsl-2 fi-bsw-nick 
  Missing    (6): fi-kbl-soraka bat-dg1-5 fi-tgl-1115g4 fi-bsw-cyan bat-rpls-2 fi-bdw-samus 

Possible new issues
-------------------

  Here are the unknown changes that may have been introduced in Patchwork_22505:

### IGT changes ###

#### Suppressed ####

  The following results come from untrusted machines, tests, or statuses.
  They do not affect the overall result.

  * igt@i915_selftest@live@gt_lrc:
    - {bat-dg2-9}:        NOTRUN -> [INCOMPLETE][1]
   [1]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22505/bat-dg2-9/igt@i915_selftest@live@gt_lrc.html

  * igt@kms_frontbuffer_tracking@basic:
    - {bat-dg2-9}:        [DMESG-FAIL][2] -> [FAIL][3]
   [2]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11334/bat-dg2-9/igt@kms_frontbuffer_tracking@basic.html
   [3]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22505/bat-dg2-9/igt@kms_frontbuffer_tracking@basic.html

  * igt@kms_pipe_crc_basic@suspend-read-crc-pipe-a:
    - {bat-dg2-9}:        NOTRUN -> [DMESG-WARN][4]
   [4]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22505/bat-dg2-9/igt@kms_pipe_crc_basic@suspend-read-crc-pipe-a.html

  * igt@prime_vgem@basic-write:
    - {bat-dg2-9}:        NOTRUN -> [SKIP][5] +12 similar issues
   [5]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22505/bat-dg2-9/igt@prime_vgem@basic-write.html

  
Known issues
------------

  Here are the changes found in Patchwork_22505 that come from known issues:

### IGT changes ###

#### Issues hit ####

  * igt@gem_lmem_swapping@random-engines:
    - fi-skl-guc:         NOTRUN -> [SKIP][6] ([fdo#109271] / [i915#4613]) +3 similar issues
   [6]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22505/fi-skl-guc/igt@gem_lmem_swapping@random-engines.html

  * igt@gem_lmem_swapping@verify-random:
    - fi-bsw-nick:        NOTRUN -> [SKIP][7] ([fdo#109271]) +67 similar issues
   [7]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22505/fi-bsw-nick/igt@gem_lmem_swapping@verify-random.html

  * igt@i915_pm_rpm@module-reload:
    - fi-kbl-guc:         [PASS][8] -> [SKIP][9] ([fdo#109271])
   [8]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11334/fi-kbl-guc/igt@i915_pm_rpm@module-reload.html
   [9]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22505/fi-kbl-guc/igt@i915_pm_rpm@module-reload.html

  * igt@i915_selftest@live@hangcheck:
    - fi-bdw-5557u:       NOTRUN -> [INCOMPLETE][10] ([i915#3921])
   [10]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22505/fi-bdw-5557u/igt@i915_selftest@live@hangcheck.html

  * igt@kms_chamelium@common-hpd-after-suspend:
    - fi-skl-guc:         NOTRUN -> [SKIP][11] ([fdo#109271] / [fdo#111827]) +8 similar issues
   [11]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22505/fi-skl-guc/igt@kms_chamelium@common-hpd-after-suspend.html

  * igt@kms_chamelium@vga-edid-read:
    - fi-bdw-5557u:       NOTRUN -> [SKIP][12] ([fdo#109271] / [fdo#111827]) +8 similar issues
   [12]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22505/fi-bdw-5557u/igt@kms_chamelium@vga-edid-read.html
    - fi-bsw-nick:        NOTRUN -> [SKIP][13] ([fdo#109271] / [fdo#111827]) +8 similar issues
   [13]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22505/fi-bsw-nick/igt@kms_chamelium@vga-edid-read.html

  * igt@kms_pipe_crc_basic@compare-crc-sanitycheck-pipe-d:
    - fi-skl-guc:         NOTRUN -> [SKIP][14] ([fdo#109271] / [i915#533])
   [14]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22505/fi-skl-guc/igt@kms_pipe_crc_basic@compare-crc-sanitycheck-pipe-d.html

  * igt@kms_psr@cursor_plane_move:
    - fi-bdw-5557u:       NOTRUN -> [SKIP][15] ([fdo#109271]) +13 similar issues
   [15]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22505/fi-bdw-5557u/igt@kms_psr@cursor_plane_move.html

  * igt@kms_psr@primary_mmap_gtt:
    - fi-skl-guc:         NOTRUN -> [SKIP][16] ([fdo#109271]) +28 similar issues
   [16]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22505/fi-skl-guc/igt@kms_psr@primary_mmap_gtt.html

  
#### Possible fixes ####

  * igt@gem_eio@unwedge-stress:
    - {shard-tglu}:       [TIMEOUT][17] ([i915#3063] / [i915#3648]) -> [PASS][18]
   [17]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11334/shard-tglu-4/igt@gem_eio@unwedge-stress.html
   [18]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22505/shard-tglu-2/igt@gem_eio@unwedge-stress.html

  * igt@gem_exec_suspend@basic-s3@smem:
    - fi-bdw-5557u:       [INCOMPLETE][19] ([i915#146]) -> [PASS][20]
   [19]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11334/fi-bdw-5557u/igt@gem_exec_suspend@basic-s3@smem.html
   [20]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22505/fi-bdw-5557u/igt@gem_exec_suspend@basic-s3@smem.html

  * igt@kms_big_fb@x-tiled-32bpp-rotate-180:
    - {shard-dg1}:        [DMESG-WARN][21] ([i915#3891] / [i915#4935]) -> [PASS][22]
   [21]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11334/shard-dg1-12/igt@kms_big_fb@x-tiled-32bpp-rotate-180.html
   [22]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22505/shard-dg1-19/igt@kms_big_fb@x-tiled-32bpp-rotate-180.html
    - {shard-tglu}:       [DMESG-WARN][23] ([i915#402]) -> [PASS][24]
   [23]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11334/shard-tglu-5/igt@kms_big_fb@x-tiled-32bpp-rotate-180.html
   [24]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22505/shard-tglu-1/igt@kms_big_fb@x-tiled-32bpp-rotate-180.html

  * igt@kms_busy@basic@modeset:
    - {bat-adlp-6}:       [DMESG-WARN][25] ([i915#3576]) -> ([PASS][26], [PASS][27])
   [25]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11334/bat-adlp-6/igt@kms_busy@basic@modeset.html
   [26]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22505/bat-adlp-6/igt@kms_busy@basic@modeset.html
   [27]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22505/bat-adlp-6/igt@kms_busy@basic@modeset.html

  * igt@kms_flip@basic-flip-vs-wf_vblank@b-dsi1:
    - {fi-tgl-dsi}:       [FAIL][28] ([i915#2122]) -> [PASS][29]
   [28]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11334/fi-tgl-dsi/igt@kms_flip@basic-flip-vs-wf_vblank@b-dsi1.html
   [29]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22505/fi-tgl-dsi/igt@kms_flip@basic-flip-vs-wf_vblank@b-dsi1.html

  
#### Warnings ####

  * igt@i915_selftest@live@hangcheck:
    - bat-dg1-6:          [DMESG-FAIL][30] ([i915#4957]) -> [DMESG-FAIL][31] ([i915#4494] / [i915#4957])
   [30]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11334/bat-dg1-6/igt@i915_selftest@live@hangcheck.html
   [31]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22505/bat-dg1-6/igt@i915_selftest@live@hangcheck.html

  
  {name}: This element is suppressed. This means it is ignored when computing
          the status of the difference (SUCCESS, WARNING, or FAILURE).

  [fdo#109271]: https://bugs.freedesktop.org/show_bug.cgi?id=109271
  [fdo#109274]: https://bugs.freedesktop.org/show_bug.cgi?id=109274
  [fdo#109279]: https://bugs.freedesktop.org/show_bug.cgi?id=109279
  [fdo#109280]: https://bugs.freedesktop.org/show_bug.cgi?id=109280
  [fdo#109283]: https://bugs.freedesktop.org/show_bug.cgi?id=109283
  [fdo#109285]: https://bugs.freedesktop.org/show_bug.cgi?id=109285
  [fdo#109289]: https://bugs.freedesktop.org/show_bug.cgi?id=109289
  [fdo#109291]: https://bugs.freedesktop.org/show_bug.cgi?id=109291
  [fdo#109295]: https://bugs.freedesktop.org/show_bug.cgi?id=109295
  [fdo#109309]: https://bugs.freedesktop.org/show_bug.cgi?id=109309
  [fdo#109506]: https://bugs.freedesktop.org/show_bug.cgi?id=109506
  [fdo#109642]: https://bugs.freedesktop.org/show_bug.cgi?id=109642
  [fdo#110189]: https://bugs.freedesktop.org/show_bug.cgi?id=110189
  [fdo#110723]: https://bugs.freedesktop.org/show_bug.cgi?id=110723
  [fdo#111068]: https://bugs.freedesktop.org/show_bug.cgi?id=111068
  [fdo#111314]: https://bugs.freedesktop.org/show_bug.cgi?id=111314
  [fdo#111614]: https://bugs.freedesktop.org/show_bug.cgi?id=111614
  [fdo#111615]: https://bugs.freedesktop.org/show_bug.cgi?id=111615
  [fdo#111825]: https://bugs.freedesktop.org/show_bug.cgi?id=111825
  [fdo#111827]: https://bugs.freedesktop.org/show_bug.cgi?id=111827
  [fdo#112022]: https://bugs.freedesktop.org/show_bug.cgi?id=112022
  [fdo#112283]: https://bugs.freedesktop.org/show_bug.cgi?id=112283
  [i915#1063]: https://gitlab.freedesktop.org/drm/intel/issues/1063
  [i915#1072]: https://gitlab.freedesktop.org/drm/intel/issues/1072
  [i915#1149]: https://gitlab.freedesktop.org/drm/intel/issues/1149
  [i915#1155]: https://gitlab.freedesktop.org/drm/intel/issues/1155
  [i915#1187]: https://gitlab.freedesktop.org/drm/intel/issues/1187
  [i915#132]: https://gitlab.freedesktop.org/drm/intel/issues/132
  [i915#1397]: https://gitlab.freedesktop.org/drm/intel/issues/1397
  [i915#146]: https://gitlab.freedesktop.org/drm/intel/issues/146
  [i915#1769]: https://gitlab.freedesktop.org/drm/intel/issues/1769
  [i915#1825]: https://gitlab.freedesktop.org/drm/intel/issues/1825
  [i915#1839]: https://gitlab.freedesktop.org/drm/intel/issues/1839
  [i915#1845]: https://gitlab.freedesktop.org/drm/intel/issues/1845
  [i915#1849]: https://gitlab.freedesktop.org/drm/intel/issues/1849
  [i915#2122]: https://gitlab.freedesktop.org/drm/intel/issues/2122
  [i915#2190]: https://gitlab.freedesktop.org/drm/intel/issues/2190
  [i915#2436]: https://gitlab.freedesktop.org/drm/intel/issues/2436
  [i915#2437]: https://gitlab.freedesktop.org/drm/intel/issues/2437
  [i915#2527]: https://gitlab.freedesktop.org/drm/intel/issues/2527
  [i915#2530]: https://gitlab.freedesktop.org/drm/intel/issues/2530
  [i915#2705]: https://gitlab.freedesktop.org/drm/intel/issues/2705
  [i915#280]: https://gitlab.freedesktop.org/drm/intel/issues/280
  [i915#2842]: https://gitlab.freedesktop.org/drm/intel/issues/2842
  [i915#2856]: https://gitlab.freedesktop.org/drm/intel/issues/2856
  [i915#2994]: https://gitlab.freedesktop.org/drm/intel/issues/2994
  [i915#3002]: https://gitlab.freedesktop.org/drm/intel/issues/3002
  [i915#3063]: https://gitlab.freedesktop.org/drm/intel/issues/3063
  [i915#3281]: https://gitlab.freedesktop.org/drm/intel/issues/3281
  [i915#3282]: https://gitlab.freedesktop.org/drm/intel/issues/3282
  [i915#3297]: https://gitlab.freedesktop.org/drm/intel/issues/3297
  [i915#3301]: https://gitlab.freedesktop.org/drm/intel/issues/3301
  [i915#3318]: https://gitlab.freedesktop.org/drm/intel/issues/3318
  [i915#3319]: https://gitlab.freedesktop.org/drm/intel/issues/3319
  [i915#3359]: https://gitlab.freedesktop.org/drm/intel/issues/3359
  [i915#3458]: https://gitlab.freedesktop.org/drm/intel/issues/3458
  [i915#3469]: https://gitlab.freedesktop.org/drm/intel/issues/3469
  [i915#3539]: https://gitlab.freedesktop.org/drm/intel/issues/3539
  [i915#3555]: https://gitlab.freedesktop.org/drm/intel/issues/3555
  [i915#3576]: https://gitlab.freedesktop.org/drm/intel/issues/3576
  [i915#3580]: https://gitlab.freedesktop.org/drm/intel/issues/3580
  [i915#3637]: https://gitlab.freedesktop.org/drm/intel/issues/3637
  [i915#3638]: https://gitlab.freedesktop.org/drm/intel/issues/3638
  [i915#3648]: https://gitlab.freedesktop.org/drm/intel/issues/3648
  [i915#3689]: https://gitlab.freedesktop.org/drm/intel/issues/3689
  [i915#3708]: https://gitlab.freedesktop.org/drm/intel/issues/3708
  [i915#3719]: https://gitlab.freedesktop.org/drm/intel/issues/3719
  [i915#3734]: https://gitlab.freedesktop.org/drm/intel/issues/3734
  [i915#3804]: https://gitlab.freedesktop.org/drm/intel/issues/3804
  [i915#3828]: https://gitlab.freedesktop.org/drm/intel/issues/3828
  [i915#3840]: https://gitlab.freedesktop.org/drm/intel/issues/3840
  [i915#3886]: https://gitlab.freedesktop.org/drm/intel/issues/3886
  [i915#3891]: https://gitlab.freedesktop.org/drm/intel/issues/3891
  [i915#3921]: https://gitlab.freedesktop.org/drm/intel/issues/3921
  [i915#3957]: https://gitlab.freedesktop.org/drm/intel/issues/3957
  [i915#402]: https://gitlab.freedesktop.org/drm/intel/issues/402
  [i915#4036]: https://gitlab.freedesktop.org/drm/intel/issues/4036
  [i915#4070]: https://gitlab.freedesktop.org/drm/intel/issues/4070
  [i915#4077]: https://gitlab.freedesktop.org/drm/intel/issues/4077
  [i915#4079]: https://gitlab.freedesktop.org/drm/intel/issues/4079
  [i915#4083]: https://gitlab.freedesktop.org/drm/intel/issues/4083
  [i915#4098]: https://gitlab.freedesktop.org/drm/intel/issues/4098
  [i915#4103]: https://gitlab.freedesktop.org/drm/intel/issues/4103
  [i915#4212]: https://gitlab.freedesktop.org/drm/intel/issues/4212
  [i915#426]: https://gitlab.freedesktop.org/drm/intel/issues/426
  [i915#4270]: https://gitlab.freedesktop.org/drm/intel/issues/4270
  [i915#4278]: https://gitlab.freedesktop.org/drm/intel/issues/4278
  [i915#4312]: https://gitlab.freedesktop.org/drm/intel/issues/4312
  [i915#4387]: https://gitlab.freedesktop.org/drm/intel/issues/4387
  [i915#4494]: https://gitlab.freedesktop.org/drm/intel/issues/4494
  [i915#4525]: https://gitlab.freedesktop.org/drm/intel/issues/4525
  [i915#4538]: https://gitlab.freedesktop.org/drm/intel/issues/4538
  [i915#4613]: https://gitlab.freedesktop.org/drm/intel/issues/4613
  [i915#4807]: https://gitlab.freedesktop.org/drm/intel/issues/4807
  [i915#4833]: https://gitlab.freedesktop.org/drm/intel/issues/4833
  [i915#4842]: https://gitlab.freedesktop.org/drm/intel/issues/4842
  [i915#4852]: https://gitlab.freedesktop.org/drm/intel/issues/4852
  [i915#4853]: https://gitlab.freedesktop.org/drm/intel/issues/4853
  [i915#4873]: https://gitlab.freedesktop.org/drm/intel/issues/4873
  [i915#4877]: https://gitlab.freedesktop.org/drm/intel/issues/4877
  [i915#4880]: https://gitlab.freedesktop.org/drm/intel/issues/4880
  [i915#4886]: https://gitlab.freedesktop.org/drm/intel/issues/4886
  [i915#4893]: https://gitlab.freedesktop.org/drm/intel/issues/4893
  [i915#4935]: https://gitlab.freedesktop.org/drm/intel/issues/4935
  [i915#4957]: https://gitlab.freedesktop.org/drm/intel/issues/4957
  [i915#4991]: https://gitlab.freedesktop.org/drm/intel/issues/4991
  [i915#5030]: https://gitlab.freedesktop.org/drm/intel/issues/5030
  [i915#5076]: https://gitlab.freedesktop.org/drm/intel/issues/5076
  [i915#5098]: https://gitlab.freedesktop.org/drm/intel/issues/5098
  [i915#5127]: https://gitlab.freedesktop.org/drm/intel/issues/5127
  [i915#5176]: https://gitlab.freedesktop.org/drm/intel/issues/5176
  [i915#5235]: https://gitlab.freedesktop.org/drm/intel/issues/5235
  [i915#5257]: https://gitlab.freedesktop.org/drm/intel/issues/5257
  [i915#533]: https://gitlab.freedesktop.org/drm/intel/issues/533
  [i915#658]: https://gitlab.freedesktop.org/drm/intel/issues/658


Build changes
-------------

  * Linux: CI_DRM_11334 -> Patchwork_22505

  CI-20190529: 20190529
  CI_DRM_11334: e7af229f52672104f4b170304c80e2d6849a2489 @ git://anongit.freedesktop.org/gfx-ci/linux
  IGT_6367: f8eac64564b12326721f1d5bea692bde4fe1ef15 @ https://gitlab.freedesktop.org/drm/igt-gpu-tools.git
  Patchwork_22505: 95969abb7e7c31eccfac16660789236f2ee14131 @ git://anongit.freedesktop.org/gfx-ci/linux


== Linux commits ==

95969abb7e7c drm/doc/rfc: VM_BIND uapi definition
a40e87e2a2f3 drm/doc/rfc: VM_BIND feature design document

== Logs ==

For more details see: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22505/index.html



* Re: [RFC v2 1/2] drm/doc/rfc: VM_BIND feature design document
  2022-03-07 20:31   ` [Intel-gfx] " Niranjana Vishwanathapura
@ 2022-03-09 15:58     ` Alex Deucher
  -1 siblings, 0 replies; 31+ messages in thread
From: Alex Deucher @ 2022-03-09 15:58 UTC (permalink / raw)
  To: Niranjana Vishwanathapura
  Cc: Daniel Vetter, Intel Graphics Development, Thomas Hellstrom,
	chris.p.wilson, Maling list - DRI developers

On Mon, Mar 7, 2022 at 3:30 PM Niranjana Vishwanathapura
<niranjana.vishwanathapura@intel.com> wrote:
>
> VM_BIND design document with description of intended use cases.
>
> Signed-off-by: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>
> ---
>  Documentation/gpu/rfc/i915_vm_bind.rst | 210 +++++++++++++++++++++++++
>  Documentation/gpu/rfc/index.rst        |   4 +
>  2 files changed, 214 insertions(+)
>  create mode 100644 Documentation/gpu/rfc/i915_vm_bind.rst
>
> diff --git a/Documentation/gpu/rfc/i915_vm_bind.rst b/Documentation/gpu/rfc/i915_vm_bind.rst
> new file mode 100644
> index 000000000000..cdc6bb25b942
> --- /dev/null
> +++ b/Documentation/gpu/rfc/i915_vm_bind.rst
> @@ -0,0 +1,210 @@
> +==========================================
> +I915 VM_BIND feature design and use cases
> +==========================================
> +
> +VM_BIND feature
> +================
> +DRM_I915_GEM_VM_BIND/UNBIND ioctls allows UMD to bind/unbind GEM buffer
> +objects (BOs) or sections of a BOs at specified GPU virtual addresses on
> +a specified address space (VM).
> +
> +These mappings (also referred to as persistent mappings) will be persistent
> +across multiple GPU submissions (execbuff) issued by the UMD, without user
> +having to provide a list of all required mappings during each submission
> +(as required by older execbuff mode).
> +
> +VM_BIND ioctl defers binding the mappings until the next execbuff submission
> +where it will be required, or immediately if I915_GEM_VM_BIND_IMMEDIATE
> +flag is set (useful if mapping is required for an active context).
> +
> +VM_BIND feature is advertised to user via I915_PARAM_HAS_VM_BIND.
> +User has to opt-in for VM_BIND mode of binding for an address space (VM)
> +during VM creation time via I915_VM_CREATE_FLAGS_USE_VM_BIND extension.
> +A VM in VM_BIND mode will not support older execbuff mode of binding.
> +
> +UMDs can still send BOs of these persistent mappings in execlist of execbuff
> +for specifying BO dependencies (implicit fencing) and to use BO as a batch,
> +but those BOs should be mapped ahead via vm_bind ioctl.
> +
> +VM_BIND features include,
> +- Multiple Virtual Address (VA) mappings can map to the same physical pages
> +  of an object (aliasing).
> +- VA mapping can map to a partial section of the BO (partial binding).
> +- Support capture of persistent mappings in the dump upon GPU error.
> +- TLB is flushed upon unbind completion. Batching of TLB flushes in some
> +  usecases will be helpful.
> +- Asynchronous vm_bind and vm_unbind support.
> +- VM_BIND uses user/memory fence mechanism for signaling bind completion
> +  and for signaling batch completion in long running contexts (explained
> +  below).
> +
> +VM_PRIVATE objects
> +------------------
> +By default, BOs can be mapped on multiple VMs and can also be dma-buf
> +exported. Hence these BOs are referred to as Shared BOs.
> +During each execbuff submission, the request fence must be added to the
> +dma-resv fence list of all shared BOs mapped on the VM.
> +
> +VM_BIND feature introduces an optimization where user can create BO which
> +is private to a specified VM via I915_GEM_CREATE_EXT_VM_PRIVATE flag during
> +BO creation. Unlike Shared BOs, these VM private BOs can only be mapped on
> +the VM they are private to and can't be dma-buf exported.
> +All private BOs of a VM share the dma-resv object. Hence during each execbuff
> +submission, they need only one dma-resv fence list updated. Thus the fast
> +path (where required mappings are already bound) submission latency is O(1)
> +w.r.t the number of VM private BOs.
> +
> +VM_BIND locking hierarchy
> +-------------------------
> +VM_BIND locking order is as below.
> +
> +1) A vm_bind mutex will protect vm_bind lists. This lock is taken in vm_bind/
> +   vm_unbind ioctl calls, in the execbuff path and while releasing the mapping.
> +
> +   In future, when GPU page faults are supported, we can potentially use a
> +   rwsem instead, so that multiple pagefault handlers can take the read side
> +   lock to lookup the mapping and hence can run in parallel.
> +
> +2) The BO's dma-resv lock will protect i915_vma state and needs to be held
> +   while binding a vma and while updating dma-resv fence list of a BO.
> +   The private BOs of a VM will all share a dma-resv object.
> +
> +   This lock is held in vm_bind call for immediate binding, during vm_unbind
> +   call for unbinding and during execbuff path for binding the mapping and
> +   updating the dma-resv fence list of the BO.
> +
> +3) Spinlock/s to protect some of the VM's lists.
> +
> +We will also need support for bulk LRU movement of persistent mappings to
> +avoid additional latencies in execbuff path.
> +
> +GPU page faults
> +----------------
> +Both older execbuff mode and the newer VM_BIND mode of binding will require
> +using dma-fence to ensure residency.
> +In future when GPU page faults are supported, no dma-fence usage is required
> +as residency is purely managed by installing and removing/invalidating ptes.
> +
> +
> +User/Memory Fence
> +==================
> +The idea is to take a user specified virtual address and install an interrupt
> +handler to wake up the current task when the memory location passes the user
> +supplied filter.
> +
> +User/Memory fence is a <address, value> pair. To signal the user fence,
> +specified value will be written at the specified virtual address and
> +wakeup the waiting process. User can wait on an user fence with the
> +gem_wait_user_fence ioctl.
> +
> +It also allows the user to emit their own MI_FLUSH/PIPE_CONTROL notify
> +interrupt within their batches after updating the value to have sub-batch
> +precision on the wakeup. Each batch can signal an user fence to indicate
> +the completion of next level batch. The completion of very first level batch
> +needs to be signaled by the command streamer. The user must provide the
> +user/memory fence for this via the DRM_I915_GEM_EXECBUFFER_EXT_USER_FENCE
> +extension of execbuff ioctl, so that KMD can setup the command streamer to
> +signal it.
> +
> +User/Memory fence can also be supplied to the kernel driver to signal/wake up
> +the user process after completion of an asynchronous operation.
> +
> +When VM_BIND ioctl was provided with a user/memory fence via the
> +I915_VM_BIND_EXT_USER_FENCE extension, it will be signaled upon the completion
> +of binding of that mapping. All async binds/unbinds are serialized, hence
> +signaling of user/memory fence also indicate the completion of all previous
> +binds/unbinds.
> +
> +This feature will be derived from the below original work:
> +https://patchwork.freedesktop.org/patch/349417/
> +
> +
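A minimal sketch of the wakeup predicate described above, evaluated in plain C.
The enum mirrors a subset of the I915_UFENCE_WAIT_* ops from the uapi draft
(the wrap-safe BEFORE/AFTER ops are omitted); the helper name is hypothetical:

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative subset of the proposed wakeup condition operators. */
enum ufence_op {
	UFENCE_EQ,
	UFENCE_NEQ,
	UFENCE_GT,
	UFENCE_GTE,
	UFENCE_LT,
	UFENCE_LTE,
};

/* Evaluate the documented condition: (*addr & MASK) OP (VALUE & MASK).
 * fence_word stands in for the value read from the fence address. */
static int ufence_signaled(uint64_t fence_word, uint64_t value,
			   uint64_t mask, enum ufence_op op)
{
	uint64_t a = fence_word & mask;
	uint64_t b = value & mask;

	switch (op) {
	case UFENCE_EQ:  return a == b;
	case UFENCE_NEQ: return a != b;
	case UFENCE_GT:  return a > b;
	case UFENCE_GTE: return a >= b;
	case UFENCE_LT:  return a < b;
	case UFENCE_LTE: return a <= b;
	}
	return 0;
}
```

With I915_UFENCE_WAIT_U8 as the mask, only the low byte of the fence word
participates in the comparison, matching the masked-compare semantics above.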
> +VM_BIND use cases
> +==================
> +
> +Long running Compute contexts
> +------------------------------
> +Usage of dma-fence expects that they complete in a reasonable amount of
> +time. Compute, on the other hand, can be long running. Hence it is
> +appropriate for compute to use user/memory fences, with dma-fence usage
> +limited to in-kernel consumption only. This requires an execbuff uapi
> +extension to pass in the user fence. Compute must opt in to this mechanism
> +with the I915_CONTEXT_CREATE_FLAGS_LONG_RUNNING flag during context
> +creation.
> +
> +The dma-fence based user interfaces, like the gem_wait ioctl, execbuff out
> +fences and implicit dependency setting, are not allowed on long running
> +contexts.
> +
> +Where GPU page faults are not available, the kernel driver, upon buffer
> +invalidation, will initiate a suspend (preemption) of the long running
> +context with a dma-fence attached to it. Upon completion of that suspend
> +fence, it finishes the invalidation, revalidates the BO and then resumes
> +the compute context. This is done by having a per-context fence (called a
> +suspend fence) proxying as the i915_request fence. This suspend fence is
> +enabled when there is a wait on it, which triggers the context preemption.
> +
> +This is much easier to support with VM_BIND compared to the current heavier
> +execbuff path resource attachment.
> +
> +Low Latency Submission
> +-----------------------
> +Allows compute UMD to directly submit GPU jobs instead of through execbuff
> +ioctl. VM_BIND allows map/unmap of BOs required for directly submitted jobs.
> +
> +Debugger
> +---------
> +With the debug event interface, a user space process (debugger) is able to
> +keep track of and act upon resources created by another process (debuggee)
> +and attached to the GPU via the vm_bind interface.
> +
> +Mesa/Valkun

s/Valkun/Vulkan/

Alex

> +------------
> +VM_BIND can potentially reduce the CPU overhead in Mesa, thus improving
> +performance. For Vulkan it should be straightforward to use VM_BIND.
> +For Iris, implicit buffer tracking must be implemented before we can harness
> +VM_BIND benefits. With increasing GPU hardware performance, reducing CPU
> +overhead becomes more important.
> +
> +Page level hints settings
> +--------------------------
> +VM_BIND allows any hints setting per mapping instead of per BO.
> +Possible hints include read-only, placement and atomicity.
> +Sub-BO level placement hint will be even more relevant with
> +upcoming GPU on-demand page fault support.
> +
> +Page level Cache/CLOS settings
> +-------------------------------
> +VM_BIND allows cache/CLOS settings per mapping instead of per BO.
> +
> +Shared Virtual Memory (SVM) support
> +------------------------------------
> +VM_BIND interface can be used to map system memory directly (without gem BO
> +abstraction) using the HMM interface.
> +
> +
> +Broader i915 cleanups
> +=====================
> +Supporting this whole new vm_bind mode of binding, which comes with its own
> +use cases and locking requirements, requires proper integration with the
> +existing i915 driver. This calls for some broader i915 driver
> +cleanups/simplifications for maintainability of the driver going forward.
> +Here are a few things that have been identified and are being looked into.
> +
> +- Make pagetable allocations evictable and manage them similar to VM_BIND
> +  mapped objects. Page table pages are similar to persistent mappings of a
> +  VM (the difference here is that the page table pages will not have an
> +  i915_vma structure, and after swapping pages back in, the parent page
> +  link needs to be updated).
> +- Remove the vma lookup cache (eb->gem_context->handles_vma). The VM_BIND
> +  feature does not use it, and the complexity it brings in is probably more
> +  than the performance advantage we get in the legacy execbuff case.
> +- Remove vma->open_count counting.
> +- Remove i915_vma active reference tracking. Instead, use the underlying
> +  BO's dma-resv fence list to determine if an i915_vma is active or not.
> +
> +These can be worked upon after initial vm_bind support is added.
> +
> +
> +UAPI
> +=====
> +Uapi definition can be found here:
> +.. kernel-doc:: Documentation/gpu/rfc/i915_vm_bind.h
> diff --git a/Documentation/gpu/rfc/index.rst b/Documentation/gpu/rfc/index.rst
> index 91e93a705230..7d10c36b268d 100644
> --- a/Documentation/gpu/rfc/index.rst
> +++ b/Documentation/gpu/rfc/index.rst
> @@ -23,3 +23,7 @@ host such documentation:
>  .. toctree::
>
>      i915_scheduler.rst
> +
> +.. toctree::
> +
> +    i915_vm_bind.rst
> --
> 2.21.0.rc0.32.g243a4c7e27
>

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [Intel-gfx] [RFC v2 2/2] drm/doc/rfc: VM_BIND uapi definition
  2022-03-07 20:31   ` [Intel-gfx] " Niranjana Vishwanathapura
  (?)
@ 2022-03-30 12:51   ` Daniel Vetter
  2022-04-20 20:18     ` Niranjana Vishwanathapura
  -1 siblings, 1 reply; 31+ messages in thread
From: Daniel Vetter @ 2022-03-30 12:51 UTC (permalink / raw)
  To: Niranjana Vishwanathapura
  Cc: daniel.vetter, intel-gfx, thomas.hellstrom, chris.p.wilson, dri-devel

On Mon, Mar 07, 2022 at 12:31:46PM -0800, Niranjana Vishwanathapura wrote:
> VM_BIND and related uapi definitions
> 
> Signed-off-by: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>
> ---
>  Documentation/gpu/rfc/i915_vm_bind.h | 176 +++++++++++++++++++++++++++

Maybe as the top level comment: The point of documenting uapi isn't to
just spell out all the fields, but to define _how_ and _why_ things work.
This part is completely missing from these docs here.

>  1 file changed, 176 insertions(+)
>  create mode 100644 Documentation/gpu/rfc/i915_vm_bind.h
> 
> diff --git a/Documentation/gpu/rfc/i915_vm_bind.h b/Documentation/gpu/rfc/i915_vm_bind.h
> new file mode 100644
> index 000000000000..80f00ee6c8a1
> --- /dev/null
> +++ b/Documentation/gpu/rfc/i915_vm_bind.h

You need to include this somewhere so it's rendered, see the previous
examples.

> @@ -0,0 +1,176 @@
> +/* SPDX-License-Identifier: MIT */
> +/*
> + * Copyright © 2022 Intel Corporation
> + */
> +
> +/* VM_BIND feature availability through drm_i915_getparam */
> +#define I915_PARAM_HAS_VM_BIND		57

Needs to be kernel-docified, which means we need a prep patch that fixes
up the existing mess.

> +
> +/* VM_BIND related ioctls */
> +#define DRM_I915_GEM_VM_BIND		0x3d
> +#define DRM_I915_GEM_VM_UNBIND		0x3e
> +#define DRM_I915_GEM_WAIT_USER_FENCE	0x3f
> +
> +#define DRM_IOCTL_I915_GEM_VM_BIND		DRM_IOWR(DRM_COMMAND_BASE + DRM_I915_GEM_VM_BIND, struct drm_i915_gem_vm_bind)
> +#define DRM_IOCTL_I915_GEM_VM_UNBIND		DRM_IOWR(DRM_COMMAND_BASE + DRM_I915_GEM_VM_UNBIND, struct drm_i915_gem_vm_bind)
> +#define DRM_IOCTL_I915_GEM_WAIT_USER_FENCE	DRM_IOWR(DRM_COMMAND_BASE + DRM_I915_GEM_WAIT_USER_FENCE, struct drm_i915_gem_wait_user_fence)
> +
> +/**
> + * struct drm_i915_gem_vm_bind - VA to object/buffer mapping to [un]bind.

Both binding and unbinding need to specify in excruciating detail what
happens if there's overlaps (existing mappings, or unmapping a range which
has no mapping, or only partially full of maps or different objects) and
fun stuff like that.

> + */
> +struct drm_i915_gem_vm_bind {
> +	/** vm to [un]bind */
> +	__u32 vm_id;
> +
> +	/**
> +	 * BO handle or file descriptor.
> +	 * 'fd' value of -1 is reserved for system pages (SVM)
> +	 */
> +	union {
> +		__u32 handle; /* For unbind, it is reserved and must be 0 */

I think it'd be a lot cleaner if we do a bind and an unbind struct for
these, instead of mixing it up.

Also I thought mesa requested to be able to unmap an object from a vm
without a range. Has that been dropped, and confirmed to not be needed.

> +		__s32 fd;

If we don't need it right away then don't add it yet. If it's planned to
be used then it needs to be documented, but I kinda have no idea why you'd
need an fd for svm?

> +	}
> +
> +	/** VA start to [un]bind */
> +	__u64 start;
> +
> +	/** Offset in object to [un]bind */
> +	__u64 offset;
> +
> +	/** VA length to [un]bind */
> +	__u64 length;
> +
> +	/** Flags */
> +	__u64 flags;
> +	/** Bind the mapping immediately instead of during next submission */

This aint kerneldoc.

Also this needs to specify in much more detail what exactly this means,
and also how it interacts with execbuf.

So the patch here probably needs to include the missing pieces on the
execbuf side of things. Like how does execbuf work when it's used with a
vm_bind managed vm? That means:
- document the pieces that are there
- then add a patch to document how that all changes with vm_bind

And do that for everything execbuf can do.

> +#define I915_GEM_VM_BIND_IMMEDIATE   (1 << 0)
> +	/** Read-only mapping */
> +#define I915_GEM_VM_BIND_READONLY    (1 << 1)
> +	/** Capture this mapping in the dump upon GPU error */
> +#define I915_GEM_VM_BIND_CAPTURE     (1 << 2)
> +
> +	/** Zero-terminated chain of extensions */
> +	__u64 extensions;
> +};
> +
> +/**
> + * struct drm_i915_vm_bind_ext_user_fence - Bind completion signaling extension.
> + */
> +struct drm_i915_vm_bind_ext_user_fence {
> +#define I915_VM_BIND_EXT_USER_FENCE	0
> +	/** @base: Extension link. See struct i915_user_extension. */
> +	struct i915_user_extension base;
> +
> +	/** User/Memory fence qword aligned process virtual address */
> +	__u64 addr;
> +
> +	/** User/Memory fence value to be written after bind completion */
> +	__u64 val;
> +
> +	/** Reserved for future extensions */
> +	__u64 rsvd;
> +};
> +
> +/**
> + * struct drm_i915_gem_execbuffer_ext_user_fence - First level batch completion
> + * signaling extension.
> + *
> + * This extension allows user to attach a user fence (<addr, value> pair) to an
> + * execbuf to be signaled by the command streamer after the completion of 1st
> + * level batch, by writing the <value> at specified <addr> and triggering an
> + * interrupt.
> + * User can either poll for this user fence to signal or can also wait on it
> + * with i915_gem_wait_user_fence ioctl.
> + * This is very useful for long running contexts, where waiting on dma-fence
> + * by the user (like the i915_gem_wait ioctl) is not supported.
> + */
> +struct drm_i915_gem_execbuffer_ext_user_fence {
> +#define DRM_I915_GEM_EXECBUFFER_EXT_USER_FENCE		0
> +	/** @base: Extension link. See struct i915_user_extension. */
> +	struct i915_user_extension base;
> +
> +	/**
> +	 * User/Memory fence qword aligned GPU virtual address.
> +	 * Address has to be a valid GPU virtual address at the time of
> +	 * 1st level batch completion.
> +	 */
> +	__u64 addr;
> +
> +	/**
> +	 * User/Memory fence Value to be written to above address
> +	 * after 1st level batch completes.
> +	 */
> +	__u64 value;
> +
> +	/** Reserved for future extensions */
> +	__u64 rsvd;
> +};
> +
> +struct drm_i915_gem_vm_control {
> +/** Flag to opt-in for VM_BIND mode of binding during VM creation */

This is very confusingly documented and I have no idea how you're going
to use an empty extension. Also it's not kerneldoc.

Please check that the stuff you're creating renders properly in the html
output.

> +#define I915_VM_CREATE_FLAGS_USE_VM_BIND	(1 << 0)
> +};
> +
> +
> +struct drm_i915_gem_create_ext {
> +/** Extension to make the object private to a specified VM */
> +#define I915_GEM_CREATE_EXT_VM_PRIVATE		2

Why 2?

Also this all needs to be documented what it precisely means.

> +};
> +
> +
> +struct prelim_drm_i915_gem_context_create_ext {
> +/** Flag to declare context as long running */
> +#define I915_CONTEXT_CREATE_FLAGS_LONG_RUNNING   (1u << 2)

The compute mode context, again including full impact on execbuf, is not
documented here. This also means any gaps in the context uapi
documentation need to be filled first in prep patches.

Also memory fences are extremely tricky, we need to specify in detail when
they're allowed to be used and when not. This needs to reference the
relevant sections from the dma-fence docs.

> +};
> +
> +/**
> + * struct drm_i915_gem_wait_user_fence
> + *
> + * Wait on user/memory fence. User/Memory fence can be woken up either by,
> + *    1. GPU context indicated by 'ctx_id', or,
> + *    2. Kernel driver async worker upon I915_UFENCE_WAIT_SOFT.
> + *       'ctx_id' is ignored when this flag is set.
> + *
> + * Wakeup when below condition is true.
> + * (*addr & MASK) OP (VALUE & MASK)
> + *
> + */
> +struct drm_i915_gem_wait_user_fence {
> +	/** @base: Extension link. See struct i915_user_extension. */
> +	__u64 extensions;
> +
> +	/** User/Memory fence address */
> +	__u64 addr;
> +
> +	/** Id of the Context which will signal the fence. */
> +	__u32 ctx_id;
> +
> +	/** Wakeup condition operator */
> +	__u16 op;
> +#define I915_UFENCE_WAIT_EQ      0
> +#define I915_UFENCE_WAIT_NEQ     1
> +#define I915_UFENCE_WAIT_GT      2
> +#define I915_UFENCE_WAIT_GTE     3
> +#define I915_UFENCE_WAIT_LT      4
> +#define I915_UFENCE_WAIT_LTE     5
> +#define I915_UFENCE_WAIT_BEFORE  6
> +#define I915_UFENCE_WAIT_AFTER   7
> +
> +	/** Flags */
> +	__u16 flags;
> +#define I915_UFENCE_WAIT_SOFT    0x1
> +#define I915_UFENCE_WAIT_ABSTIME 0x2
> +
> +	/** Wakeup value */
> +	__u64 value;
> +
> +	/** Wakeup mask */
> +	__u64 mask;
> +#define I915_UFENCE_WAIT_U8     0xffu
> +#define I915_UFENCE_WAIT_U16    0xffffu
> +#define I915_UFENCE_WAIT_U32    0xfffffffful
> +#define I915_UFENCE_WAIT_U64    0xffffffffffffffffull

Do we really need all these flags, and does the hw really support all the
combinations? Anything the hw doesn't support in MI_SEMAPHORE is pretty
much useless as a umf (userspace memory fence) mode.

> +
> +	/** Timeout */

Needs to specify the clock source.
-Daniel

> +	__s64 timeout;
> +};
> -- 
> 2.21.0.rc0.32.g243a4c7e27
> 

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [Intel-gfx] [RFC v2 1/2] drm/doc/rfc: VM_BIND feature design document
  2022-03-07 20:31   ` [Intel-gfx] " Niranjana Vishwanathapura
@ 2022-03-31  8:28     ` Daniel Vetter
  -1 siblings, 0 replies; 31+ messages in thread
From: Daniel Vetter @ 2022-03-31  8:28 UTC (permalink / raw)
  To: Niranjana Vishwanathapura, Jason Ekstrand, Bloomfield, Jon,
	Dave Airlie, Ben Skeggs, Christian König, Daniel Stone
  Cc: daniel.vetter, intel-gfx, thomas.hellstrom, chris.p.wilson, dri-devel

Adding a pile of people who've expressed interest in vm_bind for their
drivers.

Also note to the intel folks: This is largely written with me having my
subsystem co-maintainer hat on, i.e. what I think is the right thing to do
here for the subsystem at large. There is substantial rework involved
here, but it's not any different from i915 adopting ttm or i915 adopting
drm/sched, and I do think this stuff needs to happen in one form or
another.

On Mon, Mar 07, 2022 at 12:31:45PM -0800, Niranjana Vishwanathapura wrote:
> VM_BIND design document with description of intended use cases.
>
> Signed-off-by: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>
> ---
>  Documentation/gpu/rfc/i915_vm_bind.rst | 210 +++++++++++++++++++++++++
>  Documentation/gpu/rfc/index.rst        |   4 +
>  2 files changed, 214 insertions(+)
>  create mode 100644 Documentation/gpu/rfc/i915_vm_bind.rst
>
> diff --git a/Documentation/gpu/rfc/i915_vm_bind.rst b/Documentation/gpu/rfc/i915_vm_bind.rst
> new file mode 100644
> index 000000000000..cdc6bb25b942
> --- /dev/null
> +++ b/Documentation/gpu/rfc/i915_vm_bind.rst
> @@ -0,0 +1,210 @@
> +==========================================
> +I915 VM_BIND feature design and use cases
> +==========================================
> +
> +VM_BIND feature
> +================
> +DRM_I915_GEM_VM_BIND/UNBIND ioctls allows UMD to bind/unbind GEM buffer
> +objects (BOs) or sections of a BOs at specified GPU virtual addresses on
> +a specified address space (VM).
> +
> +These mappings (also referred to as persistent mappings) will be persistent
> +across multiple GPU submissions (execbuff) issued by the UMD, without user
> +having to provide a list of all required mappings during each submission
> +(as required by older execbuff mode).
> +
> +The VM_BIND ioctl defers binding the mappings until the next execbuff
> +submission where they will be required, or binds immediately if the
> +I915_GEM_VM_BIND_IMMEDIATE flag is set (useful if a mapping is required
> +for an active context).

So this is a screw-up I've done, and for upstream I think we need to fix
it: Implicit sync is bad, and it's also still a bad idea for vm_bind, and
I was wrong suggesting we should do this a few years back when we kicked
this off internally :-(

What I think we need is just always VM_BIND_IMMEDIATE mode, and then a few
things on top:
- in and out fences, like with execbuf, to allow userspace to sync with
  execbuf as needed
- for compute-mode context this means userspace memory fences
- for legacy context this means a timeline syncobj in drm_syncobj

No sync_file or anything else like this at all. This means a bunch of
work, but also it'll have benefits because it means we should be able to
use exactly the same code paths and logic for both compute and for legacy
context, because drm_syncobj support future fence semantics.

Also on the implementation side we still need to install dma_fence to the
various dma_resv, and for this we need the new dma_resv_usage series from
Christian König first. vm_bind fences can then use the USAGE_BOOKKEEPING
flag to make sure they never result in an oversync issue with execbuf. I
don't think trying to land vm_bind without that prep work in
dma_resv_usage makes sense.

Also as soon as dma_resv_usage has landed there's a few cleanups we should
do in i915:
- ttm bo moving code should probably simplify a bit (and maybe more of the
  code should be pushed as helpers into ttm)
- clflush code should be moved over to using USAGE_KERNEL and the various
  hacks and special cases should be ditched. See df94fd05e69e ("drm/i915:
  expand on the kernel-doc for cache_dirty") for a bit more context

This is still not yet enough, since if a vm_bind races with an eviction we
might stall on the new buffers being readied first before the context can
continue. This needs some care to make sure that vma which aren't fully
bound yet are on a separate list, and vma which are marked for unbinding
are removed from the main working set list as soon as possible.

All of these things are relevant for the uapi semantics, which means
- they need to be documented in the uapi kerneldoc, ideally with example
  flows
- umd need to ack this

The other thing here is the async/nonblocking path. I think we still need
that one, but again it should not sync with anything going on in execbuf,
but simply execute the ioctl code in a kernel thread. The idea here is
that this works like a special gpu engine, so that compute and vk can
schedule bindings interleaved with rendering. This should be enough to get
a performant vk sparse binding/textures implementation.

But I'm not entirely sure on this one, so this definitely needs acks from
umds.

> +VM_BIND feature is advertised to user via I915_PARAM_HAS_VM_BIND.
> +User has to opt-in for VM_BIND mode of binding for an address space (VM)
> +during VM creation time via I915_VM_CREATE_FLAGS_USE_VM_BIND extension.
> +A VM in VM_BIND mode will not support older execbuff mode of binding.
> +
> +UMDs can still send BOs of these persistent mappings in execlist of execbuff
> +for specifying BO dependencies (implicit fencing) and to use BO as a batch,
> +but those BOs should be mapped ahead via vm_bind ioctl.

should or must?

Also I'm not really sure that's a great interface. The batchbuffer really
only needs to be an address, so maybe all we need is an extension to
supply an u64 batchbuffer address instead of trying to retrofit this into
an unfitting current uapi.

And for implicit sync there's two things:
- for vk I think the right uapi is the dma-buf fence import/export ioctls
  from Jason Ekstrand. I think we should land that first instead of
  hacking funny concepts together
- for gl the dma-buf import/export might not be fast enough, since gl
  needs to do a _lot_ of implicit sync. There we might need to use the
  execbuffer buffer list, but then we should have extremely clear uapi
  rules which disallow _everything_ except setting the explicit sync uapi

Again all this stuff needs to be documented in detail in the kerneldoc
uapi spec.

> +VM_BIND features include:
> +- Multiple Virtual Address (VA) mappings can map to the same physical pages
> +  of an object (aliasing).
> +- A VA mapping can map to a partial section of the BO (partial binding).
> +- Support capture of persistent mappings in the dump upon GPU error.
> +- TLB is flushed upon unbind completion. Batching of TLB flushes in some
> +  use cases will be helpful.
> +- Asynchronous vm_bind and vm_unbind support.
> +- VM_BIND uses a user/memory fence mechanism for signaling bind completion
> +  and for signaling batch completion in long running contexts (explained
> +  below).

This should all be in the kerneldoc.

> +VM_PRIVATE objects
> +------------------
> +By default, BOs can be mapped on multiple VMs and can also be dma-buf
> +exported. Hence these BOs are referred to as Shared BOs.
> +During each execbuff submission, the request fence must be added to the
> +dma-resv fence list of all shared BOs mapped on the VM.
> +
> +The VM_BIND feature introduces an optimization where the user can create a
> +BO which is private to a specified VM via the I915_GEM_CREATE_EXT_VM_PRIVATE
> +flag during BO creation. Unlike Shared BOs, these VM private BOs can only be
> +mapped on the VM they are private to and can't be dma-buf exported.
> +All private BOs of a VM share the dma-resv object. Hence during each execbuff
> +submission, they need only one dma-resv fence list update. Thus the fast
> +path (where required mappings are already bound) submission latency is O(1)
> +w.r.t. the number of VM private BOs.

Two things:

- I think the above is required for initial vm_bind for vk, it kinda
  doesn't make much sense without that, and will allow us to match amdgpu
  and radeonsi

- Christian König just landed ttm bulk lru helpers, and I think we need to
  use those. This means vm_bind will only work with the ttm backend, but
  that's what we have for the big dgpu where vm_bind helps more in terms
  of performance, and the igfx conversion to ttm is already going on.

Furthermore the i915 shrinker lru has stopped being an lru, so I think
that should also be moved over to the ttm lru in some fashion to make sure
we once again have a reasonable and consistent memory aging and reclaim
architecture. The current code is just too much of a complete mess.

And since this is all fairly integral to how the code arch works I don't
think merging a different version which isn't based on ttm bulk lru
helpers makes sense.

Also I do think the page table lru handling needs to be included here,
because that's another complete hand-rolled separate world for not much
good reasons. I guess that can happen in parallel with the initial vm_bind
bring-up, but it needs to be completed by the time we add the features
beyond the initial support needed for vk.

> +VM_BIND locking hierarchy
> +-------------------------
> +VM_BIND locking order is as below.
> +
> +1) A vm_bind mutex will protect vm_bind lists. This lock is taken in vm_bind/
> +   vm_unbind ioctl calls, in the execbuff path and while releasing the mapping.
> +
> +   In future, when GPU page faults are supported, we can potentially use a
> +   rwsem instead, so that multiple pagefault handlers can take the read side
> +   lock to lookup the mapping and hence can run in parallel.
> +
> +2) The BO's dma-resv lock will protect i915_vma state and needs to be held
> +   while binding a vma and while updating dma-resv fence list of a BO.
> +   The private BOs of a VM will all share a dma-resv object.
> +
> +   This lock is held in vm_bind call for immediate binding, during vm_unbind
> +   call for unbinding and during execbuff path for binding the mapping and
> +   updating the dma-resv fence list of the BO.
> +
> +3) Spinlock/s to protect some of the VM's lists.
> +
> +We will also need support for bulk LRU movement of persistent mappings to
> +avoid additional latencies in the execbuff path.

This needs more detail and explanation of how each level is required. Also
the shared dma_resv for VM_PRIVATE objects is kinda important to explain.

Like "some of the VM's lists" explains pretty much nothing.

> +
> +GPU page faults
> +----------------
> +Both the older execbuff mode and the newer VM_BIND mode of binding will
> +require using dma-fences to ensure residency.
> +In future when GPU page faults are supported, no dma-fence usage is required
> +as residency is purely managed by installing and removing/invalidating ptes.

This is a bit confusing. I think one part of this should be moved into the
section with future vm_bind use-cases (we're not going to support page
faults with legacy softpin or even worse, relocations). The locking
discussion should be part of the much longer list of use cases that
motivate the locking design.

> +
> +
> +User/Memory Fence
> +==================
> +The idea is to take a user-specified virtual address and install an interrupt
> +handler to wake up the current task when the memory location passes the
> +user-supplied filter.
> +
> +A User/Memory fence is an <address, value> pair. To signal the user fence,
> +the specified value is written at the specified virtual address and the
> +waiting process is woken up. The user can wait on a user fence with the
> +gem_wait_user_fence ioctl.
> +
> +It also allows the user to emit their own MI_FLUSH/PIPE_CONTROL notify
> +interrupt within their batches after updating the value to have sub-batch
> +precision on the wakeup. Each batch can signal a user fence to indicate
> +the completion of the next level batch. The completion of the very first
> +level batch needs to be signaled by the command streamer. The user must
> +provide the user/memory fence for this via the
> +DRM_I915_GEM_EXECBUFFER_EXT_USER_FENCE extension of the execbuff ioctl,
> +so that the KMD can set up the command streamer to signal it.
> +
> +User/Memory fence can also be supplied to the kernel driver to signal/wake up
> +the user process after completion of an asynchronous operation.
> +
> +When the VM_BIND ioctl is provided with a user/memory fence via the
> +I915_VM_BIND_EXT_USER_FENCE extension, it will be signaled upon the completion
> +of binding of that mapping. All async binds/unbinds are serialized, hence
> +signaling of the user/memory fence also indicates the completion of all
> +previous binds/unbinds.
> +
> +This feature will be derived from the below original work:
> +https://patchwork.freedesktop.org/patch/349417/

This is 1:1 tied to long running compute mode contexts (which in the uapi
doc must reference the endless amounts of bikeshed summary we have in the
docs about indefinite fences).

I'd put this into a new section about compute and userspace memory fences
support, with this and the next chapter ...
> +
> +
> +VM_BIND use cases
> +==================

... and then make this section here focus entirely on additional vm_bind
use-cases that we'll be adding later on. Which doesn't need to go into any
details, it's just justification for why we want to build the world on top
of vm_bind.

> +
> +Long running Compute contexts
> +------------------------------
> +Usage of dma-fences expects that they complete in a reasonable amount of
> +time. Compute on the other hand can be long running. Hence it is appropriate
> +for compute to use user/memory fences, and dma-fence usage will be limited to
> +in-kernel consumption only. This requires an execbuff uapi extension to pass
> +in a user fence. Compute must opt-in for this mechanism with the
> +I915_CONTEXT_CREATE_FLAGS_LONG_RUNNING flag during context creation.
> +
> +The dma-fence based user interfaces like the gem_wait ioctl, execbuff out
> +fence and implicit dependency setting are not allowed on long running
> +contexts.
> +
> +Where GPU page faults are not available, the kernel driver will, upon buffer
> +invalidation, initiate a suspend (preemption) of the long running context with
> +a dma-fence attached to it. Upon completion of that suspend fence, it finishes
> +the invalidation, revalidates the BO and then resumes the compute context.
> +This is done by having a per-context fence (called suspend fence) proxying as
> +the i915_request fence. This suspend fence is enabled when there is a wait on
> +it, which triggers the context preemption.
> +
> +This is much easier to support with VM_BIND compared to the current heavier
> +execbuff path resource attachment.

There's a bunch of tricky code around compute mode context support, like
the preempt ctx fence (or suspend fence or whatever you want to call it),
and the resume work. And I think that code should be shared across
drivers.

I think the right place to put this is into drm/sched, somewhere attached
to the drm_sched_entity structure. I expect i915 folks to collaborate with
amd and ideally also get amdkfd to adopt the same thing if possible. At
least Christian has mentioned in the past that he's a bit unhappy about
how this works.

Also drm/sched has dependency tracking, which will be needed to pipeline
context resume operations. That needs to be used instead of i915-gem
inventing yet another dependency tracking data structure (it already has 3
and that's roughly 3 too many).

This means compute mode support and userspace memory fences are blocked on
the drm/sched conversion, but *eh* add it to the list of reasons for why
drm/sched needs to happen.

Also since we only have support for compute mode ctx in our internal tree
with the guc scheduler backend anyway, and the first conversion target is
the guc backend, I don't think this actually holds up a lot of the code.

> +Low Latency Submission
> +-----------------------
> +Allows the compute UMD to directly submit GPU jobs instead of going through
> +the execbuff ioctl. VM_BIND allows map/unmap of BOs required for directly
> +submitted jobs.

This is really just a special case of compute mode contexts, I think I'd
include that in there, but explain better what it requires (i.e. vm_bind
not being synchronized against execbuf).

> +
> +Debugger
> +---------
> +With the debug event interface, a user space process (debugger) is able to
> +keep track of and act upon resources created by another process (debuggee)
> +and attached to the GPU via the vm_bind interface.
> +
> +Mesa/Vulkan
> +------------
> +VM_BIND can potentially reduce the CPU overhead in Mesa, thus improving
> +performance. For Vulkan it should be straightforward to use VM_BIND.
> +For Iris, implicit buffer tracking must be implemented before we can harness
> +VM_BIND benefits. With increasing GPU hardware performance, reducing CPU
> +overhead becomes more important.

Just to clarify, I don't think we can land vm_bind into upstream if it
doesn't work 100% for vk. There's a bit much "can" instead of "will" in
this section.

> +
> +Page level hints settings
> +--------------------------
> +VM_BIND allows any hints setting per mapping instead of per BO.
> +Possible hints include read-only, placement and atomicity.
> +Sub-BO level placement hint will be even more relevant with
> +upcoming GPU on-demand page fault support.
> +
> +Page level Cache/CLOS settings
> +-------------------------------
> +VM_BIND allows cache/CLOS settings per mapping instead of per BO.
> +
> +Shared Virtual Memory (SVM) support
> +------------------------------------
> +VM_BIND interface can be used to map system memory directly (without gem BO
> +abstraction) using the HMM interface.

Userptr is absent here (and it's not the same as svm, at least on
discrete), and this is needed for the initial version since otherwise vk
can't use it because we're not at feature parity.

Irc discussions by Maarten and Dave came up with the idea that maybe
userptr for vm_bind should work _without_ any gem bo as backing storage,
since that guarantees that people don't come up with funny ideas like
trying to share such bo across process or mmap it and other nonsense which
just doesn't work.

> +
> +
> +Broader i915 cleanups
> +=====================
> +Supporting this whole new vm_bind mode of binding, which comes with its own
> +use cases and locking requirements, requires proper integration with the
> +existing i915 driver. This calls for some broader i915 driver
> +cleanups/simplifications for maintainability of the driver going forward.
> +Here are a few things identified that are being looked into.
> +
> +- Make page table allocations evictable and manage them similarly to VM_BIND
> +  mapped objects. Page table pages are similar to persistent mappings of a
> +  VM (the differences here are that page table pages will not have an
> +  i915_vma structure, and after swapping pages back in, the parent page
> +  link needs to be updated).

See above, but I think this should be included as part of the initial
vm_bind push.

> +- Remove the vma lookup cache (eb->gem_context->handles_vma). The VM_BIND
> +  feature does not use it, and the complexity it brings in is probably more
> +  than the performance advantage we get in the legacy execbuff case.
> +- Remove vma->open_count counting
> +- Remove i915_vma active reference tracking. Instead use the underlying BO's
> +  dma-resv fence list to determine if an i915_vma is active or not.

So this is a complete mess, and really should not exist. I think it needs
to be removed before we try to make i915_vma even more complex by adding
vm_bind.

The other thing I've been pondering here is that vm_bind is really
completely different from legacy vm structures for a lot of reasons:
- no relocation or softpin handling, which means vm_bind has no reason to
  ever look at the i915_vma structure in execbuf code. Unfortunately
  execbuf has been rewritten to be vma instead of obj centric, so it's a
  100% mismatch

- vm_bind never has to manage any vm lru. Legacy execbuf has to maintain
  that because the kernel manages the virtual address space fully. Again
  ideally that entire vma_move_to_active code and everything related to it
  would simply not exist.

- similar on the eviction side, the rules are quite different: For vm_bind
  we never tear down the vma, instead it's just moved to the list of
  evicted vma. Legacy vm have no need for all these additional lists, so
  another huge confusion.

- if the refcount is done correctly for vm_bind we wouldn't need the
  tricky code in the bo close paths. Unfortunately legacy vm with
  relocations and softpin require that vma are only a weak reference, so
  that cannot be removed.

- there's also a ton of special cases for ggtt handling, like the
  different views (for display, partial views for mmap), but also the
  gen2/3 alignment and padding requirements which vm_bind never needs.

I think the right thing here is to massively split the implementation
behind some solid vm/vma abstraction, with a base class for vm and vma
which _only_ has the pieces which both vm_bind and the legacy vm stuff
needs. But it's a bit tricky to get there. I think a workable path would
be:
- Add a new base class to both i915_address_space and i915_vma, which
  starts out empty.

- As vm_bind code lands, move things that vm_bind code needs into these
  base classes

- The goal should be that these base classes are a stand-alone library
  that other drivers could reuse. Like we've done with the buddy
  allocator, which first moved from i915-gem to i915-ttm, and which amd
  now moved to drm/ttm for reuse by amdgpu. Ideally other drivers
  interested in adding something like vm_bind should be involved from the
  start (or maybe the entire thing reused in amdgpu, they're looking at
  vk sparse binding support too or at least have perf issues I think).

- Locking must be the same across all implementations, otherwise it's
  really not an abstraction. i915 screwed this up terribly by having
  different locking rules for ppgtt and ggtt, which is just nonsense.

- The legacy specific code needs to be extracted as much as possible and
  shoved into separate files. In execbuf this means we need to get back to
  object centric flow, and the slowpaths need to become a lot simpler
  again (Maarten has cleaned up some of this, but there's still a silly
  amount of hacks in there with funny layering).

- I think if stuff like the vma eviction details (list movement and
  locking and refcounting of the underlying object)

> +
> +These can be worked upon after initial vm_bind support is added.

I don't think that works, given how badly i915-gem team screwed up in
other places. And those places had to be fixed by adopting shared code
like ttm. Plus there's already a huge unfulfilled promise pending with the
drm/sched conversion, i915-gem team is clearly deeply in the red here :-/

Cheers, Daniel

> +
> +
> +UAPI
> +=====
> +Uapi definition can be found here:
> +.. kernel-doc:: Documentation/gpu/rfc/i915_vm_bind.h
> diff --git a/Documentation/gpu/rfc/index.rst b/Documentation/gpu/rfc/index.rst
> index 91e93a705230..7d10c36b268d 100644
> --- a/Documentation/gpu/rfc/index.rst
> +++ b/Documentation/gpu/rfc/index.rst
> @@ -23,3 +23,7 @@ host such documentation:
>  .. toctree::
>
>      i915_scheduler.rst
> +
> +.. toctree::
> +
> +    i915_vm_bind.rst
> --
> 2.21.0.rc0.32.g243a4c7e27
>

--
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC v2 1/2] drm/doc/rfc: VM_BIND feature design document
@ 2022-03-31  8:28     ` Daniel Vetter
  0 siblings, 0 replies; 31+ messages in thread
From: Daniel Vetter @ 2022-03-31  8:28 UTC (permalink / raw)
  To: Niranjana Vishwanathapura, Jason Ekstrand, Bloomfield, Jon,
	Dave Airlie, Ben Skeggs, Christian König, Daniel Stone
  Cc: daniel.vetter, intel-gfx, thomas.hellstrom, chris.p.wilson, dri-devel

Adding a pile of people who've expressed interest in vm_bind for their
drivers.

Also note to the intel folks: This is largely written with me having my
subsystem co-maintainer hat on, i.e. what I think is the right thing to do
here for the subsystem at large. There is substantial rework involved
here, but it's not any different from i915 adopting ttm or i915 adopting
drm/sched, and I do think this stuff needs to happen in one form or
another.

On Mon, Mar 07, 2022 at 12:31:45PM -0800, Niranjana Vishwanathapura wrote:
> VM_BIND design document with description of intended use cases.
>
> Signed-off-by: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>
> ---
>  Documentation/gpu/rfc/i915_vm_bind.rst | 210 +++++++++++++++++++++++++
>  Documentation/gpu/rfc/index.rst        |   4 +
>  2 files changed, 214 insertions(+)
>  create mode 100644 Documentation/gpu/rfc/i915_vm_bind.rst
>
> diff --git a/Documentation/gpu/rfc/i915_vm_bind.rst b/Documentation/gpu/rfc/i915_vm_bind.rst
> new file mode 100644
> index 000000000000..cdc6bb25b942
> --- /dev/null
> +++ b/Documentation/gpu/rfc/i915_vm_bind.rst
> @@ -0,0 +1,210 @@
> +==========================================
> +I915 VM_BIND feature design and use cases
> +==========================================
> +
> +VM_BIND feature
> +================
> +DRM_I915_GEM_VM_BIND/UNBIND ioctls allow UMDs to bind/unbind GEM buffer
> +objects (BOs) or sections of a BO at specified GPU virtual addresses on
> +a specified address space (VM).
> +
> +These mappings (also referred to as persistent mappings) will be persistent
> +across multiple GPU submissions (execbuff) issued by the UMD, without the
> +user having to provide a list of all required mappings during each submission
> +(as required by the older execbuff mode).
> +
> +The VM_BIND ioctl defers binding the mappings until the next execbuff
> +submission where they will be required, or binds immediately if the
> +I915_GEM_VM_BIND_IMMEDIATE flag is set (useful if a mapping is required for
> +an active context).

So this is a screw-up I've done, and for upstream I think we need to fix
it: Implicit sync is bad, and it's also still a bad idea for vm_bind, and
I was wrong suggesting we should do this a few years back when we kicked
this off internally :-(

What I think we need is just always VM_BIND_IMMEDIATE mode, and then a few
things on top:
- in and out fences, like with execbuf, to allow userspace to sync with
  execbuf as needed
- for compute-mode context this means userspace memory fences
- for legacy context this means a timeline syncobj in drm_syncobj

No sync_file or anything else like this at all. This means a bunch of
work, but also it'll have benefits because it means we should be able to
use exactly the same code paths and logic for both compute and for legacy
context, because drm_syncobj supports future fence semantics.

Also on the implementation side we still need to install dma_fence to the
various dma_resv, and for this we need the new dma_resv_usage series from
Christian König first. vm_bind fences can then use the USAGE_BOOKKEEPING
flag to make sure they never result in an oversync issue with execbuf. I
don't think trying to land vm_bind without that prep work in
dma_resv_usage makes sense.

Also as soon as dma_resv_usage has landed there's a few cleanups we should
do in i915:
- ttm bo moving code should probably simplify a bit (and maybe more of the
  code should be pushed as helpers into ttm)
- clflush code should be moved over to using USAGE_KERNEL and the various
  hacks and special cases should be ditched. See df94fd05e69e ("drm/i915:
  expand on the kernel-doc for cache_dirty") for a bit more context

This is still not yet enough, since if a vm_bind races with an eviction we
might stall on the new buffers being readied first before the context can
continue. This needs some care to make sure that vma which aren't fully
bound yet are on a separate list, and vma which are marked for unbinding
are removed from the main working set list as soon as possible.

All of these things are relevant for the uapi semantics, which means
- they need to be documented in the uapi kerneldoc, ideally with example
  flows
- umd need to ack this

The other thing here is the async/nonblocking path. I think we still need
that one, but again it should not sync with anything going on in execbuf,
but simply execute the ioctl code in a kernel thread. The idea here is
that this works like a special gpu engine, so that compute and vk can
schedule bindings interleaved with rendering. This should be enough to get
a performant vk sparse binding/textures implementation.

But I'm not entirely sure on this one, so this definitely needs acks from
umds.

> +VM_BIND feature is advertised to user via I915_PARAM_HAS_VM_BIND.
> +User has to opt-in for VM_BIND mode of binding for an address space (VM)
> +during VM creation time via I915_VM_CREATE_FLAGS_USE_VM_BIND extension.
> +A VM in VM_BIND mode will not support older execbuff mode of binding.
> +
> +UMDs can still send BOs of these persistent mappings in execlist of execbuff
> +for specifying BO dependencies (implicit fencing) and to use BO as a batch,
> +but those BOs should be mapped ahead via vm_bind ioctl.

should or must?

Also I'm not really sure that's a great interface. The batchbuffer really
only needs to be an address, so maybe all we need is an extension to
supply an u64 batchbuffer address instead of trying to retrofit this into
an unfitting current uapi.

And for implicit sync there's two things:
- for vk I think the right uapi is the dma-buf fence import/export ioctls
  from Jason Ekstrand. I think we should land that first instead of
  hacking funny concepts together
- for gl the dma-buf import/export might not be fast enough, since gl
  needs to do a _lot_ of implicit sync. There we might need to use the
  execbuffer buffer list, but then we should have extremely clear uapi
  rules which disallow _everything_ except setting the explicit sync uapi

Again all this stuff needs to be documented in detail in the kerneldoc
uapi spec.

> +VM_BIND features include,
> +- Multiple Virtual Address (VA) mappings can map to the same physical pages
> +  of an object (aliasing).
> +- VA mapping can map to a partial section of the BO (partial binding).
> +- Support capture of persistent mappings in the dump upon GPU error.
> +- TLB is flushed upon unbind completion. Batching of TLB flushes in some
> +  usecases will be helpful.
> +- Asynchronous vm_bind and vm_unbind support.
> +- VM_BIND uses user/memory fence mechanism for signaling bind completion
> +  and for signaling batch completion in long running contexts (explained
> +  below).

This should all be in the kerneldoc.

> +VM_PRIVATE objects
> +------------------
> +By default, BOs can be mapped on multiple VMs and can also be dma-buf
> +exported. Hence these BOs are referred to as Shared BOs.
> +During each execbuff submission, the request fence must be added to the
> +dma-resv fence list of all shared BOs mapped on the VM.
> +
> +VM_BIND feature introduces an optimization where user can create BO which
> +is private to a specified VM via I915_GEM_CREATE_EXT_VM_PRIVATE flag during
> +BO creation. Unlike Shared BOs, these VM private BOs can only be mapped on
> +the VM they are private to and can't be dma-buf exported.
> +All private BOs of a VM share the dma-resv object. Hence during each execbuff
> +submission, they need only one dma-resv fence list updated. Thus the fast
> +path (where required mappings are already bound) submission latency is O(1)
> +w.r.t the number of VM private BOs.

Two things:

- I think the above is required to for initial vm_bind for vk, it kinda
  doesn't make much sense without that, and will allow us to match amdgpu
  and radeonsi

- Christian König just landed ttm bulk lru helpers, and I think we need to
  use those. This means vm_bind will only work with the ttm backend, but
  that's what we have for the big dgpu where vm_bind helps more in terms
  of performance, and the igfx conversion to ttm is already going on.

Furthermore the i915 shrinker lru has stopped being an lru, so I think
that should also be moved over to the ttm lru in some fashion to make sure
we once again have a reasonable and consistent memory aging and reclaim
architecture. The current code is just too much of a complete mess.

And since this is all fairly integral to how the code arch works I don't
think merging a different version which isn't based on ttm bulk lru
helpers makes sense.

Also I do think the page table lru handling needs to be included here,
because that's another complete hand-rolled separate world for not much
good reasons. I guess that can happen in parallel with the initial vm_bind
bring-up, but it needs to be completed by the time we add the features
beyond the initial support needed for vk.

> +VM_BIND locking hirarchy
> +-------------------------
> +VM_BIND locking order is as below.
> +
> +1) A vm_bind mutex will protect vm_bind lists. This lock is taken in vm_bind/
> +   vm_unbind ioctl calls, in the execbuff path and while releasing the mapping.
> +
> +   In future, when GPU page faults are supported, we can potentially use a
> +   rwsem instead, so that multiple pagefault handlers can take the read side
> +   lock to lookup the mapping and hence can run in parallel.
> +
> +2) The BO's dma-resv lock will protect i915_vma state and needs to be held
> +   while binding a vma and while updating dma-resv fence list of a BO.
> +   The private BOs of a VM will all share a dma-resv object.
> +
> +   This lock is held in vm_bind call for immediate binding, during vm_unbind
> +   call for unbinding and during execbuff path for binding the mapping and
> +   updating the dma-resv fence list of the BO.
> +
> +3) Spinlock/s to protect some of the VM's lists.
> +
> +We will also need support for bluk LRU movement of persistent mapping to
> +avoid additional latencies in execbuff path.

This needs more detail and explanation of how each level is required. Also
the shared dma_resv for VM_PRIVATE objects is kinda important to explain.

Like "some of the VM's lists" explains pretty much nothing.

> +
> +GPU page faults
> +----------------
> +Both older execbuff mode and the newer VM_BIND mode of binding will require
> +using dma-fence to ensure residency.
> +In future when GPU page faults are supported, no dma-fence usage is required
> +as residency is purely managed by installing and removing/invalidating ptes.

This is a bit confusing. I think one part of this should be moved into the
section with future vm_bind use-cases (we're not going to support page
faults with legacy softpin or even worse, relocations). The locking
discussion should be part of the much longer list of uses cases that
motivate the locking design.

> +
> +
> +User/Memory Fence
> +==================
> +The idea is to take a user specified virtual address and install an interrupt
> +handler to wake up the current task when the memory location passes the user
> +supplied filter.
> +
> +User/Memory fence is a <address, value> pair. To signal the user fence,
> +specified value will be written at the specified virtual address and
> +wakeup the waiting process. User can wait on an user fence with the
> +gem_wait_user_fence ioctl.
> +
> +It also allows the user to emit their own MI_FLUSH/PIPE_CONTROL notify
> +interrupt within their batches after updating the value to have sub-batch
> +precision on the wakeup. Each batch can signal an user fence to indicate
> +the completion of next level batch. The completion of very first level batch
> +needs to be signaled by the command streamer. The user must provide the
> +user/memory fence for this via the DRM_I915_GEM_EXECBUFFER_EXT_USER_FENCE
> +extension of execbuff ioctl, so that KMD can setup the command streamer to
> +signal it.
> +
> +User/Memory fence can also be supplied to the kernel driver to signal/wake up
> +the user process after completion of an asynchronous operation.
> +
> +When VM_BIND ioctl was provided with a user/memory fence via the
> +I915_VM_BIND_EXT_USER_FENCE extension, it will be signaled upon the completion
> +of binding of that mapping. All async binds/unbinds are serialized, hence
> +signaling of user/memory fence also indicate the completion of all previous
> +binds/unbinds.
> +
> +This feature will be derived from the below original work:
> +https://patchwork.freedesktop.org/patch/349417/

This is 1:1 tied to long running compute mode contexts (which in the uapi
doc must reference the endless amounts of bikeshed summary we have in the
docs about indefinite fences).

I'd put this into a new section about compute and userspace memory fences
support, with this and the next chapter ...
> +
> +
> +VM_BIND use cases
> +==================

... and then make this section here focus entirely on additional vm_bind
use-cases that we'll be adding later on. Which doesn't need to go into any
details, it's just justification for why we want to build the world on top
of vm_bind.

> +
> +Long running Compute contexts
> +------------------------------
> +Usage of dma-fence expects that they complete in reasonable amount of time.
> +Compute on the other hand can be long running. Hence it is appropriate for
> +compute to use user/memory fence and dma-fence usage will be limited to
> +in-kernel consumption only. This requires an execbuff uapi extension to pass
> +in user fence. Compute must opt-in for this mechanism with
> +I915_CONTEXT_CREATE_FLAGS_LONG_RUNNING flag during context creation.
> +
> +The dma-fence based user interfaces like gem_wait ioctl, execbuff out fence
> +and implicit dependency setting is not allowed on long running contexts.
> +
> +Where GPU page faults are not available, kernel driver upon buffer invalidation
> +will initiate a suspend (preemption) of long running context with a dma-fence
> +attached to it. And upon completion of that suspend fence, finish the
> +invalidation, revalidate the BO and then resume the compute context. This is
> +done by having a per-context fence (called suspend fence) proxying as
> +i915_request fence. This suspend fence is enabled when there is a wait on it,
> +which triggers the context preemption.
> +
> +This is much easier to support with VM_BIND compared to the current heavier
> +execbuff path resource attachment.

There's a bunch of tricky code around compute mode context support, like
the preempt ctx fence (or suspend fence or whatever you want to call it),
and the resume work. And I think that code should be shared across
drivers.

I think the right place to put this is into drm/sched, somewhere attached
to the drm_sched_entity structure. I expect i915 folks to collaborate with
amd and ideally also get amdkfd to adopt the same thing if possible. At
least Christian has mentioned in the past that he's a bit unhappy about
how this works.

Also drm/sched has dependency tracking, which will be needed to pipeline
context resume operations. That needs to be used instead of i915-gem
inventing yet another dependency tracking data structure (it already has 3
and that's roughly 3 too many).

This means compute mode support and userspace memory fences are blocked on
the drm/sched conversion, but *eh* add it to the list of reasons for why
drm/sched needs to happen.

Also since we only have support for compute mode ctx in our internal tree
with the guc scheduler backend anyway, and the first conversion target is
the guc backend, I don't think this actually holds up a lot of the code.

> +Low Latency Submission
> +-----------------------
> +Allows compute UMD to directly submit GPU jobs instead of through execbuff
> +ioctl. VM_BIND allows map/unmap of BOs required for directly submitted jobs.

This is really just a special case of compute mode contexts, I think I'd
include that in there, but explain better what it requires (i.e. vm_bind
not being synchronized against execbuf).

> +
> +Debugger
> +---------
> +With the debug event interface, a user space process (the debugger) is able
> +to keep track of and act upon resources created by another process (the
> +debuggee) and attached to the GPU via the vm_bind interface.
> +
> +Mesa/Vulkan
> +------------
> +VM_BIND can potentially reduce the CPU overhead in Mesa, thus improving
> +performance. For Vulkan it should be straightforward to use VM_BIND.
> +For Iris implicit buffer tracking must be implemented before we can harness
> +VM_BIND benefits. With increasing GPU hardware performance reducing CPU
> +overhead becomes more important.

Just to clarify, I don't think we can land vm_bind into upstream if it
doesn't work 100% for vk. There's a bit much "can" instead of "will" in
this section.

> +
> +Page level hints settings
> +--------------------------
> +VM_BIND allows any hints setting per mapping instead of per BO.
> +Possible hints include read-only, placement and atomicity.
> +Sub-BO level placement hint will be even more relevant with
> +upcoming GPU on-demand page fault support.
> +
> +Page level Cache/CLOS settings
> +-------------------------------
> +VM_BIND allows cache/CLOS settings per mapping instead of per BO.
> +
> +Shared Virtual Memory (SVM) support
> +------------------------------------
> +VM_BIND interface can be used to map system memory directly (without gem BO
> +abstraction) using the HMM interface.

Userptr is absent here (and it's not the same as svm, at least on
discrete), and this is needed for the initial version since otherwise vk
can't use it because we're not at feature parity.

IRC discussions with Maarten and Dave came up with the idea that maybe
userptr for vm_bind should work _without_ any gem bo as backing storage,
since that guarantees that people don't come up with funny ideas like
trying to share such a bo across processes or mmap it and other nonsense which
just doesn't work.

> +
> +
> +Broader i915 cleanups
> +=====================
> +Supporting this whole new vm_bind mode of binding which comes with its own
> +usecases to support and the locking requirements requires proper integration
> +with the existing i915 driver. This calls for some broader i915 driver
> +cleanups/simplifications for maintainability of the driver going forward.
> +Here are a few things that have been identified and are being looked into.
> +
> +- Make pagetable allocations evictable and manage them similar to VM_BIND
> +  mapped objects. Page table pages are similar to persistent mappings of a
> +  VM (difference here are that the page table pages will not
> +  have an i915_vma structure and after swapping pages back in, parent page
> +  link needs to be updated).

See above, but I think this should be included as part of the initial
vm_bind push.

> +- Remove vma lookup cache (eb->gem_context->handles_vma). The VM_BIND feature
> +  does not use it, and the complexity it brings in is probably more than the
> +  performance advantage we get in the legacy execbuff case.
> +- Remove vma->open_count counting
> +- Remove i915_vma active reference tracking. Instead use underlying BO's
> +  dma-resv fence list to determine if a i915_vma is active or not.

So this is a complete mess, and really should not exist. I think it needs
to be removed before we try to make i915_vma even more complex by adding
vm_bind.

The other thing I've been pondering here is that vm_bind is really
completely different from legacy vm structures for a lot of reasons:
- no relocation or softpin handling, which means vm_bind has no reason to
  ever look at the i915_vma structure in execbuf code. Unfortunately
  execbuf has been rewritten to be vma instead of obj centric, so it's a
  100% mismatch

- vm_bind never has to manage any vm lru. Legacy execbuf has to maintain
  that because the kernel manages the virtual address space fully. Again
  ideally that entire vma_move_to_active code and everything related to it
  would simply not exist.

- similar on the eviction side, the rules are quite different: For vm_bind
  we never tear down the vma, instead it's just moved to the list of
  evicted vma. Legacy vm have no need for all these additional lists, so
  another huge confusion.

- if the refcount is done correctly for vm_bind we wouldn't need the
  tricky code in the bo close paths. Unfortunately legacy vm with
  relocations and softpin require that vma are only a weak reference, so
  that cannot be removed.

- there's also a ton of special cases for ggtt handling, like the
  different views (for display, partial views for mmap), but also the
  gen2/3 alignment and padding requirements which vm_bind never needs.

I think the right thing here is to massively split the implementation
behind some solid vm/vma abstraction, with a base class for vm and vma
which _only_ has the pieces which both vm_bind and the legacy vm stuff
needs. But it's a bit tricky to get there. I think a workable path would
be:
- Add a new base class to both i915_address_space and i915_vma, which
  starts out empty.

- As vm_bind code lands, move things that vm_bind code needs into these
  base classes

- The goal should be that these base classes are a stand-alone library
  that other drivers could reuse. Like we've done with the buddy
  allocator, which first moved from i915-gem to i915-ttm, and which amd
  now moved to drm/ttm for reuse by amdgpu. Ideally other drivers
  interested in adding something like vm_bind should be involved from the
  start (or maybe the entire thing reused in amdgpu, they're looking at
  vk sparse binding support too or at least have perf issues I think).

- Locking must be the same across all implementations, otherwise it's
  really not an abstraction. i915 screwed this up terribly by having
  different locking rules for ppgtt and ggtt, which is just nonsense.

- The legacy specific code needs to be extracted as much as possible and
  shoved into separate files. In execbuf this means we need to get back to
  object centric flow, and the slowpaths need to become a lot simpler
  again (Maarten has cleaned up some of this, but there's still a silly
  amount of hacks in there with funny layering).

- I think if stuff like the vma eviction details (list movement and
  locking and refcounting of the underlying object)

> +
> +These can be worked upon after initial vm_bind support is added.

I don't think that works, given how badly i915-gem team screwed up in
other places. And those places had to be fixed by adopting shared code
like ttm. Plus there's already a huge unfulfilled promise pending with the
drm/sched conversion, i915-gem team is clearly deeply in the red here :-/

Cheers, Daniel

> +
> +
> +UAPI
> +=====
> +Uapi definition can be found here:
> +.. kernel-doc:: Documentation/gpu/rfc/i915_vm_bind.h
> diff --git a/Documentation/gpu/rfc/index.rst b/Documentation/gpu/rfc/index.rst
> index 91e93a705230..7d10c36b268d 100644
> --- a/Documentation/gpu/rfc/index.rst
> +++ b/Documentation/gpu/rfc/index.rst
> @@ -23,3 +23,7 @@ host such documentation:
>  .. toctree::
>
>      i915_scheduler.rst
> +
> +.. toctree::
> +
> +    i915_vm_bind.rst
> --
> 2.21.0.rc0.32.g243a4c7e27
>

--
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC v2 1/2] drm/doc/rfc: VM_BIND feature design document
  2022-03-31  8:28     ` Daniel Vetter
@ 2022-03-31 11:37       ` Daniel Vetter
  -1 siblings, 0 replies; 31+ messages in thread
From: Daniel Vetter @ 2022-03-31 11:37 UTC (permalink / raw)
  To: Niranjana Vishwanathapura, Jason Ekstrand, Bloomfield, Jon,
	Dave Airlie, Ben Skeggs, Christian König, Daniel Stone
  Cc: daniel.vetter, intel-gfx, thomas.hellstrom, chris.p.wilson, dri-devel

One thing I've forgotten, since it's only hinted at here: If/when we
switch tlb flushing from the current dumb&synchronous implementation
we now have in i915 in upstream to one with batching using dma_fence,
then I think that should be something which is done with a small
helper library of shared code too. The batching is somewhat tricky,
and you need to make sure you put the fence into the right
dma_resv_usage slot, and the trick with replace the vm fence with a
tlb flush fence is also a good reason to share the code so we only
have it in one place.

Christian's recent work also has some prep work for this already with
the fence replacing trick.
-Daniel

On Thu, 31 Mar 2022 at 10:28, Daniel Vetter <daniel@ffwll.ch> wrote:
> Adding a pile of people who've expressed interest in vm_bind for their
> drivers.
>
> Also note to the intel folks: This is largely written with me having my
> subsystem co-maintainer hat on, i.e. what I think is the right thing to do
> here for the subsystem at large. There is substantial rework involved
here, but it's not any different from i915 adopting ttm or i915 adopting
> drm/sched, and I do think this stuff needs to happen in one form or
> another.
>
> On Mon, Mar 07, 2022 at 12:31:45PM -0800, Niranjana Vishwanathapura wrote:
> > VM_BIND design document with description of intended use cases.
> >
> > Signed-off-by: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>
> > ---
> >  Documentation/gpu/rfc/i915_vm_bind.rst | 210 +++++++++++++++++++++++++
> >  Documentation/gpu/rfc/index.rst        |   4 +
> >  2 files changed, 214 insertions(+)
> >  create mode 100644 Documentation/gpu/rfc/i915_vm_bind.rst
> >
> > diff --git a/Documentation/gpu/rfc/i915_vm_bind.rst b/Documentation/gpu/rfc/i915_vm_bind.rst
> > new file mode 100644
> > index 000000000000..cdc6bb25b942
> > --- /dev/null
> > +++ b/Documentation/gpu/rfc/i915_vm_bind.rst
> > @@ -0,0 +1,210 @@
> > +==========================================
> > +I915 VM_BIND feature design and use cases
> > +==========================================
> > +
> > +VM_BIND feature
> > +================
> > +DRM_I915_GEM_VM_BIND/UNBIND ioctls allows UMD to bind/unbind GEM buffer
> > +objects (BOs) or sections of a BOs at specified GPU virtual addresses on
> > +a specified address space (VM).
> > +
> > +These mappings (also referred to as persistent mappings) will be persistent
> > +across multiple GPU submissions (execbuff) issued by the UMD, without user
> > +having to provide a list of all required mappings during each submission
> > +(as required by older execbuff mode).
> > +
> > +The VM_BIND ioctl defers binding the mappings until the next execbuff submission
> > +where they will be required, or binds immediately if the I915_GEM_VM_BIND_IMMEDIATE
> > +flag is set (useful if mapping is required for an active context).
>
> So this is a screw-up I've done, and for upstream I think we need to fix
> it: Implicit sync is bad, and it's also still a bad idea for vm_bind, and
> I was wrong suggesting we should do this a few years back when we kicked
> this off internally :-(
>
> What I think we need is just always VM_BIND_IMMEDIATE mode, and then a few
> things on top:
> - in and out fences, like with execbuf, to allow userspace to sync with
>   execbuf as needed
> - for compute-mode context this means userspace memory fences
> - for legacy context this means a timeline syncobj in drm_syncobj
>
> No sync_file or anything else like this at all. This means a bunch of
> work, but also it'll have benefits because it means we should be able to
> use exactly the same code paths and logic for both compute and for legacy
> context, because drm_syncobj support future fence semantics.
>
> Also on the implementation side we still need to install dma_fence to the
> various dma_resv, and for this we need the new dma_resv_usage series from
> Christian König first. vm_bind fences can then use the USAGE_BOOKKEEPING
> flag to make sure they never result in an oversync issue with execbuf. I
> don't think trying to land vm_bind without that prep work in
> dma_resv_usage makes sense.
>
> Also as soon as dma_resv_usage has landed there's a few cleanups we should
> do in i915:
> - ttm bo moving code should probably simplify a bit (and maybe more of the
>   code should be pushed as helpers into ttm)
> - clflush code should be moved over to using USAGE_KERNEL and the various
>   hacks and special cases should be ditched. See df94fd05e69e ("drm/i915:
>   expand on the kernel-doc for cache_dirty") for a bit more context
>
> This is still not yet enough, since if a vm_bind races with an eviction we
> might stall on the new buffers being readied first before the context can
> continue. This needs some care to make sure that vma which aren't fully
> bound yet are on a separate list, and vma which are marked for unbinding
> are removed from the main working set list as soon as possible.
>
> All of these things are relevant for the uapi semantics, which means
> - they need to be documented in the uapi kerneldoc, ideally with example
>   flows
> - umd need to ack this
>
> The other thing here is the async/nonblocking path. I think we still need
> that one, but again it should not sync with anything going on in execbuf,
> but simply execute the ioctl code in a kernel thread. The idea here is
> that this works like a special gpu engine, so that compute and vk can
> schedule bindings interleaved with rendering. This should be enough to get
> a performant vk sparse binding/textures implementation.
>
> But I'm not entirely sure on this one, so this definitely needs acks from
> umds.
>
> > +VM_BIND feature is advertised to user via I915_PARAM_HAS_VM_BIND.
> > +User has to opt-in for VM_BIND mode of binding for an address space (VM)
> > +during VM creation time via I915_VM_CREATE_FLAGS_USE_VM_BIND extension.
> > +A VM in VM_BIND mode will not support older execbuff mode of binding.
> > +
> > +UMDs can still send BOs of these persistent mappings in the execlist of
> > +execbuff for specifying BO dependencies (implicit fencing) and to use a BO
> > +as a batch, but those BOs should be mapped ahead of time via the vm_bind ioctl.
>
> should or must?
>
> Also I'm not really sure that's a great interface. The batchbuffer really
> only needs to be an address, so maybe all we need is an extension to
> supply an u64 batchbuffer address instead of trying to retrofit this into
> an unfitting current uapi.
>
> And for implicit sync there's two things:
> - for vk I think the right uapi is the dma-buf fence import/export ioctls
>   from Jason Ekstrand. I think we should land that first instead of
>   hacking funny concepts together
> - for gl the dma-buf import/export might not be fast enough, since gl
>   needs to do a _lot_ of implicit sync. There we might need to use the
>   execbuffer buffer list, but then we should have extremely clear uapi
>   rules which disallow _everything_ except setting the explicit sync uapi
>
> Again all this stuff needs to be documented in detail in the kerneldoc
> uapi spec.
>
> > +VM_BIND features include,
> > +- Multiple Virtual Address (VA) mappings can map to the same physical pages
> > +  of an object (aliasing).
> > +- VA mapping can map to a partial section of the BO (partial binding).
> > +- Support capture of persistent mappings in the dump upon GPU error.
> > +- TLB is flushed upon unbind completion. Batching of TLB flushes in some
> > +  usecases will be helpful.
> > +- Asynchronous vm_bind and vm_unbind support.
> > +- VM_BIND uses user/memory fence mechanism for signaling bind completion
> > +  and for signaling batch completion in long running contexts (explained
> > +  below).
>
> This should all be in the kerneldoc.
>
> > +VM_PRIVATE objects
> > +------------------
> > +By default, BOs can be mapped on multiple VMs and can also be dma-buf
> > +exported. Hence these BOs are referred to as Shared BOs.
> > +During each execbuff submission, the request fence must be added to the
> > +dma-resv fence list of all shared BOs mapped on the VM.
> > +
> > +VM_BIND feature introduces an optimization where user can create BO which
> > +is private to a specified VM via I915_GEM_CREATE_EXT_VM_PRIVATE flag during
> > +BO creation. Unlike Shared BOs, these VM private BOs can only be mapped on
> > +the VM they are private to and can't be dma-buf exported.
> > +All private BOs of a VM share the dma-resv object. Hence during each execbuff
> > +submission, they need only one dma-resv fence list updated. Thus the fast
> > +path (where required mappings are already bound) submission latency is O(1)
> > +w.r.t the number of VM private BOs.
>
> Two things:
>
> - I think the above is required to for initial vm_bind for vk, it kinda
>   doesn't make much sense without that, and will allow us to match amdgpu
>   and radeonsi
>
> - Christian König just landed ttm bulk lru helpers, and I think we need to
>   use those. This means vm_bind will only work with the ttm backend, but
>   that's what we have for the big dgpu where vm_bind helps more in terms
>   of performance, and the igfx conversion to ttm is already going on.
>
> Furthermore the i915 shrinker lru has stopped being an lru, so I think
> that should also be moved over to the ttm lru in some fashion to make sure
> we once again have a reasonable and consistent memory aging and reclaim
> architecture. The current code is just too much of a complete mess.
>
> And since this is all fairly integral to how the code arch works I don't
> think merging a different version which isn't based on ttm bulk lru
> helpers makes sense.
>
> Also I do think the page table lru handling needs to be included here,
> because that's another complete hand-rolled separate world for not much
> good reasons. I guess that can happen in parallel with the initial vm_bind
> bring-up, but it needs to be completed by the time we add the features
> beyond the initial support needed for vk.
>
> > +VM_BIND locking hierarchy
> > +-------------------------
> > +VM_BIND locking order is as below.
> > +
> > +1) A vm_bind mutex will protect vm_bind lists. This lock is taken in vm_bind/
> > +   vm_unbind ioctl calls, in the execbuff path and while releasing the mapping.
> > +
> > +   In future, when GPU page faults are supported, we can potentially use a
> > +   rwsem instead, so that multiple pagefault handlers can take the read side
> > +   lock to lookup the mapping and hence can run in parallel.
> > +
> > +2) The BO's dma-resv lock will protect i915_vma state and needs to be held
> > +   while binding a vma and while updating dma-resv fence list of a BO.
> > +   The private BOs of a VM will all share a dma-resv object.
> > +
> > +   This lock is held in vm_bind call for immediate binding, during vm_unbind
> > +   call for unbinding and during execbuff path for binding the mapping and
> > +   updating the dma-resv fence list of the BO.
> > +
> > +3) Spinlock/s to protect some of the VM's lists.
> > +
> > +We will also need support for bulk LRU movement of persistent mappings to
> > +avoid additional latencies in the execbuff path.
>
> This needs more detail and explanation of how each level is required. Also
> the shared dma_resv for VM_PRIVATE objects is kinda important to explain.
>
> Like "some of the VM's lists" explains pretty much nothing.
>
> > +
> > +GPU page faults
> > +----------------
> > +Both older execbuff mode and the newer VM_BIND mode of binding will require
> > +using dma-fence to ensure residency.
> > +In future when GPU page faults are supported, no dma-fence usage is required
> > +as residency is purely managed by installing and removing/invalidating ptes.
>
> This is a bit confusing. I think one part of this should be moved into the
> section with future vm_bind use-cases (we're not going to support page
> faults with legacy softpin or even worse, relocations). The locking
> discussion should be part of the much longer list of uses cases that
> motivate the locking design.
>
> > +
> > +
> > +User/Memory Fence
> > +==================
> > +The idea is to take a user specified virtual address and install an interrupt
> > +handler to wake up the current task when the memory location passes the user
> > +supplied filter.
> > +
> > +User/Memory fence is a <address, value> pair. To signal the user fence,
> > +specified value will be written at the specified virtual address and the
> > +waiting process will be woken up. User can wait on a user fence with the
> > +gem_wait_user_fence ioctl.
> > +
> > +It also allows the user to emit their own MI_FLUSH/PIPE_CONTROL notify
> > +interrupt within their batches after updating the value to have sub-batch
> > +precision on the wakeup. Each batch can signal a user fence to indicate
> > +the completion of the next level batch. The completion of the very first
> > +level batch
> > +needs to be signaled by the command streamer. The user must provide the
> > +user/memory fence for this via the DRM_I915_GEM_EXECBUFFER_EXT_USER_FENCE
> > +extension of execbuff ioctl, so that KMD can setup the command streamer to
> > +signal it.
> > +
> > +User/Memory fence can also be supplied to the kernel driver to signal/wake up
> > +the user process after completion of an asynchronous operation.
> > +
> > +When VM_BIND ioctl was provided with a user/memory fence via the
> > +I915_VM_BIND_EXT_USER_FENCE extension, it will be signaled upon the completion
> > +of binding of that mapping. All async binds/unbinds are serialized, hence
> > +signaling of a user/memory fence also indicates the completion of all previous
> > +binds/unbinds.
> > +
> > +This feature will be derived from the below original work:
> > +https://patchwork.freedesktop.org/patch/349417/
>
> This is 1:1 tied to long running compute mode contexts (which in the uapi
> doc must reference the endless amounts of bikeshed summary we have in the
> docs about indefinite fences).
>
> I'd put this into a new section about compute and userspace memory fences
> support, with this and the next chapter ...
> > +
> > +
> > +VM_BIND use cases
> > +==================
>
> ... and then make this section here focus entirely on additional vm_bind
> use-cases that we'll be adding later on. Which doesn't need to go into any
> details, it's just justification for why we want to build the world on top
> of vm_bind.
>
> > +
> > +Long running Compute contexts
> > +------------------------------
> > +Usage of dma-fences expects that they complete in a reasonable amount of time.
> > +Compute on the other hand can be long running. Hence it is appropriate for
> > +compute to use user/memory fence and dma-fence usage will be limited to
> > +in-kernel consumption only. This requires an execbuff uapi extension to pass
> > +in user fence. Compute must opt-in for this mechanism with
> > +I915_CONTEXT_CREATE_FLAGS_LONG_RUNNING flag during context creation.
> > +
> > +The dma-fence based user interfaces like gem_wait ioctl, execbuff out fence
> > +and implicit dependency setting are not allowed on long running contexts.
> > +
> > +Where GPU page faults are not available, kernel driver upon buffer invalidation
> > +will initiate a suspend (preemption) of long running context with a dma-fence
> > +attached to it. Upon completion of that suspend fence, it finishes the
> > +invalidation, revalidates the BO and then resumes the compute context. This is
> > +done by having a per-context fence (called suspend fence) proxying as
> > +i915_request fence. This suspend fence is enabled when there is a wait on it,
> > +which triggers the context preemption.
> > +
> > +This is much easier to support with VM_BIND compared to the current heavier
> > +execbuff path resource attachment.
>
> There's a bunch of tricky code around compute mode context support, like
> the preempt ctx fence (or suspend fence or whatever you want to call it),
> and the resume work. And I think that code should be shared across
> drivers.
>
> I think the right place to put this is into drm/sched, somewhere attached
> to the drm_sched_entity structure. I expect i915 folks to collaborate with
> amd and ideally also get amdkfd to adopt the same thing if possible. At
> least Christian has mentioned in the past that he's a bit unhappy about
> how this works.
>
> Also drm/sched has dependency tracking, which will be needed to pipeline
> context resume operations. That needs to be used instead of i915-gem
> inventing yet another dependency tracking data structure (it already has 3
> and that's roughly 3 too many).
>
> This means compute mode support and userspace memory fences are blocked on
> the drm/sched conversion, but *eh* add it to the list of reasons for why
> drm/sched needs to happen.
>
> Also since we only have support for compute mode ctx in our internal tree
> with the guc scheduler backend anyway, and the first conversion target is
> the guc backend, I don't think this actually holds up a lot of the code.
>
> > +Low Latency Submission
> > +-----------------------
> > +Allows compute UMD to directly submit GPU jobs instead of through execbuff
> > +ioctl. VM_BIND allows map/unmap of BOs required for directly submitted jobs.
>
> This is really just a special case of compute mode contexts, I think I'd
> include that in there, but explain better what it requires (i.e. vm_bind
> not being synchronized against execbuf).
>
> > +
> > +Debugger
> > +---------
> > +With the debug event interface, a user space process (the debugger) is able
> > +to keep track of and act upon resources created by another process (the
> > +debuggee) and attached to the GPU via the vm_bind interface.
> > +
> > +Mesa/Vulkan
> > +------------
> > +VM_BIND can potentially reduce the CPU overhead in Mesa, thus improving
> > +performance. For Vulkan it should be straightforward to use VM_BIND.
> > +For Iris implicit buffer tracking must be implemented before we can harness
> > +VM_BIND benefits. With increasing GPU hardware performance reducing CPU
> > +overhead becomes more important.
>
> Just to clarify, I don't think we can land vm_bind into upstream if it
> doesn't work 100% for vk. There's a bit much "can" instead of "will" in
> this section.
>
> > +
> > +Page level hints settings
> > +--------------------------
> > +VM_BIND allows any hints setting per mapping instead of per BO.
> > +Possible hints include read-only, placement and atomicity.
> > +Sub-BO level placement hint will be even more relevant with
> > +upcoming GPU on-demand page fault support.
> > +
> > +Page level Cache/CLOS settings
> > +-------------------------------
> > +VM_BIND allows cache/CLOS settings per mapping instead of per BO.
> > +
> > +Shared Virtual Memory (SVM) support
> > +------------------------------------
> > +VM_BIND interface can be used to map system memory directly (without gem BO
> > +abstraction) using the HMM interface.
>
> Userptr is absent here (and it's not the same as svm, at least on
> discrete), and this is needed for the initial version since otherwise vk
> can't use it because we're not at feature parity.
>
> IRC discussions with Maarten and Dave came up with the idea that maybe
> userptr for vm_bind should work _without_ any gem bo as backing storage,
> since that guarantees that people don't come up with funny ideas like
> trying to share such a bo across processes or mmap it and other nonsense which
> just doesn't work.
>
> > +
> > +
> > +Broader i915 cleanups
> > +=====================
> > +Supporting this whole new vm_bind mode of binding which comes with its own
> > +usecases to support and the locking requirements requires proper integration
> > +with the existing i915 driver. This calls for some broader i915 driver
> > +cleanups/simplifications for maintainability of the driver going forward.
> > +Here are a few things that have been identified and are being looked into.
> > +
> > +- Make pagetable allocations evictable and manage them similar to VM_BIND
> > +  mapped objects. Page table pages are similar to persistent mappings of a
> > +  VM (difference here are that the page table pages will not
> > +  have an i915_vma structure and after swapping pages back in, parent page
> > +  link needs to be updated).
>
> See above, but I think this should be included as part of the initial
> vm_bind push.
>
> > +- Remove vma lookup cache (eb->gem_context->handles_vma). The VM_BIND feature
> > +  does not use it, and the complexity it brings in is probably more than the
> > +  performance advantage we get in the legacy execbuff case.
> > +- Remove vma->open_count counting
> > +- Remove i915_vma active reference tracking. Instead use underlying BO's
> > +  dma-resv fence list to determine if a i915_vma is active or not.
>
> So this is a complete mess, and really should not exist. I think it needs
> to be removed before we try to make i915_vma even more complex by adding
> vm_bind.
>
> The other thing I've been pondering here is that vm_bind is really
> completely different from legacy vm structures for a lot of reasons:
> - no relocation or softpin handling, which means vm_bind has no reason to
>   ever look at the i915_vma structure in execbuf code. Unfortunately
>   execbuf has been rewritten to be vma instead of obj centric, so it's a
>   100% mismatch
>
> - vm_bind never has to manage any vm lru. Legacy execbuf has to maintain
>   that because the kernel manages the virtual address space fully. Again
>   ideally that entire vma_move_to_active code and everything related to it
>   would simply not exist.
>
> - similar on the eviction side, the rules are quite different: For vm_bind
>   we never tear down the vma, instead it's just moved to the list of
>   evicted vma. Legacy vm have no need for all these additional lists, so
>   another huge confusion.
>
> - if the refcount is done correctly for vm_bind we wouldn't need the
>   tricky code in the bo close paths. Unfortunately legacy vm with
>   relocations and softpin require that vma are only a weak reference, so
>   that cannot be removed.
>
> - there's also a ton of special cases for ggtt handling, like the
>   different views (for display, partial views for mmap), but also the
>   gen2/3 alignment and padding requirements which vm_bind never needs.
>
> I think the right thing here is to massively split the implementation
> behind some solid vm/vma abstraction, with a base class for vm and vma
> which _only_ has the pieces which both vm_bind and the legacy vm stuff
> needs. But it's a bit tricky to get there. I think a workable path would
> be:
> - Add a new base class to both i915_address_space and i915_vma, which
>   starts out empty.
>
> - As vm_bind code lands, move things that vm_bind code needs into these
>   base classes
>
> - The goal should be that these base classes are a stand-alone library
>   that other drivers could reuse. Like we've done with the buddy
>   allocator, which first moved from i915-gem to i915-ttm, and which amd
>   now moved to drm/ttm for reuse by amdgpu. Ideally other drivers
>   interested in adding something like vm_bind should be involved from the
>   start (or maybe the entire thing reused in amdgpu, they're looking at
>   vk sparse binding support too or at least have perf issues I think).
>
> - Locking must be the same across all implementations, otherwise it's
>   really not an abstraction. i915 screwed this up terribly by having
>   different locking rules for ppgtt and ggtt, which is just nonsense.
>
> - The legacy specific code needs to be extracted as much as possible and
>   shoved into separate files. In execbuf this means we need to get back to
>   object centric flow, and the slowpaths need to become a lot simpler
>   again (Maarten has cleaned up some of this, but there's still a silly
>   amount of hacks in there with funny layering).
>
> - I think if stuff like the vma eviction details (list movement and
>   locking and refcounting of the underlying object)
>
> > +
> > +These can be worked upon after intitial vm_bind support is added.
>
> I don't think that works, given how badly i915-gem team screwed up in
> other places. And those places had to be fixed by adopting shared code
> like ttm. Plus there's already a huge unfulffiled promise pending with the
> drm/sched conversion, i915-gem team is clearly deeply in the red here :-/
>
> Cheers, Daniel
>
> > +
> > +
> > +UAPI
> > +=====
> > +Uapi definiton can be found here:
> > +.. kernel-doc:: Documentation/gpu/rfc/i915_vm_bind.h
> > diff --git a/Documentation/gpu/rfc/index.rst b/Documentation/gpu/rfc/index.rst
> > index 91e93a705230..7d10c36b268d 100644
> > --- a/Documentation/gpu/rfc/index.rst
> > +++ b/Documentation/gpu/rfc/index.rst
> > @@ -23,3 +23,7 @@ host such documentation:
> >  .. toctree::
> >
> >      i915_scheduler.rst
> > +
> > +.. toctree::
> > +
> > +    i915_vm_bind.rst
> > --
> > 2.21.0.rc0.32.g243a4c7e27
> >
>
> --
> Daniel Vetter
> Software Engineer, Intel Corporation
> http://blog.ffwll.ch



-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [Intel-gfx] [RFC v2 1/2] drm/doc/rfc: VM_BIND feature design document
@ 2022-03-31 11:37       ` Daniel Vetter
  0 siblings, 0 replies; 31+ messages in thread
From: Daniel Vetter @ 2022-03-31 11:37 UTC (permalink / raw)
  To: Niranjana Vishwanathapura, Jason Ekstrand, Bloomfield, Jon,
	Dave Airlie, Ben Skeggs, Christian König, Daniel Stone
  Cc: daniel.vetter, intel-gfx, thomas.hellstrom, chris.p.wilson, dri-devel

One thing I've forgotten, since it's only hinted at here: If/when we
switch tlb flushing from the current dumb&synchronous implementation
we now have in i915 in upstream to one with batching using dma_fence,
then I think that should be something which is done with a small
helper library of shared code too. The batching is somewhat tricky,
and you need to make sure you put the fence into the right
dma_resv_usage slot, and the trick of replacing the vm fence with a
tlb flush fence is also a good reason to share the code so we only
have it in one place.
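As a toy illustration of the batching idea (the names and structure here are invented for this sketch, not the proposed kernel API): many unbind requests can share a single TLB flush by tracking a flush sequence number, so one flush completes every request issued before it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative user-space model only: each unbind is stamped with a
 * seqno, and one TLB flush covers every request stamped before it,
 * which is the batching being discussed here. */
struct tlb_flush_tracker {
	uint64_t next_seqno;    /* seqno handed to the next unbind */
	uint64_t flushed_seqno; /* requests below this are flushed */
};

static uint64_t unbind_request(struct tlb_flush_tracker *t)
{
	return t->next_seqno++; /* record the request, no flush yet */
}

static void do_tlb_flush(struct tlb_flush_tracker *t)
{
	/* a single (expensive) flush covers everything requested so far */
	t->flushed_seqno = t->next_seqno;
}

static bool unbind_complete(const struct tlb_flush_tracker *t,
			    uint64_t seqno)
{
	return seqno < t->flushed_seqno;
}
```

Three unbinds followed by one flush complete all three, instead of three
synchronous flushes.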

Christian's recent work also has some prep work for this already with
the fence replacing trick.
-Daniel

On Thu, 31 Mar 2022 at 10:28, Daniel Vetter <daniel@ffwll.ch> wrote:
> Adding a pile of people who've expressed interest in vm_bind for their
> drivers.
>
> Also note to the intel folks: This is largely written with me having my
> subsystem co-maintainer hat on, i.e. what I think is the right thing to do
> here for the subsystem at large. There is substantial rework involved
> here, but it's not any different from i915 adopting ttm or i915 adopting
> drm/sched, and I do think this stuff needs to happen in one form or
> another.
>
> On Mon, Mar 07, 2022 at 12:31:45PM -0800, Niranjana Vishwanathapura wrote:
> > VM_BIND design document with description of intended use cases.
> >
> > Signed-off-by: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>
> > ---
> >  Documentation/gpu/rfc/i915_vm_bind.rst | 210 +++++++++++++++++++++++++
> >  Documentation/gpu/rfc/index.rst        |   4 +
> >  2 files changed, 214 insertions(+)
> >  create mode 100644 Documentation/gpu/rfc/i915_vm_bind.rst
> >
> > diff --git a/Documentation/gpu/rfc/i915_vm_bind.rst b/Documentation/gpu/rfc/i915_vm_bind.rst
> > new file mode 100644
> > index 000000000000..cdc6bb25b942
> > --- /dev/null
> > +++ b/Documentation/gpu/rfc/i915_vm_bind.rst
> > @@ -0,0 +1,210 @@
> > +==========================================
> > +I915 VM_BIND feature design and use cases
> > +==========================================
> > +
> > +VM_BIND feature
> > +================
> > +DRM_I915_GEM_VM_BIND/UNBIND ioctls allow UMDs to bind/unbind GEM buffer
> > +objects (BOs) or sections of a BO at specified GPU virtual addresses on
> > +a specified address space (VM).
> > +
> > +These mappings (also referred to as persistent mappings) will be persistent
> > +across multiple GPU submissions (execbuff) issued by the UMD, without the
> > +user having to provide a list of all required mappings during each submission
> > +(as required by older execbuff mode).
> > +
> > +VM_BIND ioctl defers binding the mappings until the next execbuff submission
> > +where it will be required, or immediately if I915_GEM_VM_BIND_IMMEDIATE
> > +flag is set (useful if mapping is required for an active context).
>
> So this is a screw-up I've done, and for upstream I think we need to fix
> it: Implicit sync is bad, and it's also still a bad idea for vm_bind, and
> I was wrong suggesting we should do this a few years back when we kicked
> this off internally :-(
>
> What I think we need is just always VM_BIND_IMMEDIATE mode, and then a few
> things on top:
> - in and out fences, like with execbuf, to allow userspace to sync with
>   execbuf as needed
> - for compute-mode context this means userspace memory fences
> - for legacy context this means a timeline syncobj in drm_syncobj
>
> No sync_file or anything else like this at all. This means a bunch of
> work, but also it'll have benefits because it means we should be able to
> use exactly the same code paths and logic for both compute and for legacy
> context, because drm_syncobj supports future fence semantics.
>
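To make the suggestion concrete, here is a hypothetical sketch of what such an in/out timeline-fence extension could look like; struct i915_user_extension is mirrored from the existing i915 uapi, while the extension name and fields are invented for this sketch (stdint types stand in for __u32/__u64 so it builds stand-alone):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Extension chain link, mirrored from the existing i915 uapi. */
struct i915_user_extension {
	uint64_t next_extension;
	uint32_t name;
	uint32_t flags;
};

/* HYPOTHETICAL: wait on (in_handle, in_point) before the bind starts,
 * signal (out_handle, out_point) once the bind completes. */
struct i915_vm_bind_ext_timeline_fences {
	struct i915_user_extension base;
	uint32_t in_handle;  /* drm_syncobj to wait on, 0 = none */
	uint32_t out_handle; /* drm_syncobj to signal, 0 = none */
	uint64_t in_point;   /* timeline point to wait for */
	uint64_t out_point;  /* timeline point to signal */
};
```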
> Also on the implementation side we still need to install dma_fence to the
> various dma_resv, and for this we need the new dma_resv_usage series from
> Christian König first. vm_bind fences can then use the USAGE_BOOKKEEPING
> flag to make sure they never result in an oversync issue with execbuf. I
> don't think trying to land vm_bind without that prep work in
> dma_resv_usage makes sense.
>
> Also as soon as dma_resv_usage has landed there's a few cleanups we should
> do in i915:
> - ttm bo moving code should probably be simplified a bit (and maybe more of the
>   code should be pushed as helpers into ttm)
> - clflush code should be moved over to using USAGE_KERNEL and the various
>   hacks and special cases should be ditched. See df94fd05e69e ("drm/i915:
>   expand on the kernel-doc for cache_dirty") for a bit more context
>
> This is still not yet enough, since if a vm_bind races with an eviction we
> might stall on the new buffers being readied first before the context can
> continue. This needs some care to make sure that vma which aren't fully
> bound yet are on a separate list, and vma which are marked for unbinding
> are removed from the main working set list as soon as possible.
>
> All of these things are relevant for the uapi semantics, which means
> - they need to be documented in the uapi kerneldoc, ideally with example
>   flows
> - umd need to ack this
>
> The other thing here is the async/nonblocking path. I think we still need
> that one, but again it should not sync with anything going on in execbuf,
> but simply execute the ioctl code in a kernel thread. The idea here is
> that this works like a special gpu engine, so that compute and vk can
> schedule bindings interleaved with rendering. This should be enough to get
> a performant vk sparse binding/textures implementation.
>
> But I'm not entirely sure on this one, so this definitely needs acks from
> umds.
>
> > +VM_BIND feature is advertised to user via I915_PARAM_HAS_VM_BIND.
> > +User has to opt-in for VM_BIND mode of binding for an address space (VM)
> > +during VM creation time via I915_VM_CREATE_FLAGS_USE_VM_BIND extension.
> > +A VM in VM_BIND mode will not support older execbuff mode of binding.
> > +
> > +UMDs can still send BOs of these persistent mappings in the execlist of
> > +execbuff for specifying BO dependencies (implicit fencing) and to use a BO
> > +as a batch, but those BOs should be mapped ahead of time via the vm_bind ioctl.
>
> should or must?
>
> Also I'm not really sure that's a great interface. The batchbuffer really
> only needs to be an address, so maybe all we need is an extension to
> supply an u64 batchbuffer address instead of trying to retrofit this into
> an unfitting current uapi.
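A hypothetical shape for that batch-address extension (the struct name and layout are invented for illustration; only struct i915_user_extension mirrors the existing uapi):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct i915_user_extension { /* mirrored from the existing i915 uapi */
	uint64_t next_extension;
	uint32_t name;
	uint32_t flags;
};

/* HYPOTHETICAL: pass the batch buffer purely as a GPU virtual address
 * (which must already be vm_bound), instead of as a BO in the execlist. */
struct drm_i915_gem_execbuffer_ext_batch_address {
	struct i915_user_extension base;
	uint64_t batch_address; /* GPU VA of the first-level batch */
};
```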
>
> And for implicit sync there's two things:
> - for vk I think the right uapi is the dma-buf fence import/export ioctls
>   from Jason Ekstrand. I think we should land that first instead of
>   hacking funny concepts together
> - for gl the dma-buf import/export might not be fast enough, since gl
>   needs to do a _lot_ of implicit sync. There we might need to use the
>   execbuffer buffer list, but then we should have extremely clear uapi
>   rules which disallow _everything_ except setting the explicit sync uapi
>
> Again all this stuff needs to be documented in detail in the kerneldoc
> uapi spec.
>
> > +VM_BIND features include:
> > +- Multiple Virtual Address (VA) mappings can map to the same physical pages
> > +  of an object (aliasing).
> > +- VA mapping can map to a partial section of the BO (partial binding).
> > +- Support capture of persistent mappings in the dump upon GPU error.
> > +- TLB is flushed upon unbind completion. Batching of TLB flushes in some
> > +  usecases will be helpful.
> > +- Asynchronous vm_bind and vm_unbind support.
> > +- VM_BIND uses user/memory fence mechanism for signaling bind completion
> > +  and for signaling batch completion in long running contexts (explained
> > +  below).
>
> This should all be in the kerneldoc.
>
> > +VM_PRIVATE objects
> > +------------------
> > +By default, BOs can be mapped on multiple VMs and can also be dma-buf
> > +exported. Hence these BOs are referred to as Shared BOs.
> > +During each execbuff submission, the request fence must be added to the
> > +dma-resv fence list of all shared BOs mapped on the VM.
> > +
> > +The VM_BIND feature introduces an optimization where the user can create a BO which
> > +is private to a specified VM via I915_GEM_CREATE_EXT_VM_PRIVATE flag during
> > +BO creation. Unlike Shared BOs, these VM private BOs can only be mapped on
> > +the VM they are private to and can't be dma-buf exported.
> > +All private BOs of a VM share the dma-resv object. Hence during each execbuff
> > +submission, they need only one dma-resv fence list updated. Thus the fast
> > +path (where required mappings are already bound) submission latency is O(1)
> > +w.r.t the number of VM private BOs.
>
> Two things:
>
> - I think the above is required to for initial vm_bind for vk, it kinda
>   doesn't make much sense without that, and will allow us to match amdgpu
>   and radeonsi
>
> - Christian König just landed ttm bulk lru helpers, and I think we need to
>   use those. This means vm_bind will only work with the ttm backend, but
>   that's what we have for the big dgpu where vm_bind helps more in terms
>   of performance, and the igfx conversion to ttm is already going on.
>
> Furthermore the i915 shrinker lru has stopped being an lru, so I think
> that should also be moved over to the ttm lru in some fashion to make sure
> we once again have a reasonable and consistent memory aging and reclaim
> architecture. The current code is just too much of a complete mess.
>
> And since this is all fairly integral to how the code arch works I don't
> think merging a different version which isn't based on ttm bulk lru
> helpers makes sense.
>
> Also I do think the page table lru handling needs to be included here,
> because that's another complete hand-rolled separate world for not much
> good reasons. I guess that can happen in parallel with the initial vm_bind
> bring-up, but it needs to be completed by the time we add the features
> beyond the initial support needed for vk.
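For illustration, this is roughly how a VM-private BO could be created; the flag name I915_GEM_CREATE_EXT_VM_PRIVATE comes from the RFC text, but the extension struct layout and the extension number used here are guesses:

```c
#include <assert.h>
#include <stdint.h>

struct i915_user_extension { /* mirrored from the existing i915 uapi */
	uint64_t next_extension;
	uint32_t name;
	uint32_t flags;
};

#define I915_GEM_CREATE_EXT_VM_PRIVATE 2 /* illustrative value only */

/* Guessed layout: tie the new BO to a single VM at creation time so
 * it can share that VM's dma-resv object. */
struct drm_i915_gem_create_ext_vm_private {
	struct i915_user_extension base;
	uint32_t vm_id; /* the only VM this BO may ever be bound in */
	uint32_t pad;
};

static void init_vm_private_ext(struct drm_i915_gem_create_ext_vm_private *ext,
				uint32_t vm_id)
{
	ext->base.next_extension = 0;
	ext->base.name = I915_GEM_CREATE_EXT_VM_PRIVATE;
	ext->base.flags = 0;
	ext->vm_id = vm_id;
	ext->pad = 0;
}
```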
>
> > +VM_BIND locking hierarchy
> > +-------------------------
> > +VM_BIND locking order is as below.
> > +
> > +1) A vm_bind mutex will protect vm_bind lists. This lock is taken in vm_bind/
> > +   vm_unbind ioctl calls, in the execbuff path and while releasing the mapping.
> > +
> > +   In future, when GPU page faults are supported, we can potentially use a
> > +   rwsem instead, so that multiple pagefault handlers can take the read side
> > +   lock to lookup the mapping and hence can run in parallel.
> > +
> > +2) The BO's dma-resv lock will protect i915_vma state and needs to be held
> > +   while binding a vma and while updating dma-resv fence list of a BO.
> > +   The private BOs of a VM will all share a dma-resv object.
> > +
> > +   This lock is held in vm_bind call for immediate binding, during vm_unbind
> > +   call for unbinding and during execbuff path for binding the mapping and
> > +   updating the dma-resv fence list of the BO.
> > +
> > +3) Spinlock/s to protect some of the VM's lists.
> > +
> > +We will also need support for bulk LRU movement of persistent mappings to
> > +avoid additional latencies in the execbuff path.
>
> This needs more detail and explanation of how each level is required. Also
> the shared dma_resv for VM_PRIVATE objects is kinda important to explain.
>
> Like "some of the VM's lists" explains pretty much nothing.
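A user-space model of the three documented lock levels, with plain flags standing in for the kernel primitives, just to make the required acquisition order on the immediate-bind path explicit (everything here is illustrative):

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-ins for: (1) the vm_bind mutex, (2) the BO dma-resv lock,
 * (3) a VM list spinlock. On the modelled path every lower level is
 * taken first, and locks are released in reverse order, which is what
 * keeps the documented hierarchy deadlock-free. */
static bool held[4];    /* held[1..3]: is lock level N held? */
static int order[3], n; /* acquisition order actually observed */

static void lock_level(int level)
{
	/* in this modelled path, every lower-numbered lock is held first */
	for (int l = 1; l < level; l++)
		assert(held[l]);
	assert(!held[level]);
	held[level] = true;
	order[n++] = level;
}

static void unlock_level(int level)
{
	assert(held[level]);
	held[level] = false;
}

static void vm_bind_immediate_path(void)
{
	lock_level(1); /* vm_bind mutex: protects vm_bind lists */
	lock_level(2); /* dma-resv: bind the vma, update fence list */
	lock_level(3); /* spinlock: move the vma between VM lists */
	unlock_level(3);
	unlock_level(2);
	unlock_level(1);
}
```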
>
> > +
> > +GPU page faults
> > +----------------
> > +Both older execbuff mode and the newer VM_BIND mode of binding will require
> > +using dma-fence to ensure residency.
> > +In the future, when GPU page faults are supported, no dma-fence usage is
> > +required, as residency is purely managed by installing and
> > +removing/invalidating ptes.
>
> This is a bit confusing. I think one part of this should be moved into the
> section with future vm_bind use-cases (we're not going to support page
> faults with legacy softpin or even worse, relocations). The locking
> discussion should be part of the much longer list of uses cases that
> motivate the locking design.
>
> > +
> > +
> > +User/Memory Fence
> > +==================
> > +The idea is to take a user specified virtual address and install an interrupt
> > +handler to wake up the current task when the memory location passes the user
> > +supplied filter.
> > +
> > +A user/memory fence is an <address, value> pair. To signal the user fence,
> > +the specified value will be written at the specified virtual address and
> > +the waiting process will be woken up. The user can wait on a user fence with
> > +the gem_wait_user_fence ioctl.
> > +
> > +It also allows the user to emit their own MI_FLUSH/PIPE_CONTROL notify
> > +interrupt within their batches after updating the value to have sub-batch
> > +precision on the wakeup. Each batch can signal a user fence to indicate
> > +the completion of the next level batch. The completion of the very first level
> > +batch needs to be signaled by the command streamer. The user must provide the
> > +user/memory fence for this via the DRM_I915_GEM_EXECBUFFER_EXT_USER_FENCE
> > +extension of execbuff ioctl, so that KMD can setup the command streamer to
> > +signal it.
> > +
> > +User/Memory fence can also be supplied to the kernel driver to signal/wake up
> > +the user process after completion of an asynchronous operation.
> > +
> > +When VM_BIND ioctl was provided with a user/memory fence via the
> > +I915_VM_BIND_EXT_USER_FENCE extension, it will be signaled upon the completion
> > +of binding of that mapping. All async binds/unbinds are serialized, hence
> > +signaling of the user/memory fence also indicates the completion of all
> > +previous binds/unbinds.
> > +
> > +This feature will be derived from the below original work:
> > +https://patchwork.freedesktop.org/patch/349417/
>
> This is 1:1 tied to long running compute mode contexts (which in the uapi
> doc must reference the endless amounts of bikeshed summary we have in the
> docs about indefinite fences).
>
> I'd put this into a new section about compute and userspace memory fences
> support, with this and the next chapter ...
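The <address, value> semantics above can be modelled in a few lines of user-space C (the real interface additionally raises an interrupt so the waiter can sleep rather than poll):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* A user/memory fence is just a qword in user memory plus a target
 * value; the signaler writes the value, and the waiter completes once
 * the location holds it. */
static void user_fence_signal(_Atomic uint64_t *addr, uint64_t val)
{
	atomic_store_explicit(addr, val, memory_order_release);
	/* the kernel/GPU would also raise an interrupt to wake waiters */
}

static int user_fence_signaled(_Atomic uint64_t *addr, uint64_t val)
{
	return atomic_load_explicit(addr, memory_order_acquire) == val;
}
```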
> > +
> > +
> > +VM_BIND use cases
> > +==================
>
> ... and then make this section here focus entirely on additional vm_bind
> use-cases that we'll be adding later on. Which doesn't need to go into any
> details, it's just justification for why we want to build the world on top
> of vm_bind.
>
> > +
> > +Long running Compute contexts
> > +------------------------------
> > +Usage of dma-fence expects that they complete in reasonable amount of time.
> > +Compute on the other hand can be long running. Hence it is appropriate for
> > +compute to use user/memory fence and dma-fence usage will be limited to
> > +in-kernel consumption only. This requires an execbuff uapi extension to pass
> > +in user fence. Compute must opt-in for this mechanism with
> > +I915_CONTEXT_CREATE_FLAGS_LONG_RUNNING flag during context creation.
> > +
> > +The dma-fence based user interfaces like the gem_wait ioctl, execbuff out
> > +fence and implicit dependency setting are not allowed on long running contexts.
> > +
> > +Where GPU page faults are not available, the kernel driver, upon buffer
> > +invalidation, will initiate a suspend (preemption) of the long running context
> > +with a dma-fence attached to it. Upon completion of that suspend fence, it
> > +will finish the invalidation, revalidate the BO and then resume the compute
> > +context. This is
> > +done by having a per-context fence (called suspend fence) proxying as
> > +i915_request fence. This suspend fence is enabled when there is a wait on it,
> > +which triggers the context preemption.
> > +
> > +This is much easier to support with VM_BIND compared to the current heavier
> > +execbuff path resource attachment.
>
> There's a bunch of tricky code around compute mode context support, like
> the preempt ctx fence (or suspend fence or whatever you want to call it),
> and the resume work. And I think that code should be shared across
> drivers.
>
> I think the right place to put this is into drm/sched, somewhere attached
> to the drm_sched_entity structure. I expect i915 folks to collaborate with
> amd and ideally also get amdkfd to adopt the same thing if possible. At
> least Christian has mentioned in the past that he's a bit unhappy about
> how this works.
>
> Also drm/sched has dependency tracking, which will be needed to pipeline
> context resume operations. That needs to be used instead of i915-gem
> inventing yet another dependency tracking data structure (it already has 3
> and that's roughly 3 too many).
>
> This means compute mode support and userspace memory fences are blocked on
> the drm/sched conversion, but *eh* add it to the list of reasons for why
> drm/sched needs to happen.
>
> Also since we only have support for compute mode ctx in our internal tree
> with the guc scheduler backend anyway, and the first conversion target is
> the guc backend, I don't think this actually holds up a lot of the code.
>
> > +Low Latency Submission
> > +-----------------------
> > +Allows compute UMD to directly submit GPU jobs instead of through execbuff
> > +ioctl. VM_BIND allows map/unmap of BOs required for directly submitted jobs.
>
> This is really just a special case of compute mode contexts, I think I'd
> include that in there, but explain better what it requires (i.e. vm_bind
> not being synchronized against execbuf).
>
> > +
> > +Debugger
> > +---------
> > +With the debug event interface, a user space process (debugger) is able to
> > +keep track of and act upon resources created by another process (debuggee)
> > +and attached to the GPU via the vm_bind interface.
> > +
> > +Mesa/Vulkan
> > +------------
> > +VM_BIND can potentially reduce the CPU overhead in Mesa, thus improving
> > +performance. For Vulkan it should be straightforward to use VM_BIND.
> > +For Iris implicit buffer tracking must be implemented before we can harness
> > +VM_BIND benefits. With increasing GPU hardware performance reducing CPU
> > +overhead becomes more important.
>
> Just to clarify, I don't think we can land vm_bind into upstream if it
> doesn't work 100% for vk. There's a bit much "can" instead of "will" in
> this section.
>
> > +
> > +Page level hints settings
> > +--------------------------
> > +VM_BIND allows any hints setting per mapping instead of per BO.
> > +Possible hints include read-only, placement and atomicity.
> > +Sub-BO level placement hint will be even more relevant with
> > +upcoming GPU on-demand page fault support.
> > +
> > +Page level Cache/CLOS settings
> > +-------------------------------
> > +VM_BIND allows cache/CLOS settings per mapping instead of per BO.
> > +
> > +Shared Virtual Memory (SVM) support
> > +------------------------------------
> > +VM_BIND interface can be used to map system memory directly (without gem BO
> > +abstraction) using the HMM interface.
>
> Userptr is absent here (and it's not the same as svm, at least on
> discrete), and this is needed for the initial version since otherwise vk
> can't use it because we're not at feature parity.
>
> Irc discussions by Maarten and Dave came up with the idea that maybe
> userptr for vm_bind should work _without_ any gem bo as backing storage,
> since that guarantees that people don't come up with funny ideas like
>   trying to share such a bo across processes or mmap it and other nonsense which
> just doesn't work.
>
> > +
> > +
> > +Broader i915 cleanups
> > +=====================
> > +Supporting this whole new vm_bind mode of binding, which comes with its own
> > +use cases to support and its own locking requirements, requires proper
> > +integration with the existing i915 driver. This calls for some broader i915
> > +driver cleanups/simplifications for maintainability of the driver going
> > +forward. Here are a few things that have been identified and are being
> > +looked into.
> > +
> > +- Make pagetable allocations evictable and manage them similar to VM_BIND
> > +  mapped objects. Page table pages are similar to persistent mappings of a
> > +  VM (the difference here is that the page table pages will not have an
> > +  i915_vma structure and, after swapping pages back in, the parent page
> > +  link needs to be updated).
>
> See above, but I think this should be included as part of the initial
> vm_bind push.
>
> > +- Remove the vma lookup cache (eb->gem_context->handles_vma). The VM_BIND
> > +  feature does not use it, and the complexity it brings in is probably more
> > +  than the performance advantage we get in the legacy execbuff case.
> > +- Remove vma->open_count counting
> > +- Remove i915_vma active reference tracking. Instead use underlying BO's
> > +  dma-resv fence list to determine if a i915_vma is active or not.
>
> So this is a complete mess, and really should not exist. I think it needs
> to be removed before we try to make i915_vma even more complex by adding
> vm_bind.
>
> The other thing I've been pondering here is that vm_bind is really
> completely different from legacy vm structures for a lot of reasons:
> - no relocation or softpin handling, which means vm_bind has no reason to
>   ever look at the i915_vma structure in execbuf code. Unfortunately
>   execbuf has been rewritten to be vma instead of obj centric, so it's a
>   100% mismatch
>
> - vm_bind never has to manage any vm lru. Legacy execbuf has to maintain
>   that because the kernel manages the virtual address space fully. Again
>   ideally that entire vma_move_to_active code and everything related to it
>   would simply not exist.
>
> - similar on the eviction side, the rules are quite different: For vm_bind
>   we never tear down the vma, instead it's just moved to the list of
>   evicted vma. Legacy vm have no need for all these additional lists, so
>   another huge confusion.
>
> - if the refcount is done correctly for vm_bind we wouldn't need the
>   tricky code in the bo close paths. Unfortunately legacy vm with
>   relocations and softpin require that vma are only a weak reference, so
>   that cannot be removed.
>
> - there's also a ton of special cases for ggtt handling, like the
>   different views (for display, partial views for mmap), but also the
>   gen2/3 alignment and padding requirements which vm_bind never needs.
>
> I think the right thing here is to massively split the implementation
> behind some solid vm/vma abstraction, with a base class for vm and vma
> which _only_ has the pieces which both vm_bind and the legacy vm stuff
> needs. But it's a bit tricky to get there. I think a workable path would
> be:
> - Add a new base class to both i915_address_space and i915_vma, which
>   starts out empty.
>
> - As vm_bind code lands, move things that vm_bind code needs into these
>   base classes
>
> - The goal should be that these base classes are a stand-alone library
>   that other drivers could reuse. Like we've done with the buddy
>   allocator, which first moved from i915-gem to i915-ttm, and which amd
>   now moved to drm/ttm for reuse by amdgpu. Ideally other drivers
>   interested in adding something like vm_bind should be involved from the
>   start (or maybe the entire thing reused in amdgpu, they're looking at
>   vk sparse binding support too or at least have perf issues I think).
>
> - Locking must be the same across all implementations, otherwise it's
>   really not an abstraction. i915 screwed this up terribly by having
>   different locking rules for ppgtt and ggtt, which is just nonsense.
>
> - The legacy specific code needs to be extracted as much as possible and
>   shoved into separate files. In execbuf this means we need to get back to
>   object centric flow, and the slowpaths need to become a lot simpler
>   again (Maarten has cleaned up some of this, but there's still a silly
>   amount of hacks in there with funny layering).
>
> - I think if stuff like the vma eviction details (list movement and
>   locking and refcounting of the underlying object)
>
> > +
> > +These can be worked upon after initial vm_bind support is added.
>
> I don't think that works, given how badly i915-gem team screwed up in
> other places. And those places had to be fixed by adopting shared code
> like ttm. Plus there's already a huge unfulfilled promise pending with the
> drm/sched conversion, i915-gem team is clearly deeply in the red here :-/
>
> Cheers, Daniel
>
> > +
> > +
> > +UAPI
> > +=====
> > +Uapi definition can be found here:
> > +.. kernel-doc:: Documentation/gpu/rfc/i915_vm_bind.h
> > diff --git a/Documentation/gpu/rfc/index.rst b/Documentation/gpu/rfc/index.rst
> > index 91e93a705230..7d10c36b268d 100644
> > --- a/Documentation/gpu/rfc/index.rst
> > +++ b/Documentation/gpu/rfc/index.rst
> > @@ -23,3 +23,7 @@ host such documentation:
> >  .. toctree::
> >
> >      i915_scheduler.rst
> > +
> > +.. toctree::
> > +
> > +    i915_vm_bind.rst
> > --
> > 2.21.0.rc0.32.g243a4c7e27
> >
>
> --
> Daniel Vetter
> Software Engineer, Intel Corporation
> http://blog.ffwll.ch



-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch


* Re: [Intel-gfx] [RFC v2 2/2] drm/doc/rfc: VM_BIND uapi definition
  2022-03-30 12:51   ` Daniel Vetter
@ 2022-04-20 20:18     ` Niranjana Vishwanathapura
  0 siblings, 0 replies; 31+ messages in thread
From: Niranjana Vishwanathapura @ 2022-04-20 20:18 UTC (permalink / raw)
  To: Daniel Vetter
  Cc: daniel.vetter, intel-gfx, thomas.hellstrom, chris.p.wilson, dri-devel

On Wed, Mar 30, 2022 at 02:51:41PM +0200, Daniel Vetter wrote:
>On Mon, Mar 07, 2022 at 12:31:46PM -0800, Niranjana Vishwanathapura wrote:
>> VM_BIND and related uapi definitions
>>
>> Signed-off-by: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>
>> ---
>>  Documentation/gpu/rfc/i915_vm_bind.h | 176 +++++++++++++++++++++++++++
>
>Maybe as the top level comment: The point of documenting uapi isn't to
>just spell out all the fields, but to define _how_ and _why_ things work.
>This part is completely missing from these docs here.
>

Thanks Daniel,

Some of the documentation is in the rst file.
Ok, will add documentation here on _how_ and _why_.

>>  1 file changed, 176 insertions(+)
>>  create mode 100644 Documentation/gpu/rfc/i915_vm_bind.h
>>
>> diff --git a/Documentation/gpu/rfc/i915_vm_bind.h b/Documentation/gpu/rfc/i915_vm_bind.h
>> new file mode 100644
>> index 000000000000..80f00ee6c8a1
>> --- /dev/null
>> +++ b/Documentation/gpu/rfc/i915_vm_bind.h
>
>You need to include this somewhere so it's rendered, see the previous
>examples.

Looking at previous examples, my understanding is that this is just a
documentation file at this point which goes into the Documentation/gpu/rfc
folder, and we will have to remove it later once the actual uapi changes land
in include/uapi/drm/i915_drm.h. Let me know if that is incorrect and needs
changing.

>
>> @@ -0,0 +1,176 @@
>> +/* SPDX-License-Identifier: MIT */
>> +/*
>> + * Copyright © 2022 Intel Corporation
>> + */
>> +
>> +/* VM_BIND feature availability through drm_i915_getparam */
>> +#define I915_PARAM_HAS_VM_BIND		57
>
>Needs to be kernel-docified, which means we need a prep patch that fixes
>up the existing mess.
>

Ok on kernel-doc, but as mentioned above, I am not sure we need prep
patch that fixes up other existing fields at this point.

>> +
>> +/* VM_BIND related ioctls */
>> +#define DRM_I915_GEM_VM_BIND		0x3d
>> +#define DRM_I915_GEM_VM_UNBIND		0x3e
>> +#define DRM_I915_GEM_WAIT_USER_FENCE	0x3f
>> +
>> +#define DRM_IOCTL_I915_GEM_VM_BIND		DRM_IOWR(DRM_COMMAND_BASE + DRM_I915_GEM_VM_BIND, struct drm_i915_gem_vm_bind)
>> +#define DRM_IOCTL_I915_GEM_VM_UNBIND		DRM_IOWR(DRM_COMMAND_BASE + DRM_I915_GEM_VM_UNBIND, struct drm_i915_gem_vm_bind)
>> +#define DRM_IOCTL_I915_GEM_WAIT_USER_FENCE	DRM_IOWR(DRM_COMMAND_BASE + DRM_I915_GEM_WAIT_USER_FENCE, struct drm_i915_gem_wait_user_fence)
>> +
>> +/**
>> + * struct drm_i915_gem_vm_bind - VA to object/buffer mapping to [un]bind.
>
>Both binding and unbinding need to specify in excruciating detail what
>happens if there's overlaps (existing mappings, or unmapping a range which
>has no mapping, or only partially full of maps or different objects) and
>fun stuff like that.
>

Ok, will add those details.

>> + */
>> +struct drm_i915_gem_vm_bind {
>> +	/** vm to [un]bind */
>> +	__u32 vm_id;
>> +
>> +	/**
>> +	 * BO handle or file descriptor.
>> +	 * 'fd' value of -1 is reserved for system pages (SVM)
>> +	 */
>> +	union {
>> +		__u32 handle; /* For unbind, it is reserved and must be 0 */
>
>I think it'd be a lot cleaner if we do a bind and an unbind struct for
>these, instead of mixing it up.
>

Ok

>Also I thought mesa requested to be able to unmap an object from a vm
>without a range. Has that been dropped, and confirmed to not be needed.
>

Hmm... I think it was the other way around, i.e., to unmap a range in a vm
but without an object. We already support that.

>> +		__s32 fd;
>
>If we don't need it right away then don't add it yet. If it's planned to
>be used then it needs to be documented, but I kinda have no idea why you'd
>need an fd for svm?
>

It is not required for SVM; it was intended for future expansions, and '-1'
was reserved for SVM.
Ok, will remove it for now.

>> +	};
>> +
>> +	/** VA start to [un]bind */
>> +	__u64 start;
>> +
>> +	/** Offset in object to [un]bind */
>> +	__u64 offset;
>> +
>> +	/** VA length to [un]bind */
>> +	__u64 length;
>> +
>> +	/** Flags */
>> +	__u64 flags;
>> +	/** Bind the mapping immediately instead of during next submission */
>
>This aint kerneldoc.
>
>Also this needs to specify in much more detail what exactly this means,
>and also how it interacts with execbuf.
>

Ok

>So the patch here probably needs to include the missing pieces on the
>execbuf side of things. Like how does execbuf work when it's used with a
>vm_bind managed vm? That means:
>- document the pieces that are there
>- then add a patch to document how that all changes with vm_bind

Hmm, I am a bit confused. The current execbuff handling documentation is in
i915_gem_execbuffer.c. Not sure how to update it in this design RFC patch.
With VM_BIND support, we only support vm_bind vmas in the execbuff path and,
based on comments on the other patch in this series, we probably should not
allow any execlist entries in vm_bind mode (no implicit syncing, and use
an extension for the batch address). Maybe I can update the rst file
in this series with this information for now. Thoughts?

>
>And do that for everything execbuf can do.
>
>> +#define I915_GEM_VM_BIND_IMMEDIATE   (1 << 0)
>> +	/** Read-only mapping */
>> +#define I915_GEM_VM_BIND_READONLY    (1 << 1)
>> +	/** Capture this mapping in the dump upon GPU error */
>> +#define I915_GEM_VM_BIND_CAPTURE     (1 << 2)
>> +
>> +	/** Zero-terminated chain of extensions */
>> +	__u64 extensions;
>> +};
>> +
>> +/**
>> + * struct drm_i915_vm_bind_ext_user_fence - Bind completion signaling extension.
>> + */
>> +struct drm_i915_vm_bind_ext_user_fence {
>> +#define I915_VM_BIND_EXT_USER_FENCE	0
>> +	/** @base: Extension link. See struct i915_user_extension. */
>> +	struct i915_user_extension base;
>> +
>> +	/** User/Memory fence qword aligned process virtual address */
>> +	__u64 addr;
>> +
>> +	/** User/Memory fence value to be written after bind completion */
>> +	__u64 val;
>> +
>> +	/** Reserved for future extensions */
>> +	__u64 rsvd;
>> +};
>> +
>> +/**
>> + * struct drm_i915_gem_execbuffer_ext_user_fence - First level batch completion
>> + * signaling extension.
>> + *
>> + * This extension allows user to attach a user fence (<addr, value> pair) to an
>> + * execbuf to be signaled by the command streamer after the completion of 1st
>> + * level batch, by writing the <value> at specified <addr> and triggering an
>> + * interrupt.
>> + * User can either poll for this user fence to signal or wait on it
>> + * with the i915_gem_wait_user_fence ioctl.
>> + * This is very useful for long running contexts where waiting on a dma-fence
>> + * by the user (like the i915_gem_wait ioctl) is not supported.
>> + */
>> +struct drm_i915_gem_execbuffer_ext_user_fence {
>> +#define DRM_I915_GEM_EXECBUFFER_EXT_USER_FENCE		0
>> +	/** @base: Extension link. See struct i915_user_extension. */
>> +	struct i915_user_extension base;
>> +
>> +	/**
>> +	 * User/Memory fence qword aligned GPU virtual address.
>> +	 * Address has to be a valid GPU virtual address at the time of
>> +	 * 1st level batch completion.
>> +	 */
>> +	__u64 addr;
>> +
>> +	/**
>> +	 * User/Memory fence Value to be written to above address
>> +	 * after 1st level batch completes.
>> +	 */
>> +	__u64 value;
>> +
>> +	/** Reserved for future extensions */
>> +	__u64 rsvd;
>> +};
>> +
>> +struct drm_i915_gem_vm_control {
>> +/** Flag to opt-in for VM_BIND mode of binding during VM creation */
>
>This is very confusingly documented and I have no idea how you're going
>to use an empty extension. Also it's not kerneldoc.
>

Yah, I was also wondering how to define new flag bits for the flags
fields in structures already defined in i915_drm.h.
Ok, will just put the flag bit definition here and mention the
structure field in the documentation part.

>Please check that the stuff you're creating renders properly in the html
>output.
>
>> +#define I915_VM_CREATE_FLAGS_USE_VM_BIND	(1 << 0)
>> +};
>> +
>> +
>> +struct drm_i915_gem_create_ext {
>> +/** Extension to make the object private to a specified VM */
>> +#define I915_GEM_CREATE_EXT_VM_PRIVATE		2
>
>Why 2?
>
>Also this all needs to be documented what it precisely means.
>

Because 0 and 1 are already taken (I915_GEM_CREATE_EXT_* in i915_drm.h).
Ok, will add required documentation.

>> +};
>> +
>> +
>> +struct prelim_drm_i915_gem_context_create_ext {
>> +/** Flag to declare context as long running */
>> +#define I915_CONTEXT_CREATE_FLAGS_LONG_RUNNING   (1u << 2)
>
>The compute mode context, again including full impact on execbuf, is not
>documented here. This also means any gaps in the context uapi
>documentation need to be filled first in prep patches.
>

Ok, will add documentation here.
As mentioned above, I guess the prep patch will come later, once this
RFC patch gets accepted?

>Also memory fences are extremely tricky, we need to specify in detail when
>they're allowed to be used and when not. This needs to reference the
>relevant sections from the dma-fence docs.
>

Ok

>> +};
>> +
>> +/**
>> + * struct drm_i915_gem_wait_user_fence
>> + *
>> + * Wait on user/memory fence. User/Memory fence can be woken up either by,
>> + *    1. GPU context indicated by 'ctx_id', or,
>> + *    2. Kernel driver async worker upon I915_UFENCE_WAIT_SOFT.
>> + *       'ctx_id' is ignored when this flag is set.
>> + *
>> + * Wakeup when the below condition is true.
>> + * (*addr & MASK) OP (VALUE & MASK)
>> + *
>> + */
>> +struct drm_i915_gem_wait_user_fence {
>> +	/** @base: Extension link. See struct i915_user_extension. */
>> +	__u64 extensions;
>> +
>> +	/** User/Memory fence address */
>> +	__u64 addr;
>> +
>> +	/** Id of the Context which will signal the fence. */
>> +	__u32 ctx_id;
>> +
>> +	/** Wakeup condition operator */
>> +	__u16 op;
>> +#define I915_UFENCE_WAIT_EQ      0
>> +#define I915_UFENCE_WAIT_NEQ     1
>> +#define I915_UFENCE_WAIT_GT      2
>> +#define I915_UFENCE_WAIT_GTE     3
>> +#define I915_UFENCE_WAIT_LT      4
>> +#define I915_UFENCE_WAIT_LTE     5
>> +#define I915_UFENCE_WAIT_BEFORE  6
>> +#define I915_UFENCE_WAIT_AFTER   7
>> +
>> +	/** Flags */
>> +	__u16 flags;
>> +#define I915_UFENCE_WAIT_SOFT    0x1
>> +#define I915_UFENCE_WAIT_ABSTIME 0x2
>> +
>> +	/** Wakeup value */
>> +	__u64 value;
>> +
>> +	/** Wakeup mask */
>> +	__u64 mask;
>> +#define I915_UFENCE_WAIT_U8     0xffu
>> +#define I915_UFENCE_WAIT_U16    0xffffu
>> +#define I915_UFENCE_WAIT_U32    0xfffffffful
>> +#define I915_UFENCE_WAIT_U64    0xffffffffffffffffull
>
>Do we really need all these flags, and does the hw really support all the
>combinations? Anything the hw doesn't support in MI_SEMAPHORE is pretty
>much useless as a umf (userspace memory fence) mode.
>

Hmm...The PIPE_CONTROL/MI_FLUSH instructions (used for wakeup) support 64-bit
writes. The gem_wait_user_fence ioctl wakeup condition is,
(*addr & MASK) OP (VALUE & MASK)
So, these values give the user options to configure the wakeup.

MI_SEMAPHORE seems to only support a 32-bit value check for wakeup.
But that is different from the gem_wait_user_fence ioctl wakeup above.


>> +
>> +	/** Timeout */
>
>Needs to specify the clock source.

Ok,

Niranjana

>-Daniel
>
>> +	__s64 timeout;
>> +};
>> --
>> 2.21.0.rc0.32.g243a4c7e27
>>
>
>-- 
>Daniel Vetter
>Software Engineer, Intel Corporation
>http://blog.ffwll.ch


* Re: [RFC v2 1/2] drm/doc/rfc: VM_BIND feature design document
  2022-03-31  8:28     ` Daniel Vetter
@ 2022-04-20 22:45       ` Niranjana Vishwanathapura
  -1 siblings, 0 replies; 31+ messages in thread
From: Niranjana Vishwanathapura @ 2022-04-20 22:45 UTC (permalink / raw)
  To: Daniel Vetter
  Cc: chris.p.wilson, dri-devel, Dave Airlie, intel-gfx, Bloomfield,
	Jon, Ben Skeggs, Jason Ekstrand, daniel.vetter, thomas.hellstrom,
	Christian König

On Thu, Mar 31, 2022 at 10:28:48AM +0200, Daniel Vetter wrote:
>Adding a pile of people who've expressed interest in vm_bind for their
>drivers.
>
>Also note to the intel folks: This is largely written with me having my
>subsystem co-maintainer hat on, i.e. what I think is the right thing to do
>here for the subsystem at large. There is substantial rework involved
>here, but it's not any different from i915 adopting ttm or i915 adopting
>drm/sched, and I do think this stuff needs to happen in one form or
>another.
>
>On Mon, Mar 07, 2022 at 12:31:45PM -0800, Niranjana Vishwanathapura wrote:
>> VM_BIND design document with description of intended use cases.
>>
>> Signed-off-by: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>
>> ---
>>  Documentation/gpu/rfc/i915_vm_bind.rst | 210 +++++++++++++++++++++++++
>>  Documentation/gpu/rfc/index.rst        |   4 +
>>  2 files changed, 214 insertions(+)
>>  create mode 100644 Documentation/gpu/rfc/i915_vm_bind.rst
>>
>> diff --git a/Documentation/gpu/rfc/i915_vm_bind.rst b/Documentation/gpu/rfc/i915_vm_bind.rst
>> new file mode 100644
>> index 000000000000..cdc6bb25b942
>> --- /dev/null
>> +++ b/Documentation/gpu/rfc/i915_vm_bind.rst
>> @@ -0,0 +1,210 @@
>> +==========================================
>> +I915 VM_BIND feature design and use cases
>> +==========================================
>> +
>> +VM_BIND feature
>> +================
>> +DRM_I915_GEM_VM_BIND/UNBIND ioctls allow UMD to bind/unbind GEM buffer
>> +objects (BOs) or sections of a BO at specified GPU virtual addresses on
>> +a specified address space (VM).
>> +
>> +These mappings (also referred to as persistent mappings) will be persistent
>> +across multiple GPU submissions (execbuff) issued by the UMD, without the user
>> +having to provide a list of all required mappings during each submission
>> +(as required by older execbuff mode).
>> +
>> +The VM_BIND ioctl defers binding the mappings until the next execbuff submission
>> +where it will be required, or binds immediately if the I915_GEM_VM_BIND_IMMEDIATE
>> +flag is set (useful if the mapping is required for an active context).
>
>So this is a screw-up I've done, and for upstream I think we need to fix
>it: Implicit sync is bad, and it's also still a bad idea for vm_bind, and
>I was wrong suggesting we should do this a few years back when we kicked
>this off internally :-(
>
>What I think we need is just always VM_BIND_IMMEDIATE mode, and then a few
>things on top:
>- in and out fences, like with execbuf, to allow userspace to sync with
>  execbuf as needed
>- for compute-mode context this means userspace memory fences
>- for legacy context this means a timeline syncobj in drm_syncobj
>
>No sync_file or anything else like this at all. This means a bunch of
>work, but also it'll have benefits because it means we should be able to
>use exactly the same code paths and logic for both compute and for legacy
>context, because drm_syncobj support future fence semantics.
>

Thanks Daniel,
Ok, will update

>Also on the implementation side we still need to install dma_fence to the
>various dma_resv, and for this we need the new dma_resv_usage series from
>Christian König first. vm_bind fences can then use the USAGE_BOOKKEEPING
>flag to make sure they never result in an oversync issue with execbuf. I
>don't think trying to land vm_bind without that prep work in
>dma_resv_usage makes sense.
>

Ok, but that is not a dependency for this VM_BIND design RFC patch, right?
I will add this to the documentation here.

>Also as soon as dma_resv_usage has landed there's a few cleanups we should
>do in i915:
>- ttm bo moving code should probably simplify a bit (and maybe more of the
>  code should be pushed as helpers into ttm)
>- clflush code should be moved over to using USAGE_KERNEL and the various
>  hacks and special cases should be ditched. See df94fd05e69e ("drm/i915:
>  expand on the kernel-doc for cache_dirty") for a bit more context
>
>This is still not yet enough, since if a vm_bind races with an eviction we
>might stall on the new buffers being readied first before the context can
>continue. This needs some care to make sure that vma which aren't fully
>bound yet are on a separate list, and vma which are marked for unbinding
>are removed from the main working set list as soon as possible.
>
>All of these things are relevant for the uapi semantics, which means
>- they need to be documented in the uapi kerneldoc, ideally with example
>  flows
>- umd need to ack this
>

Ok

>The other thing here is the async/nonblocking path. I think we still need
>that one, but again it should not sync with anything going on in execbuf,
>but simply execute the ioctl code in a kernel thread. The idea here is
>that this works like a special gpu engine, so that compute and vk can
>schedule bindings interleaved with rendering. This should be enough to get
>a performant vk sparse binding/textures implementation.
>
>But I'm not entirely sure on this one, so this definitely needs acks from
>umds.
>
>> +VM_BIND feature is advertised to user via I915_PARAM_HAS_VM_BIND.
>> +User has to opt-in for VM_BIND mode of binding for an address space (VM)
>> +during VM creation time via I915_VM_CREATE_FLAGS_USE_VM_BIND extension.
>> +A VM in VM_BIND mode will not support older execbuff mode of binding.
>> +
>> +UMDs can still send BOs of these persistent mappings in execlist of execbuff
>> +for specifying BO dependencies (implicit fencing) and to use BO as a batch,
>> +but those BOs should be mapped ahead via vm_bind ioctl.
>
>should or must?
>

Must, will fix.

>Also I'm not really sure that's a great interface. The batchbuffer really
>only needs to be an address, so maybe all we need is an extension to
>supply an u64 batchbuffer address instead of trying to retrofit this into
>an unfitting current uapi.
>

Yah, this was considered, but it was decided to do it as a later optimization.
But if we were to remove execlist entries completely (i.e., no implicit
sync either), then we need to do this from the beginning.

>And for implicit sync there's two things:
>- for vk I think the right uapi is the dma-buf fence import/export ioctls
>  from Jason Ekstrand. I think we should land that first instead of
>  hacking funny concepts together

I did not fully understand; can you point me to it?

>- for gl the dma-buf import/export might not be fast enough, since gl
>  needs to do a _lot_ of implicit sync. There we might need to use the
>  execbuffer buffer list, but then we should have extremely clear uapi
>  rules which disallow _everything_ except setting the explicit sync uapi
>

Ok, so then we still need to support implicit sync in vm_bind mode, right?

>Again all this stuff needs to be documented in detail in the kerneldoc
>uapi spec.
>

ok

>> +VM_BIND features include,
>> +- Multiple Virtual Address (VA) mappings can map to the same physical pages
>> +  of an object (aliasing).
>> +- VA mapping can map to a partial section of the BO (partial binding).
>> +- Support capture of persistent mappings in the dump upon GPU error.
>> +- TLB is flushed upon unbind completion. Batching of TLB flushes in some
>> +  usecases will be helpful.
>> +- Asynchronous vm_bind and vm_unbind support.
>> +- VM_BIND uses user/memory fence mechanism for signaling bind completion
>> +  and for signaling batch completion in long running contexts (explained
>> +  below).
>
>This should all be in the kerneldoc.
>

ok

>> +VM_PRIVATE objects
>> +------------------
>> +By default, BOs can be mapped on multiple VMs and can also be dma-buf
>> +exported. Hence these BOs are referred to as Shared BOs.
>> +During each execbuff submission, the request fence must be added to the
>> +dma-resv fence list of all shared BOs mapped on the VM.
>> +
>> +VM_BIND feature introduces an optimization where user can create BO which
>> +is private to a specified VM via I915_GEM_CREATE_EXT_VM_PRIVATE flag during
>> +BO creation. Unlike Shared BOs, these VM private BOs can only be mapped on
>> +the VM they are private to and can't be dma-buf exported.
>> +All private BOs of a VM share the dma-resv object. Hence during each execbuff
>> +submission, they need only one dma-resv fence list updated. Thus the fast
>> +path (where required mappings are already bound) submission latency is O(1)
>> +w.r.t the number of VM private BOs.
>
>Two things:
>
>- I think the above is required to for initial vm_bind for vk, it kinda
>  doesn't make much sense without that, and will allow us to match amdgpu
>  and radeonsi
>
>- Christian König just landed ttm bulk lru helpers, and I think we need to
>  use those. This means vm_bind will only work with the ttm backend, but
>  that's what we have for the big dgpu where vm_bind helps more in terms
>  of performance, and the igfx conversion to ttm is already going on.
>

ok

>Furthermore the i915 shrinker lru has stopped being an lru, so I think
>that should also be moved over to the ttm lru in some fashion to make sure
>we once again have a reasonable and consistent memory aging and reclaim
>architecture. The current code is just too much of a complete mess.
>
>And since this is all fairly integral to how the code arch works I don't
>think merging a different version which isn't based on ttm bulk lru
>helpers makes sense.
>
>Also I do think the page table lru handling needs to be included here,
>because that's another complete hand-rolled separate world for not much
>good reasons. I guess that can happen in parallel with the initial vm_bind
>bring-up, but it needs to be completed by the time we add the features
>beyond the initial support needed for vk.
>

Ok

>> +VM_BIND locking hierarchy
>> +-------------------------
>> +VM_BIND locking order is as below.
>> +
>> +1) A vm_bind mutex will protect vm_bind lists. This lock is taken in vm_bind/
>> +   vm_unbind ioctl calls, in the execbuff path and while releasing the mapping.
>> +
>> +   In future, when GPU page faults are supported, we can potentially use a
>> +   rwsem instead, so that multiple pagefault handlers can take the read side
>> +   lock to lookup the mapping and hence can run in parallel.
>> +
>> +2) The BO's dma-resv lock will protect i915_vma state and needs to be held
>> +   while binding a vma and while updating dma-resv fence list of a BO.
>> +   The private BOs of a VM will all share a dma-resv object.
>> +
>> +   This lock is held in vm_bind call for immediate binding, during vm_unbind
>> +   call for unbinding and during execbuff path for binding the mapping and
>> +   updating the dma-resv fence list of the BO.
>> +
>> +3) Spinlock/s to protect some of the VM's lists.
>> +
>> +We will also need support for bulk LRU movement of persistent mappings to
>> +avoid additional latencies in the execbuff path.
>
>This needs more detail and explanation of how each level is required. Also
>the shared dma_resv for VM_PRIVATE objects is kinda important to explain.
>
>Like "some of the VM's lists" explains pretty much nothing.
>

Ok, will explain.

>> +
>> +GPU page faults
>> +----------------
>> +Both older execbuff mode and the newer VM_BIND mode of binding will require
>> +using dma-fence to ensure residency.
>> +In future when GPU page faults are supported, no dma-fence usage is required
>> +as residency is purely managed by installing and removing/invalidating ptes.
>
>This is a bit confusing. I think one part of this should be moved into the
>section with future vm_bind use-cases (we're not going to support page
>faults with legacy softpin or even worse, relocations). The locking
>discussion should be part of the much longer list of uses cases that
>motivate the locking design.
>

Ok, will move.

>> +
>> +
>> +User/Memory Fence
>> +==================
>> +The idea is to take a user specified virtual address and install an interrupt
>> +handler to wake up the current task when the memory location passes the user
>> +supplied filter.
>> +
>> +User/Memory fence is an <address, value> pair. To signal the user fence, the
>> +specified value will be written at the specified virtual address and the
>> +waiting process will be woken up. User can wait on a user fence with the
>> +gem_wait_user_fence ioctl.
>> +
>> +It also allows the user to emit their own MI_FLUSH/PIPE_CONTROL notify
>> +interrupt within their batches after updating the value to have sub-batch
>> +precision on the wakeup. Each batch can signal a user fence to indicate
>> +the completion of the next level batch. The completion of the first level batch
>> +needs to be signaled by the command streamer. The user must provide the
>> +user/memory fence for this via the DRM_I915_GEM_EXECBUFFER_EXT_USER_FENCE
>> +extension of execbuff ioctl, so that KMD can setup the command streamer to
>> +signal it.
>> +
>> +User/Memory fence can also be supplied to the kernel driver to signal/wake up
>> +the user process after completion of an asynchronous operation.
>> +
>> +When VM_BIND ioctl was provided with a user/memory fence via the
>> +I915_VM_BIND_EXT_USER_FENCE extension, it will be signaled upon the completion
>> +of the binding of that mapping. All async binds/unbinds are serialized, hence
>> +signaling of a user/memory fence also indicates the completion of all previous
>> +binds/unbinds.
>> +
>> +This feature will be derived from the below original work:
>> +https://patchwork.freedesktop.org/patch/349417/
>
>This is 1:1 tied to long running compute mode contexts (which in the uapi
>doc must reference the endless amounts of bikeshed summary we have in the
>docs about indefinite fences).
>

Ok, will check and add reference.

>I'd put this into a new section about compute and userspace memory fences
>support, with this and the next chapter ...

ok

>> +
>> +
>> +VM_BIND use cases
>> +==================
>
>... and then make this section here focus entirely on additional vm_bind
>use-cases that we'll be adding later on. Which doesn't need to go into any
>details, it's just justification for why we want to build the world on top
>of vm_bind.
>

ok

>> +
>> +Long running Compute contexts
>> +------------------------------
>> +Usage of dma-fences expects that they complete in a reasonable amount of time.
>> +Compute on the other hand can be long running. Hence it is appropriate for
>> +compute to use user/memory fence and dma-fence usage will be limited to
>> +in-kernel consumption only. This requires an execbuff uapi extension to pass
>> +in user fence. Compute must opt-in for this mechanism with
>> +I915_CONTEXT_CREATE_FLAGS_LONG_RUNNING flag during context creation.
>> +
>> +The dma-fence based user interfaces like gem_wait ioctl, execbuff out fence
>> +and implicit dependency setting is not allowed on long running contexts.
>> +
>> +Where GPU page faults are not available, kernel driver upon buffer invalidation
>> +will initiate a suspend (preemption) of long running context with a dma-fence
>> +attached to it. And upon completion of that suspend fence, finish the
>> +invalidation, revalidate the BO and then resume the compute context. This is
>> +done by having a per-context fence (called suspend fence) proxying as
>> +i915_request fence. This suspend fence is enabled when there is a wait on it,
>> +which triggers the context preemption.
>> +
>> +This is much easier to support with VM_BIND compared to the current heavier
>> +execbuff path resource attachment.
>
>There's a bunch of tricky code around compute mode context support, like
>the preempt ctx fence (or suspend fence or whatever you want to call it),
>and the resume work. And I think that code should be shared across
>drivers.
>
>I think the right place to put this is into drm/sched, somewhere attached
>to the drm_sched_entity structure. I expect i915 folks to collaborate with
>amd and ideally also get amdkfd to adopt the same thing if possible. At
>least Christian has mentioned in the past that he's a bit unhappy about
>how this works.
>
>Also drm/sched has dependency tracking, which will be needed to pipeline
>context resume operations. That needs to be used instead of i915-gem
>inventing yet another dependency tracking data structure (it already has 3
>and that's roughly 3 too many).
>
>This means compute mode support and userspace memory fences are blocked on
>the drm/sched conversion, but *eh* add it to the list of reasons for why
>drm/sched needs to happen.
>
>Also since we only have support for compute mode ctx in our internal tree
>with the guc scheduler backend anyway, and the first conversion target is
>the guc backend, I don't think this actually holds up a lot of the code.
>

Hmm...ok. Currently, the context suspend and resume operations in our
internal tree go through an orthogonal GuC interface (not through the
scheduler). So, I need to look more into this part.

>> +Low Latency Submission
>> +-----------------------
>> +Allows compute UMD to directly submit GPU jobs instead of through execbuff
>> +ioctl. VM_BIND allows map/unmap of BOs required for directly submitted jobs.
>
>This is really just a special case of compute mode contexts, I think I'd
>include that in there, but explain better what it requires (i.e. vm_bind
>not being synchronized against execbuf).
>

ok

>> +
>> +Debugger
>> +---------
>> +With debug event interface user space process (debugger) is able to keep track
>> +of and act upon resources created by another process (debuggee) and attached
>> +to GPU via vm_bind interface.
>> +
>> +Mesa/Vulkan
>> +------------
>> +VM_BIND can potentially reduce the CPU-overhead in Mesa thus improving
>> +performance. For Vulkan it should be straightforward to use VM_BIND.
>> +For Iris, implicit buffer tracking must be implemented before we can harness
>> +VM_BIND benefits. With increasing GPU hardware performance reducing CPU
>> +overhead becomes more important.
>
>Just to clarify, I don't think we can land vm_bind into upstream if it
>doesn't work 100% for vk. There's a bit much "can" instead of "will in
>this section".
>

ok, will explain better.

>> +
>> +Page level hints settings
>> +--------------------------
>> +VM_BIND allows any hints setting per mapping instead of per BO.
>> +Possible hints include read-only, placement and atomicity.
>> +Sub-BO level placement hint will be even more relevant with
>> +upcoming GPU on-demand page fault support.
>> +
>> +Page level Cache/CLOS settings
>> +-------------------------------
>> +VM_BIND allows cache/CLOS settings per mapping instead of per BO.
>> +
>> +Shared Virtual Memory (SVM) support
>> +------------------------------------
>> +VM_BIND interface can be used to map system memory directly (without gem BO
>> +abstraction) using the HMM interface.
>
>Userptr is absent here (and it's not the same as svm, at least on
>discrete), and this is needed for the initial version since otherwise vk
>can't use it because we're not at feature parity.
>

userptr gem objects are supported in the initial version (and yes, it is not
the same as SVM). I did not add it here as there is no additional uapi
change required to support that.

>Irc discussions by Maarten and Dave came up with the idea that maybe
>userptr for vm_bind should work _without_ any gem bo as backing storage,
>since that guarantees that people don't come up with funny ideas like
>trying to share such bo across process or mmap it and other nonsense which
>just doesn't work.
>

Hmm...there is no plan to support userptr _without_ a gem bo, at least not in
the initial vm_bind support. Is it OK to put it in the 'futures' section?

>> +
>> +
>> +Broader i915 cleanups
>> +=====================
>> +Supporting this whole new vm_bind mode of binding which comes with its own
>> +usecases to support and the locking requirements requires proper integration
>> +with the existing i915 driver. This calls for some broader i915 driver
>> +cleanups/simplifications for maintainability of the driver going forward.
>> +Here are few things identified and are being looked into.
>> +
>> +- Make pagetable allocations evictable and manage them similar to VM_BIND
>> +  mapped objects. Page table pages are similar to persistent mappings of a
>> +  VM (difference here are that the page table pages will not
>> +  have an i915_vma structure and after swapping pages back in, parent page
>> +  link needs to be updated).
>
>See above, but I think this should be included as part of the initial
>vm_bind push.
>

Ok, as you mentioned above, we can do it soon after the initial vm_bind support
lands, but before we add any new vm_bind features.

>> +- Remove vma lookup cache (eb->gem_context->handles_vma). VM_BIND feature
>> +  do not use it and complexity it brings in is probably more than the
>> +  performance advantage we get in legacy execbuff case.
>> +- Remove vma->open_count counting
>> +- Remove i915_vma active reference tracking. Instead use underlying BO's
>> +  dma-resv fence list to determine if a i915_vma is active or not.
>
>So this is a complete mess, and really should not exist. I think it needs
>to be removed before we try to make i915_vma even more complex by adding
>vm_bind.
>

Hmm...Need to look into this. I am not sure how much of an effort it is going
to be to remove i915_vma active reference tracking and instead use dma_resv
fences for activeness tracking.

>The other thing I've been pondering here is that vm_bind is really
>completely different from legacy vm structures for a lot of reasons:
>- no relocation or softpin handling, which means vm_bind has no reason to
>  ever look at the i915_vma structure in execbuf code. Unfortunately
>  execbuf has been rewritten to be vma instead of obj centric, so it's a
>  100% mismatch
>
>- vm_bind never has to manage any vm lru. Legacy execbuf has to maintain
>  that because the kernel manages the virtual address space fully. Again
>  ideally that entire vma_move_to_active code and everything related to it
>  would simply not exist.
>
>- similar on the eviction side, the rules are quite different: For vm_bind
>  we never tear down the vma, instead it's just moved to the list of
>  evicted vma. Legacy vm have no need for all these additional lists, so
>  another huge confusion.
>
>- if the refcount is done correctly for vm_bind we wouldn't need the
>  tricky code in the bo close paths. Unfortunately legacy vm with
>  relocations and softpin require that vma are only a weak reference, so
>  that cannot be removed.
>
>- there's also a ton of special cases for ggtt handling, like the
>  different views (for display, partial views for mmap), but also the
>  gen2/3 alignment and padding requirements which vm_bind never needs.
>
>I think the right thing here is to massively split the implementation
>behind some solid vm/vma abstraction, with a base clase for vm and vma
>which _only_ has the pieces which both vm_bind and the legacy vm stuff
>needs. But it's a bit tricky to get there. I think a workable path would
>be:
>- Add a new base class to both i915_address_space and i915_vma, which
>  starts out empty.
>
>- As vm_bind code lands, move things that vm_bind code needs into these
>  base classes
>

Ok

>- The goal should be that these base classes are a stand-alone library
>  that other drivers could reuse. Like we've done with the buddy
>  allocator, which first moved from i915-gem to i915-ttm, and which amd
>  now moved to drm/ttm for reuse by amdgpu. Ideally other drivers
>  interested in adding something like vm_bind should be involved from the
>  start (or maybe the entire thing reused in amdgpu, they're looking at
>  vk sparse binding support too or at least have perf issues I think).
>
>- Locking must be the same across all implementations, otherwise it's
>  really not an abstraction. i915 screwed this up terribly by having
>  different locking rules for ppgtt and ggtt, which is just nonsense.
>
>- The legacy specific code needs to be extracted as much as possible and
>  shoved into separate files. In execbuf this means we need to get back to
>  object centric flow, and the slowpaths need to become a lot simpler
>  again (Maarten has cleaned up some of this, but there's still a silly
>  amount of hacks in there with funny layering).
>

This also we can do soon after the vm_bind code lands, right?

>- I think if stuff like the vma eviction details (list movement and
>  locking and refcounting of the underlying object)
>
>> +
>> +These can be worked upon after initial vm_bind support is added.
>
>I don't think that works, given how badly i915-gem team screwed up in
>other places. And those places had to be fixed by adopting shared code
>like ttm. Plus there's already a huge unfulfilled promise pending with the
>drm/sched conversion, i915-gem team is clearly deeply in the red here :-/
>

Hmmm, ok. As I mentioned above, I need to look into how to remove the i915_vma
active reference tracking code from the i915 driver. I wonder if there is any
middle ground here, like not using it in vm_bind mode?

Niranjana

>Cheers, Daniel
>
>> +
>> +
>> +UAPI
>> +=====
>> +Uapi definition can be found here:
>> +.. kernel-doc:: Documentation/gpu/rfc/i915_vm_bind.h
>> diff --git a/Documentation/gpu/rfc/index.rst b/Documentation/gpu/rfc/index.rst
>> index 91e93a705230..7d10c36b268d 100644
>> --- a/Documentation/gpu/rfc/index.rst
>> +++ b/Documentation/gpu/rfc/index.rst
>> @@ -23,3 +23,7 @@ host such documentation:
>>  .. toctree::
>>
>>      i915_scheduler.rst
>> +
>> +.. toctree::
>> +
>> +    i915_vm_bind.rst
>> --
>> 2.21.0.rc0.32.g243a4c7e27
>>
>
>--
>Daniel Vetter
>Software Engineer, Intel Corporation
>http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [Intel-gfx] [RFC v2 1/2] drm/doc/rfc: VM_BIND feature design document
@ 2022-04-20 22:45       ` Niranjana Vishwanathapura
  0 siblings, 0 replies; 31+ messages in thread
From: Niranjana Vishwanathapura @ 2022-04-20 22:45 UTC (permalink / raw)
  To: Daniel Vetter
  Cc: chris.p.wilson, dri-devel, Dave Airlie, intel-gfx, Daniel Stone,
	Ben Skeggs, daniel.vetter, thomas.hellstrom,
	Christian König

On Thu, Mar 31, 2022 at 10:28:48AM +0200, Daniel Vetter wrote:
>Adding a pile of people who've expressed interest in vm_bind for their
>drivers.
>
>Also note to the intel folks: This is largely written with me having my
>subsystem co-maintainer hat on, i.e. what I think is the right thing to do
>here for the subsystem at large. There is substantial rework involved
>here, but it's not any different from i915 adopting ttm or i915 adopting
>drm/sched, and I do think this stuff needs to happen in one form or
>another.
>
>On Mon, Mar 07, 2022 at 12:31:45PM -0800, Niranjana Vishwanathapura wrote:
>> VM_BIND design document with description of intended use cases.
>>
>> Signed-off-by: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>
>> ---
>>  Documentation/gpu/rfc/i915_vm_bind.rst | 210 +++++++++++++++++++++++++
>>  Documentation/gpu/rfc/index.rst        |   4 +
>>  2 files changed, 214 insertions(+)
>>  create mode 100644 Documentation/gpu/rfc/i915_vm_bind.rst
>>
>> diff --git a/Documentation/gpu/rfc/i915_vm_bind.rst b/Documentation/gpu/rfc/i915_vm_bind.rst
>> new file mode 100644
>> index 000000000000..cdc6bb25b942
>> --- /dev/null
>> +++ b/Documentation/gpu/rfc/i915_vm_bind.rst
>> @@ -0,0 +1,210 @@
>> +==========================================
>> +I915 VM_BIND feature design and use cases
>> +==========================================
>> +
>> +VM_BIND feature
>> +================
>> +DRM_I915_GEM_VM_BIND/UNBIND ioctls allow the UMD to bind/unbind GEM buffer
>> +objects (BOs) or sections of a BO at specified GPU virtual addresses in
>> +a specified address space (VM).
>> +
>> +These mappings (also referred to as persistent mappings) will be persistent
>> +across multiple GPU submissions (execbuff) issued by the UMD, without the
>> +user having to provide a list of all required mappings during each
>> +submission (as required by the older execbuff mode).
>> +
>> +The VM_BIND ioctl defers binding the mappings until the next execbuff
>> +submission where they will be required, or binds immediately if the
>> +I915_GEM_VM_BIND_IMMEDIATE flag is set (useful if a mapping is required
>> +for an active context).
>
>So this is a screw-up I've done, and for upstream I think we need to fix
>it: Implicit sync is bad, and it's also still a bad idea for vm_bind, and
>I was wrong suggesting we should do this a few years back when we kicked
>this off internally :-(
>
>What I think we need is just always VM_BIND_IMMEDIATE mode, and then a few
>things on top:
>- in and out fences, like with execbuf, to allow userspace to sync with
>  execbuf as needed
>- for compute-mode context this means userspace memory fences
>- for legacy context this means a timeline syncobj in drm_syncobj
>
>No sync_file or anything else like this at all. This means a bunch of
>work, but also it'll have benefits because it means we should be able to
>use exactly the same code paths and logic for both compute and for legacy
>context, because drm_syncobj support future fence semantics.
>

Thanks Daniel,
Ok, will update
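To make the always-immediate binding with explicit in/out fences concrete, here is a minimal userspace-visible sketch. All struct names, field names and rules below are hypothetical illustrations (this is not the proposed i915 uapi); the helper only models page-alignment validation:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical fence description: a timeline drm_syncobj handle plus a
 * point on that timeline. Compute-mode contexts would use a userspace
 * memory fence instead (see the User/Memory Fence section). */
struct sketch_vm_bind_fence {
	uint32_t syncobj_handle;	/* drm_syncobj to wait on / signal */
	uint64_t timeline_point;	/* point on the timeline syncobj */
};

/* Hypothetical ioctl payload: binding never implicitly syncs with
 * execbuf; ordering is expressed only through explicit in/out fences. */
struct sketch_vm_bind {
	uint32_t vm_id;		/* address space to bind into */
	uint32_t bo_handle;	/* GEM BO backing the mapping */
	uint64_t offset;	/* offset into the BO (partial binding) */
	uint64_t va;		/* GPU virtual address */
	uint64_t length;	/* length of the mapping */
	struct sketch_vm_bind_fence in_fence;	/* wait before binding */
	struct sketch_vm_bind_fence out_fence;	/* signal when bound */
};

/* Toy validation: VA, offset and length must be page aligned, and the
 * range must not be empty. */
static int sketch_vm_bind_valid(const struct sketch_vm_bind *b)
{
	const uint64_t page = 4096;

	if (b->length == 0)
		return 0;
	if ((b->va | b->offset | b->length) & (page - 1))
		return 0;
	return 1;
}
```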

>Also on the implementation side we still need to install dma_fence to the
>various dma_resv, and for this we need the new dma_resv_usage series from
>Christian König first. vm_bind fences can then use the USAGE_BOOKKEEPING
>flag to make sure they never result in an oversync issue with execbuf. I
>don't think trying to land vm_bind without that prep work in
>dma_resv_usage makes sense.
>

Ok, but that is not a dependency for this VM_BIND design RFC patch right?
I will add this to the documentation here.

>Also as soon as dma_resv_usage has landed there's a few cleanups we should
>do in i915:
>- ttm bo moving code should probably simplify a bit (and maybe more of the
>  code should be pushed as helpers into ttm)
>- clflush code should be moved over to using USAGE_KERNEL and the various
>  hacks and special cases should be ditched. See df94fd05e69e ("drm/i915:
>  expand on the kernel-doc for cache_dirty") for a bit more context
>
>This is still not yet enough, since if a vm_bind races with an eviction we
>might stall on the new buffers being readied first before the context can
>continue. This needs some care to make sure that vma which aren't fully
>bound yet are on a separate list, and vma which are marked for unbinding
>are removed from the main working set list as soon as possible.
>
>All of these things are relevant for the uapi semantics, which means
>- they need to be documented in the uapi kerneldoc, ideally with example
>  flows
>- umd need to ack this
>

Ok

>The other thing here is the async/nonblocking path. I think we still need
>that one, but again it should not sync with anything going on in execbuf,
>but simply execute the ioctl code in a kernel thread. The idea here is
>that this works like a special gpu engine, so that compute and vk can
>schedule bindings interleaved with rendering. This should be enough to get
>a performant vk sparse binding/textures implementation.
>
>But I'm not entirely sure on this one, so this definitely needs acks from
>umds.
>
>> +The VM_BIND feature is advertised to the user via I915_PARAM_HAS_VM_BIND.
>> +The user has to opt in to the VM_BIND mode of binding for an address space
>> +(VM) during VM creation time via the I915_VM_CREATE_FLAGS_USE_VM_BIND
>> +extension. A VM in VM_BIND mode will not support the older execbuff mode
>> +of binding.
>> +
>> +UMDs can still send BOs of these persistent mappings in the execlist of
>> +execbuff for specifying BO dependencies (implicit fencing) and to use a BO
>> +as a batch, but those BOs should be mapped ahead via the vm_bind ioctl.
>
>should or must?
>

Must, will fix.

>Also I'm not really sure that's a great interface. The batchbuffer really
>only needs to be an address, so maybe all we need is an extension to
>supply an u64 batchbuffer address instead of trying to retrofit this into
>an unfitting current uapi.
>

Yah, this was considered, but it was decided to do it as a later
optimization. But if we were to remove execlist entries completely (i.e., no
implicit sync either), then we need to do this from the beginning.

>And for implicit sync there's two things:
>- for vk I think the right uapi is the dma-buf fence import/export ioctls
>  from Jason Ekstrand. I think we should land that first instead of
>  hacking funny concepts together

I did not understand fully, can you point to it?

>- for gl the dma-buf import/export might not be fast enough, since gl
>  needs to do a _lot_ of implicit sync. There we might need to use the
>  execbuffer buffer list, but then we should have extremely clear uapi
>  rules which disallow _everything_ except setting the explicit sync uapi
>

Ok, so then, we still need to support implicit sync in vm_bind mode. Right?

>Again all this stuff needs to be documented in detail in the kerneldoc
>uapi spec.
>

ok

>> +VM_BIND features include,
>> +- Multiple Virtual Address (VA) mappings can map to the same physical pages
>> +  of an object (aliasing).
>> +- VA mapping can map to a partial section of the BO (partial binding).
>> +- Support capture of persistent mappings in the dump upon GPU error.
>> +- TLB is flushed upon unbind completion. Batching of TLB flushes in some
>> +  use cases will be helpful.
>> +- Asynchronous vm_bind and vm_unbind support.
>> +- VM_BIND uses user/memory fence mechanism for signaling bind completion
>> +  and for signaling batch completion in long running contexts (explained
>> +  below).
>
>This should all be in the kerneldoc.
>

ok
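As a toy illustration of the partial binding and aliasing features listed above (the data structures here are invented for this sketch; the real driver tracks mappings with i915_vma structures and an interval tree):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy model of the per-VM mapping table: each entry maps a VA range to a
 * (BO, offset) pair. Partial binding is just offset/length selecting a
 * slice of the BO; aliasing is two entries pointing at the same BO. */
struct toy_mapping {
	uint64_t va, length;	/* GPU virtual range */
	uint32_t bo_handle;	/* backing object */
	uint64_t bo_offset;	/* start offset within the BO */
};

/* Linear lookup for clarity; a real driver would use an interval tree. */
static const struct toy_mapping *
toy_lookup(const struct toy_mapping *maps, size_t n, uint64_t va)
{
	for (size_t i = 0; i < n; i++)
		if (va >= maps[i].va && va < maps[i].va + maps[i].length)
			return &maps[i];
	return NULL;
}
```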

>> +VM_PRIVATE objects
>> +------------------
>> +By default, BOs can be mapped on multiple VMs and can also be dma-buf
>> +exported. Hence these BOs are referred to as Shared BOs.
>> +During each execbuff submission, the request fence must be added to the
>> +dma-resv fence list of all shared BOs mapped on the VM.
>> +
>> +VM_BIND feature introduces an optimization where user can create BO which
>> +is private to a specified VM via I915_GEM_CREATE_EXT_VM_PRIVATE flag during
>> +BO creation. Unlike Shared BOs, these VM private BOs can only be mapped on
>> +the VM they are private to and can't be dma-buf exported.
>> +All private BOs of a VM share the dma-resv object. Hence during each execbuff
>> +submission, they need only one dma-resv fence list updated. Thus the fast
>> +path (where required mappings are already bound) submission latency is O(1)
>> +w.r.t the number of VM private BOs.
>
>Two things:
>
>- I think the above is required to for initial vm_bind for vk, it kinda
>  doesn't make much sense without that, and will allow us to match amdgpu
>  and radeonsi
>
>- Christian König just landed ttm bulk lru helpers, and I think we need to
>  use those. This means vm_bind will only work with the ttm backend, but
>  that's what we have for the big dgpu where vm_bind helps more in terms
>  of performance, and the igfx conversion to ttm is already going on.
>

ok
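A toy model of why the shared dma-resv makes the fast-path fence update O(1) w.r.t. the number of VM-private BOs (all names here are invented for illustration; the real code uses dma_resv objects and ttm helpers):

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for a dma-resv object: just counts how many times a fence
 * was added to it during one submission. */
struct toy_resv {
	int fences_added;
};

struct toy_bo {
	struct toy_resv *resv;	/* shared BOs each carry their own resv */
};

/* Per-submission fence bookkeeping: shared BOs need one fence-list
 * update each, i.e. O(number of shared BOs); all VM-private BOs point at
 * the VM's single resv, so one update covers every private BO. Returns
 * the number of fence-list updates performed. */
static int toy_submit(struct toy_resv *vm_resv,
		      struct toy_bo *shared, size_t n_shared)
{
	int updates = 0;

	vm_resv->fences_added++;	/* covers all private BOs at once */
	updates++;
	for (size_t i = 0; i < n_shared; i++) {
		shared[i].resv->fences_added++;
		updates++;
	}
	return updates;
}
```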

>Furthermore the i915 shrinker lru has stopped being an lru, so I think
>that should also be moved over to the ttm lru in some fashion to make sure
>we once again have a reasonable and consistent memory aging and reclaim
>architecture. The current code is just too much of a complete mess.
>
>And since this is all fairly integral to how the code arch works I don't
>think merging a different version which isn't based on ttm bulk lru
>helpers makes sense.
>
>Also I do think the page table lru handling needs to be included here,
>because that's another complete hand-rolled separate world for not much
>good reasons. I guess that can happen in parallel with the initial vm_bind
>bring-up, but it needs to be completed by the time we add the features
>beyond the initial support needed for vk.
>

Ok

>> +VM_BIND locking hierarchy
>> +-------------------------
>> +VM_BIND locking order is as below.
>> +
>> +1) A vm_bind mutex will protect vm_bind lists. This lock is taken in vm_bind/
>> +   vm_unbind ioctl calls, in the execbuff path and while releasing the mapping.
>> +
>> +   In the future, when GPU page faults are supported, we can potentially
>> +   use an rwsem instead, so that multiple pagefault handlers can take the
>> +   read side lock to look up the mapping and hence can run in parallel.
>> +
>> +2) The BO's dma-resv lock will protect i915_vma state and needs to be held
>> +   while binding a vma and while updating dma-resv fence list of a BO.
>> +   The private BOs of a VM will all share a dma-resv object.
>> +
>> +   This lock is held in vm_bind call for immediate binding, during vm_unbind
>> +   call for unbinding and during execbuff path for binding the mapping and
>> +   updating the dma-resv fence list of the BO.
>> +
>> +3) Spinlock/s to protect some of the VM's lists.
>> +
>> +We will also need support for bulk LRU movement of persistent mappings to
>> +avoid additional latencies in the execbuff path.
>
>This needs more detail and explanation of how each level is required. Also
>the shared dma_resv for VM_PRIVATE objects is kinda important to explain.
>
>Like "some of the VM's lists" explains pretty much nothing.
>

Ok, will explain.
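The documented lock order can be illustrated with a toy order checker (purely illustrative and invented for this sketch; the real driver would rely on lockdep annotations, not anything like this):

```c
#include <assert.h>

/* Toy lock-order checker for the documented hierarchy:
 *   1) vm_bind mutex -> 2) BO dma-resv lock -> 3) VM list spinlock(s)
 * Acquiring a lock whose level is not strictly deeper than the deepest
 * lock already held is an ordering violation and is rejected. */
enum lock_level { LOCK_VM_BIND = 1, LOCK_RESV = 2, LOCK_VM_LISTS = 3 };

static enum lock_level held[8];	/* stack of currently-held levels */
static int depth;

static int try_acquire(enum lock_level level)
{
	if (depth > 0 && level <= held[depth - 1])
		return 0;	/* would invert the documented order */
	held[depth++] = level;
	return 1;
}

static void release_last(void)
{
	assert(depth > 0);
	depth--;
}
```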

>> +
>> +GPU page faults
>> +----------------
>> +Both the older execbuff mode and the newer VM_BIND mode of binding will
>> +require using dma-fence to ensure residency. In the future, when GPU page
>> +faults are supported, no dma-fence usage is required, as residency is
>> +purely managed by installing and removing/invalidating ptes.
>
>This is a bit confusing. I think one part of this should be moved into the
>section with future vm_bind use-cases (we're not going to support page
>faults with legacy softpin or even worse, relocations). The locking
>discussion should be part of the much longer list of uses cases that
>motivate the locking design.
>

Ok, will move.

>> +
>> +
>> +User/Memory Fence
>> +==================
>> +The idea is to take a user-specified virtual address and install an
>> +interrupt handler to wake up the current task when the memory location
>> +passes the user supplied filter.
>> +
>> +A user/memory fence is an <address, value> pair. To signal the user fence,
>> +the specified value is written at the specified virtual address and the
>> +waiting process is woken up. The user can wait on a user fence with the
>> +gem_wait_user_fence ioctl.
>> +
>> +It also allows the user to emit their own MI_FLUSH/PIPE_CONTROL notify
>> +interrupt within their batches after updating the value to have sub-batch
>> +precision on the wakeup. Each batch can signal a user fence to indicate
>> +the completion of the next-level batch. The completion of the very first
>> +level batch needs to be signaled by the command streamer. The user must
>> +provide the user/memory fence for this via the
>> +DRM_I915_GEM_EXECBUFFER_EXT_USER_FENCE extension of the execbuff ioctl, so
>> +that the KMD can set up the command streamer to signal it.
>> +
>> +User/Memory fence can also be supplied to the kernel driver to signal/wake up
>> +the user process after completion of an asynchronous operation.
>> +
>> +When the VM_BIND ioctl is provided with a user/memory fence via the
>> +I915_VM_BIND_EXT_USER_FENCE extension, it will be signaled upon the
>> +completion of binding of that mapping. All async binds/unbinds are
>> +serialized, hence signaling of the user/memory fence also indicates the
>> +completion of all previous binds/unbinds.
>> +
>> +This feature will be derived from the below original work:
>> +https://patchwork.freedesktop.org/patch/349417/
>
>This is 1:1 tied to long running compute mode contexts (which in the uapi
>doc must reference the endless amounts of bikeshed summary we have in the
>docs about indefinite fences).
>

Ok, will check and add reference.
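A minimal model of the <address, value> user fence semantics described above, using C11 atomics. Only the memory side is modeled here; the interrupt-driven wakeup is omitted, and all names are invented for this sketch:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Toy user/memory fence: an <address, value> pair. Signaling stores the
 * value at the address; checking compares the stored value against the
 * filter. The real interface additionally installs an interrupt handler
 * to wake sleeping waiters. */
struct mem_fence {
	_Atomic uint64_t *addr;	/* user-specified virtual address */
	uint64_t value;		/* value that marks the fence signaled */
};

static void mem_fence_signal(struct mem_fence *f)
{
	atomic_store_explicit(f->addr, f->value, memory_order_release);
}

/* EQ filter, as a gem_wait_user_fence-style wait might implement it. */
static int mem_fence_signaled(const struct mem_fence *f)
{
	return atomic_load_explicit(f->addr, memory_order_acquire) == f->value;
}
```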

>I'd put this into a new section about compute and userspace memory fences
>support, with this and the next chapter ...

ok

>> +
>> +
>> +VM_BIND use cases
>> +==================
>
>... and then make this section here focus entirely on additional vm_bind
>use-cases that we'll be adding later on. Which doesn't need to go into any
>details, it's just justification for why we want to build the world on top
>of vm_bind.
>

ok

>> +
>> +Long running Compute contexts
>> +------------------------------
>> +Usage of dma-fence expects that they complete in a reasonable amount of
>> +time. Compute, on the other hand, can be long running. Hence it is
>> +appropriate for compute to use user/memory fences, and dma-fence usage
>> +will be limited to in-kernel consumption only. This requires an execbuff
>> +uapi extension to pass in the user fence. Compute must opt in for this
>> +mechanism with the I915_CONTEXT_CREATE_FLAGS_LONG_RUNNING flag during
>> +context creation.
>> +
>> +The dma-fence based user interfaces like the gem_wait ioctl, execbuff out
>> +fence and implicit dependency setting are not allowed on long running
>> +contexts.
>> +
>> +Where GPU page faults are not available, the kernel driver, upon buffer
>> +invalidation, will initiate a suspend (preemption) of the long running
>> +context with a dma-fence attached to it. And upon completion of that
>> +suspend fence, it will finish the invalidation, revalidate the BO and then
>> +resume the compute context. This is done by having a per-context fence
>> +(called a suspend fence) proxying as the i915_request fence. This suspend
>> +fence is enabled when there is a wait on it, which triggers the context
>> +preemption.
>> +
>> +This is much easier to support with VM_BIND compared to the current heavier
>> +execbuff path resource attachment.
>
>There's a bunch of tricky code around compute mode context support, like
>the preempt ctx fence (or suspend fence or whatever you want to call it),
>and the resume work. And I think that code should be shared across
>drivers.
>
>I think the right place to put this is into drm/sched, somewhere attached
>to the drm_sched_entity structure. I expect i915 folks to collaborate with
>amd and ideally also get amdkfd to adopt the same thing if possible. At
>least Christian has mentioned in the past that he's a bit unhappy about
>how this works.
>
>Also drm/sched has dependency tracking, which will be needed to pipeline
>context resume operations. That needs to be used instead of i915-gem
>inventing yet another dependency tracking data structure (it already has 3
>and that's roughly 3 too many).
>
>This means compute mode support and userspace memory fences are blocked on
>the drm/sched conversion, but *eh* add it to the list of reasons for why
>drm/sched needs to happen.
>
>Also since we only have support for compute mode ctx in our internal tree
>with the guc scheduler backend anyway, and the first conversion target is
>the guc backend, I don't think this actually holds up a lot of the code.
>

Hmm...ok. Currently, the context suspend and resume operations in our
internal tree go through an orthogonal guc interface (not through the
scheduler). So, I need to look more into this part.
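The suspend-fence flow described in the quoted section, reduced to a toy state machine (all names invented for this sketch; the real flow involves dma-fence enable_signaling callbacks and the scheduler/guc backend):

```c
#include <assert.h>

/* Toy model of the "suspend fence" flow for long-running contexts: the
 * fence proxies as a request fence and only triggers context preemption
 * once somebody actually waits on (enables) it. */
enum ctx_state { CTX_RUNNING, CTX_SUSPENDING, CTX_SUSPENDED, CTX_RESUMED };

struct lr_context {
	enum ctx_state state;
	int suspend_fence_enabled;
};

/* A wait on the suspend fence enables it, which kicks preemption. */
static void suspend_fence_enable(struct lr_context *ctx)
{
	if (!ctx->suspend_fence_enabled) {
		ctx->suspend_fence_enabled = 1;
		ctx->state = CTX_SUSPENDING;
	}
}

/* Called when preemption completes: the invalidation can now finish and
 * the BOs can be revalidated. */
static void suspend_fence_signal(struct lr_context *ctx)
{
	assert(ctx->state == CTX_SUSPENDING);
	ctx->state = CTX_SUSPENDED;
}

static void lr_context_resume(struct lr_context *ctx)
{
	assert(ctx->state == CTX_SUSPENDED);
	ctx->suspend_fence_enabled = 0;
	ctx->state = CTX_RESUMED;
}
```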

>> +Low Latency Submission
>> +-----------------------
>> +Allows compute UMD to directly submit GPU jobs instead of through execbuff
>> +ioctl. VM_BIND allows map/unmap of BOs required for directly submitted jobs.
>
>This is really just a special case of compute mode contexts, I think I'd
>include that in there, but explain better what it requires (i.e. vm_bind
>not being synchronized against execbuf).
>

ok

>> +
>> +Debugger
>> +---------
>> +With the debug event interface, a user space process (debugger) is able to
>> +keep track of and act upon resources created by another process (debuggee)
>> +and attached to the GPU via the vm_bind interface.
>> +
>> +Mesa/Vulkan
>> +------------
>> +VM_BIND can potentially reduce the CPU overhead in Mesa, thus improving
>> +performance. For Vulkan it should be straightforward to use VM_BIND.
>> +For Iris, implicit buffer tracking must be implemented before we can
>> +harness VM_BIND benefits. With increasing GPU hardware performance,
>> +reducing CPU overhead becomes more important.
>
>Just to clarify, I don't think we can land vm_bind into upstream if it
>doesn't work 100% for vk. There's a bit much "can" instead of "will in
>this section".
>

ok, will explain better.

>> +
>> +Page level hints settings
>> +--------------------------
>> +VM_BIND allows any hints setting per mapping instead of per BO.
>> +Possible hints include read-only, placement and atomicity.
>> +Sub-BO level placement hint will be even more relevant with
>> +upcoming GPU on-demand page fault support.
>> +
>> +Page level Cache/CLOS settings
>> +-------------------------------
>> +VM_BIND allows cache/CLOS settings per mapping instead of per BO.
>> +
>> +Shared Virtual Memory (SVM) support
>> +------------------------------------
>> +VM_BIND interface can be used to map system memory directly (without gem BO
>> +abstraction) using the HMM interface.
>
>Userptr is absent here (and it's not the same as svm, at least on
>discrete), and this is needed for the initial version since otherwise vk
>can't use it because we're not at feature parity.
>

userptr gem objects are supported in the initial version (and yes, it is not
the same as SVM). I did not add it here as there is no additional uapi
change required to support it.

>Irc discussions by Maarten and Dave came up with the idea that maybe
>userptr for vm_bind should work _without_ any gem bo as backing storage,
>since that guarantees that people don't come up with funny ideas like
>trying to share such bo across process or mmap it and other nonsense which
>just doesn't work.
>

Hmm...there is no plan to support userptr _without_ a gem bo, at least not
in the initial vm_bind support. Is it ok to put it in the 'futures' section?

>> +
>> +
>> +Broader i915 cleanups
>> +=====================
>> +Supporting this whole new vm_bind mode of binding, which comes with its own
>> +use cases to support, and the locking requirements requires proper
>> +integration with the existing i915 driver. This calls for some broader i915
>> +driver cleanups/simplifications for maintainability of the driver going
>> +forward. Here are a few things identified that are being looked into.
>> +
>> +- Make pagetable allocations evictable and manage them similar to VM_BIND
>> +  mapped objects. Page table pages are similar to persistent mappings of a
>> +  VM (the difference here is that the page table pages will not
>> +  have an i915_vma structure and, after swapping pages back in, the parent
>> +  page link needs to be updated).
>
>See above, but I think this should be included as part of the initial
>vm_bind push.
>

Ok, as you mentioned above, we can do it soon after initial vm_bind support
lands, but before we add any new vm_bind features.

>> +- Remove the vma lookup cache (eb->gem_context->handles_vma). The VM_BIND
>> +  feature does not use it, and the complexity it brings in is probably more
>> +  than the performance advantage we get in the legacy execbuff case.
>> +- Remove vma->open_count counting
>> +- Remove i915_vma active reference tracking. Instead use underlying BO's
>> +  dma-resv fence list to determine if a i915_vma is active or not.
>
>So this is a complete mess, and really should not exist. I think it needs
>to be removed before we try to make i915_vma even more complex by adding
>vm_bind.
>

Hmm...Need to look into this. I am not sure how much of an effort it is going
to be to remove i915_vma active reference tracking and instead use dma_resv
fences for activeness tracking.

>The other thing I've been pondering here is that vm_bind is really
>completely different from legacy vm structures for a lot of reasons:
>- no relocation or softpin handling, which means vm_bind has no reason to
>  ever look at the i915_vma structure in execbuf code. Unfortunately
>  execbuf has been rewritten to be vma instead of obj centric, so it's a
>  100% mismatch
>
>- vm_bind never has to manage any vm lru. Legacy execbuf has to maintain
>  that because the kernel manages the virtual address space fully. Again
>  ideally that entire vma_move_to_active code and everything related to it
>  would simply not exist.
>
>- similar on the eviction side, the rules are quite different: For vm_bind
>  we never tear down the vma, instead it's just moved to the list of
>  evicted vma. Legacy vm have no need for all these additional lists, so
>  another huge confusion.
>
>- if the refcount is done correctly for vm_bind we wouldn't need the
>  tricky code in the bo close paths. Unfortunately legacy vm with
>  relocations and softpin require that vma are only a weak reference, so
>  that cannot be removed.
>
>- there's also a ton of special cases for ggtt handling, like the
>  different views (for display, partial views for mmap), but also the
>  gen2/3 alignment and padding requirements which vm_bind never needs.
>
>I think the right thing here is to massively split the implementation
>behind some solid vm/vma abstraction, with a base clase for vm and vma
>which _only_ has the pieces which both vm_bind and the legacy vm stuff
>needs. But it's a bit tricky to get there. I think a workable path would
>be:
>- Add a new base class to both i915_address_space and i915_vma, which
>  starts out empty.
>
>- As vm_bind code lands, move things that vm_bind code needs into these
>  base classes
>

Ok

>- The goal should be that these base classes are a stand-alone library
>  that other drivers could reuse. Like we've done with the buddy
>  allocator, which first moved from i915-gem to i915-ttm, and which amd
>  now moved to drm/ttm for reuse by amdgpu. Ideally other drivers
>  interested in adding something like vm_bind should be involved from the
>  start (or maybe the entire thing reused in amdgpu, they're looking at
>  vk sparse binding support too or at least have perf issues I think).
>
>- Locking must be the same across all implementations, otherwise it's
>  really not an abstraction. i915 screwed this up terribly by having
>  different locking rules for ppgtt and ggtt, which is just nonsense.
>
>- The legacy specific code needs to be extracted as much as possible and
>  shoved into separate files. In execbuf this means we need to get back to
>  object centric flow, and the slowpaths need to become a lot simpler
>  again (Maarten has cleaned up some of this, but there's still a silly
>  amount of hacks in there with funny layering).
>

This also, we can do soon after vm_bind code lands right?

>- I think if stuff like the vma eviction details (list movement and
>  locking and refcounting of the underlying object)
>
>> +
>> +These can be worked upon after initial vm_bind support is added.
>
>I don't think that works, given how badly i915-gem team screwed up in
>other places. And those places had to be fixed by adopting shared code
>like ttm. Plus there's already a huge unfulfilled promise pending with the
>drm/sched conversion, i915-gem team is clearly deeply in the red here :-/
>

Hmmm ok. As I mentioned above, I need to look into how to remove i915_vma
active reference tracking code from i915 driver. Wonder if there is any
middle ground here like not using that in vm_bind mode?

Niranjana

>Cheers, Daniel
>
>> +
>> +
>> +UAPI
>> +=====
>> +Uapi definition can be found here:
>> +.. kernel-doc:: Documentation/gpu/rfc/i915_vm_bind.h
>> diff --git a/Documentation/gpu/rfc/index.rst b/Documentation/gpu/rfc/index.rst
>> index 91e93a705230..7d10c36b268d 100644
>> --- a/Documentation/gpu/rfc/index.rst
>> +++ b/Documentation/gpu/rfc/index.rst
>> @@ -23,3 +23,7 @@ host such documentation:
>>  .. toctree::
>>
>>      i915_scheduler.rst
>> +
>> +.. toctree::
>> +
>> +    i915_vm_bind.rst
>> --
>> 2.21.0.rc0.32.g243a4c7e27
>>
>
>--
>Daniel Vetter
>Software Engineer, Intel Corporation
>http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC v2 1/2] drm/doc/rfc: VM_BIND feature design document
  2022-03-31 11:37       ` [Intel-gfx] " Daniel Vetter
@ 2022-04-20 22:50         ` Niranjana Vishwanathapura
  -1 siblings, 0 replies; 31+ messages in thread
From: Niranjana Vishwanathapura @ 2022-04-20 22:50 UTC (permalink / raw)
  To: Daniel Vetter
  Cc: chris.p.wilson, dri-devel, Dave Airlie, intel-gfx, Bloomfield,
	Jon, Ben Skeggs, Jason Ekstrand, daniel.vetter, thomas.hellstrom,
	Christian König

On Thu, Mar 31, 2022 at 01:37:08PM +0200, Daniel Vetter wrote:
>One thing I've forgotten, since it's only hinted at here: If/when we
>switch tlb flushing from the current dumb&synchronous implementation
>we now have in i915 in upstream to one with batching using dma_fence,
>then I think that should be something which is done with a small
>helper library of shared code too. The batching is somewhat tricky,
>and you need to make sure you put the fence into the right
>dma_resv_usage slot, and the trick with replace the vm fence with a
>tlb flush fence is also a good reason to share the code so we only
>have it one.
>
>Christian's recent work also has some prep work for this already with
>the fence replacing trick.

Sure, but this optimization is not required for initial vm_bind support
to land, right? We can look at it soon after that. Is that ok?
I have made a reference to this TLB flush batching work in the rst file.

Niranjana
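A sketch of what seqno-based TLB flush batching could look like (hypothetical and simplified to a single counter; the real work needs the dma_fence integration and fence-replacing trick Daniel mentions above):

```c
#include <assert.h>
#include <stdint.h>

/* Toy model of batched TLB flushing: every unbind bumps a sequence
 * number, and a single flush covers all unbinds with seqno <= the value
 * captured when the flush ran. Waiters whose seqno is already covered
 * need no new flush. */
struct tlb_batch {
	uint64_t next_seqno;	/* seqno handed to the next unbind */
	uint64_t flushed_seqno;	/* highest seqno covered by a flush */
	int flush_count;	/* how many real flushes were issued */
};

static uint64_t tlb_unbind(struct tlb_batch *t)
{
	return ++t->next_seqno;
}

static void tlb_wait_flushed(struct tlb_batch *t, uint64_t seqno)
{
	if (seqno > t->flushed_seqno) {	/* not yet covered: flush once */
		t->flushed_seqno = t->next_seqno;
		t->flush_count++;
	}
}
```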

>-Daniel
>
>On Thu, 31 Mar 2022 at 10:28, Daniel Vetter <daniel@ffwll.ch> wrote:
>> Adding a pile of people who've expressed interest in vm_bind for their
>> drivers.
>>
>> Also note to the intel folks: This is largely written with me having my
>> subsystem co-maintainer hat on, i.e. what I think is the right thing to do
>> here for the subsystem at large. There is substantial rework involved
>> here, but it's not any different from i915 adopting ttm or i915 adopting
>> drm/sched, and I do think this stuff needs to happen in one form or
>> another.
>>
>> On Mon, Mar 07, 2022 at 12:31:45PM -0800, Niranjana Vishwanathapura wrote:
>> > VM_BIND design document with description of intended use cases.
>> >
>> > Signed-off-by: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>
>> > ---
>> >  Documentation/gpu/rfc/i915_vm_bind.rst | 210 +++++++++++++++++++++++++
>> >  Documentation/gpu/rfc/index.rst        |   4 +
>> >  2 files changed, 214 insertions(+)
>> >  create mode 100644 Documentation/gpu/rfc/i915_vm_bind.rst
>> >
>> > diff --git a/Documentation/gpu/rfc/i915_vm_bind.rst b/Documentation/gpu/rfc/i915_vm_bind.rst
>> > new file mode 100644
>> > index 000000000000..cdc6bb25b942
>> > --- /dev/null
>> > +++ b/Documentation/gpu/rfc/i915_vm_bind.rst
>> > @@ -0,0 +1,210 @@
>> > +==========================================
>> > +I915 VM_BIND feature design and use cases
>> > +==========================================
>> > +
>> > +VM_BIND feature
>> > +================
>> > +DRM_I915_GEM_VM_BIND/UNBIND ioctls allow the UMD to bind/unbind GEM buffer
>> > +objects (BOs) or sections of a BO at specified GPU virtual addresses in
>> > +a specified address space (VM).
>> > +
>> > +These mappings (also referred to as persistent mappings) will be persistent
>> > +across multiple GPU submissions (execbuff) issued by the UMD, without user
>> > +having to provide a list of all required mappings during each submission
>> > +(as required by older execbuff mode).
>> > +
>> > +The VM_BIND ioctl defers binding the mappings until the next execbuff
>> > +submission where they will be required, or binds immediately if the
>> > +I915_GEM_VM_BIND_IMMEDIATE flag is set (useful if a mapping is required
>> > +for an active context).
>>
>> So this is a screw-up I've done, and for upstream I think we need to fix
>> it: Implicit sync is bad, and it's also still a bad idea for vm_bind, and
>> I was wrong suggesting we should do this a few years back when we kicked
>> this off internally :-(
>>
>> What I think we need is just always VM_BIND_IMMEDIATE mode, and then a few
>> things on top:
>> - in and out fences, like with execbuf, to allow userspace to sync with
>>   execbuf as needed
>> - for compute-mode context this means userspace memory fences
>> - for legacy context this means a timeline syncobj in drm_syncobj
>>
>> No sync_file or anything else like this at all. This means a bunch of
>> work, but also it'll have benefits because it means we should be able to
>> use exactly the same code paths and logic for both compute and for legacy
>> context, because drm_syncobj support future fence semantics.
>>
>> Also on the implementation side we still need to install dma_fence to the
>> various dma_resv, and for this we need the new dma_resv_usage series from
>> Christian König first. vm_bind fences can then use the USAGE_BOOKKEEPING
>> flag to make sure they never result in an oversync issue with execbuf. I
>> don't think trying to land vm_bind without that prep work in
>> dma_resv_usage makes sense.
>>
>> Also as soon as dma_resv_usage has landed there's a few cleanups we should
>> do in i915:
>> - ttm bo moving code should probably simplify a bit (and maybe more of the
>>   code should be pushed as helpers into ttm)
>> - clflush code should be moved over to using USAGE_KERNEL and the various
>>   hacks and special cases should be ditched. See df94fd05e69e ("drm/i915:
>>   expand on the kernel-doc for cache_dirty") for a bit more context
>>
>> This is still not yet enough, since if a vm_bind races with an eviction we
>> might stall on the new buffers being readied first before the context can
>> continue. This needs some care to make sure that vma which aren't fully
>> bound yet are on a separate list, and vma which are marked for unbinding
>> are removed from the main working set list as soon as possible.
>>
>> All of these things are relevant for the uapi semantics, which means
>> - they need to be documented in the uapi kerneldoc, ideally with example
>>   flows
>> - umd need to ack this
>>
>> The other thing here is the async/nonblocking path. I think we still need
>> that one, but again it should not sync with anything going on in execbuf,
>> but simply execute the ioctl code in a kernel thread. The idea here is
>> that this works like a special gpu engine, so that compute and vk can
>> schedule bindings interleaved with rendering. This should be enough to get
>> a performant vk sparse binding/textures implementation.
>>
>> But I'm not entirely sure on this one, so this definitely needs acks from
>> umds.
>>
>> > +VM_BIND feature is advertised to user via I915_PARAM_HAS_VM_BIND.
>> > +User has to opt-in for VM_BIND mode of binding for an address space (VM)
>> > +during VM creation time via I915_VM_CREATE_FLAGS_USE_VM_BIND extension.
>> > +A VM in VM_BIND mode will not support older execbuff mode of binding.
>> > +
>> > +UMDs can still send BOs of these persistent mappings in execlist of execbuff
>> > +for specifying BO dependencies (implicit fencing) and to use BO as a batch,
>> > +but those BOs should be mapped ahead via vm_bind ioctl.
>>
>> should or must?
>>
>> Also I'm not really sure that's a great interface. The batchbuffer really
>> only needs to be an address, so maybe all we need is an extension to
>> supply an u64 batchbuffer address instead of trying to retrofit this into
>> an unfitting current uapi.
>>
>> And for implicit sync there's two things:
>> - for vk I think the right uapi is the dma-buf fence import/export ioctls
>>   from Jason Ekstrand. I think we should land that first instead of
>>   hacking funny concepts together
>> - for gl the dma-buf import/export might not be fast enough, since gl
>>   needs to do a _lot_ of implicit sync. There we might need to use the
>>   execbuffer buffer list, but then we should have extremely clear uapi
>>   rules which disallow _everything_ except setting the explicit sync uapi
>>
>> Again all this stuff needs to be documented in detail in the kerneldoc
>> uapi spec.
>>
>> > +VM_BIND features include,
>> > +- Multiple Virtual Address (VA) mappings can map to the same physical pages
>> > +  of an object (aliasing).
>> > +- VA mapping can map to a partial section of the BO (partial binding).
>> > +- Support capture of persistent mappings in the dump upon GPU error.
>> > +- TLB is flushed upon unbind completion. Batching of TLB flushes in some
>> > +  use cases will be helpful.
>> > +- Asynchronous vm_bind and vm_unbind support.
>> > +- VM_BIND uses user/memory fence mechanism for signaling bind completion
>> > +  and for signaling batch completion in long running contexts (explained
>> > +  below).
>>
>> This should all be in the kerneldoc.
>>
>> > +VM_PRIVATE objects
>> > +------------------
>> > +By default, BOs can be mapped on multiple VMs and can also be dma-buf
>> > +exported. Hence these BOs are referred to as Shared BOs.
>> > +During each execbuff submission, the request fence must be added to the
>> > +dma-resv fence list of all shared BOs mapped on the VM.
>> > +
>> > +The VM_BIND feature introduces an optimization where the user can create a BO
>> > +which is private to a specified VM via the I915_GEM_CREATE_EXT_VM_PRIVATE flag
>> > +during BO creation. Unlike Shared BOs, these VM private BOs can only be mapped on
>> > +the VM they are private to and can't be dma-buf exported.
>> > +All private BOs of a VM share the dma-resv object. Hence during each execbuff
>> > +submission, they need only one dma-resv fence list updated. Thus the fast
>> > +path (where required mappings are already bound) submission latency is O(1)
>> > +w.r.t the number of VM private BOs.
>>
>> Two things:
>>
>> - I think the above is required for initial vm_bind for vk, it kinda
>>   doesn't make much sense without that, and will allow us to match amdgpu
>>   and radeonsi
>>
>> - Christian König just landed ttm bulk lru helpers, and I think we need to
>>   use those. This means vm_bind will only work with the ttm backend, but
>>   that's what we have for the big dgpu where vm_bind helps more in terms
>>   of performance, and the igfx conversion to ttm is already going on.
>>
>> Furthermore the i915 shrinker lru has stopped being an lru, so I think
>> that should also be moved over to the ttm lru in some fashion to make sure
>> we once again have a reasonable and consistent memory aging and reclaim
>> architecture. The current code is just too much of a complete mess.
>>
>> And since this is all fairly integral to how the code arch works I don't
>> think merging a different version which isn't based on ttm bulk lru
>> helpers makes sense.
>>
>> Also I do think the page table lru handling needs to be included here,
>> because that's another complete hand-rolled separate world for not much
>> good reasons. I guess that can happen in parallel with the initial vm_bind
>> bring-up, but it needs to be completed by the time we add the features
>> beyond the initial support needed for vk.
>>
>> > +VM_BIND locking hierarchy
>> > +-------------------------
>> > +VM_BIND locking order is as below.
>> > +
>> > +1) A vm_bind mutex will protect vm_bind lists. This lock is taken in vm_bind/
>> > +   vm_unbind ioctl calls, in the execbuff path and while releasing the mapping.
>> > +
>> > +   In future, when GPU page faults are supported, we can potentially use a
>> > +   rwsem instead, so that multiple pagefault handlers can take the read side
>> > +   lock to lookup the mapping and hence can run in parallel.
>> > +
>> > +2) The BO's dma-resv lock will protect i915_vma state and needs to be held
>> > +   while binding a vma and while updating dma-resv fence list of a BO.
>> > +   The private BOs of a VM will all share a dma-resv object.
>> > +
>> > +   This lock is held in vm_bind call for immediate binding, during vm_unbind
>> > +   call for unbinding and during execbuff path for binding the mapping and
>> > +   updating the dma-resv fence list of the BO.
>> > +
>> > +3) Spinlock/s to protect some of the VM's lists.
>> > +
>> > +We will also need support for bulk LRU movement of persistent mappings to
>> > +avoid additional latencies in the execbuff path.
>>
>> This needs more detail and explanation of how each level is required. Also
>> the shared dma_resv for VM_PRIVATE objects is kinda important to explain.
>>
>> Like "some of the VM's lists" explains pretty much nothing.
>>
>> > +
>> > +GPU page faults
>> > +----------------
>> > +Both older execbuff mode and the newer VM_BIND mode of binding will require
>> > +using dma-fence to ensure residency.
>> > +In future when GPU page faults are supported, no dma-fence usage is required
>> > +as residency is purely managed by installing and removing/invalidating ptes.
>>
>> This is a bit confusing. I think one part of this should be moved into the
>> section with future vm_bind use-cases (we're not going to support page
>> faults with legacy softpin or even worse, relocations). The locking
>> discussion should be part of the much longer list of uses cases that
>> motivate the locking design.
>>
>> > +
>> > +
>> > +User/Memory Fence
>> > +==================
>> > +The idea is to take a user specified virtual address and install an interrupt
>> > +handler to wake up the current task when the memory location passes the user
>> > +supplied filter.
>> > +
>> > +User/Memory fence is an <address, value> pair. To signal the user fence, the
>> > +specified value will be written at the specified virtual address and the
>> > +waiting process will be woken up. The user can wait on a user fence with the
>> > +gem_wait_user_fence ioctl.
>> > +
>> > +It also allows the user to emit their own MI_FLUSH/PIPE_CONTROL notify
>> > +interrupt within their batches after updating the value to have sub-batch
>> > +precision on the wakeup. Each batch can signal a user fence to indicate
>> > +the completion of the next-level batch. The completion of the first-level
>> > +batch needs to be signaled by the command streamer. The user must provide the
>> > +user/memory fence for this via the DRM_I915_GEM_EXECBUFFER_EXT_USER_FENCE
>> > +extension of execbuff ioctl, so that KMD can setup the command streamer to
>> > +signal it.
>> > +
>> > +User/Memory fence can also be supplied to the kernel driver to signal/wake up
>> > +the user process after completion of an asynchronous operation.
>> > +
>> > +When the VM_BIND ioctl is provided with a user/memory fence via the
>> > +I915_VM_BIND_EXT_USER_FENCE extension, it will be signaled upon the completion
>> > +of binding of that mapping. All async binds/unbinds are serialized, hence
>> > +signaling of the user/memory fence also indicates the completion of all
>> > +previous binds/unbinds.
>> > +
>> > +This feature will be derived from the below original work:
>> > +https://patchwork.freedesktop.org/patch/349417/
>>
>> This is 1:1 tied to long running compute mode contexts (which in the uapi
>> doc must reference the endless amounts of bikeshed summary we have in the
>> docs about indefinite fences).
>>
>> I'd put this into a new section about compute and userspace memory fences
>> support, with this and the next chapter ...
>> > +
>> > +
>> > +VM_BIND use cases
>> > +==================
>>
>> ... and then make this section here focus entirely on additional vm_bind
>> use-cases that we'll be adding later on. Which doesn't need to go into any
>> details, it's just justification for why we want to build the world on top
>> of vm_bind.
>>
>> > +
>> > +Long running Compute contexts
>> > +------------------------------
>> > +Usage of dma-fence expects that fences complete in a reasonable amount of time.
>> > +Compute on the other hand can be long running. Hence it is appropriate for
>> > +compute to use user/memory fence and dma-fence usage will be limited to
>> > +in-kernel consumption only. This requires an execbuff uapi extension to pass
>> > +in user fence. Compute must opt-in for this mechanism with
>> > +I915_CONTEXT_CREATE_FLAGS_LONG_RUNNING flag during context creation.
>> > +
>> > +The dma-fence based user interfaces like gem_wait ioctl, execbuff out fence
>> > +and implicit dependency setting are not allowed on long running contexts.
>> > +
>> > +Where GPU page faults are not available, the kernel driver, upon buffer
>> > +invalidation, will initiate a suspend (preemption) of the long running context
>> > +with a dma-fence attached to it. Upon completion of that suspend fence, it
>> > +finishes the invalidation, revalidates the BO and then resumes the compute
>> > +context. This is
>> > +done by having a per-context fence (called suspend fence) proxying as
>> > +i915_request fence. This suspend fence is enabled when there is a wait on it,
>> > +which triggers the context preemption.
>> > +
>> > +This is much easier to support with VM_BIND compared to the current heavier
>> > +execbuff path resource attachment.
>>
>> There's a bunch of tricky code around compute mode context support, like
>> the preempt ctx fence (or suspend fence or whatever you want to call it),
>> and the resume work. And I think that code should be shared across
>> drivers.
>>
>> I think the right place to put this is into drm/sched, somewhere attached
>> to the drm_sched_entity structure. I expect i915 folks to collaborate with
>> amd and ideally also get amdkfd to adopt the same thing if possible. At
>> least Christian has mentioned in the past that he's a bit unhappy about
>> how this works.
>>
>> Also drm/sched has dependency tracking, which will be needed to pipeline
>> context resume operations. That needs to be used instead of i915-gem
>> inventing yet another dependency tracking data structure (it already has 3
>> and that's roughly 3 too many).
>>
>> This means compute mode support and userspace memory fences are blocked on
>> the drm/sched conversion, but *eh* add it to the list of reasons for why
>> drm/sched needs to happen.
>>
>> Also since we only have support for compute mode ctx in our internal tree
>> with the guc scheduler backend anyway, and the first conversion target is
>> the guc backend, I don't think this actually holds up a lot of the code.
>>
>> > +Low Latency Submission
>> > +-----------------------
>> > +Allows compute UMD to directly submit GPU jobs instead of through execbuff
>> > +ioctl. VM_BIND allows map/unmap of BOs required for directly submitted jobs.
>>
>> This is really just a special case of compute mode contexts, I think I'd
>> include that in there, but explain better what it requires (i.e. vm_bind
>> not being synchronized against execbuf).
>>
>> > +
>> > +Debugger
>> > +---------
>> > +With the debug event interface, a user space process (debugger) is able to
>> > +keep track of and act upon resources created by another process (debuggee)
>> > +and attached to the GPU via the vm_bind interface.
>> > +
>> > +Mesa/Vulkan
>> > +------------
>> > +VM_BIND can potentially reduce the CPU overhead in Mesa, thus improving
>> > +performance. For Vulkan it should be straightforward to use VM_BIND.
>> > +For Iris implicit buffer tracking must be implemented before we can harness
>> > +VM_BIND benefits. With increasing GPU hardware performance reducing CPU
>> > +overhead becomes more important.
>>
>> Just to clarify, I don't think we can land vm_bind into upstream if it
>> doesn't work 100% for vk. There's a bit much "can" instead of "will" in
>> this section.
>>
>> > +
>> > +Page level hints settings
>> > +--------------------------
>> > +VM_BIND allows setting hints per mapping instead of per BO.
>> > +Possible hints include read-only, placement and atomicity.
>> > +Sub-BO level placement hint will be even more relevant with
>> > +upcoming GPU on-demand page fault support.
>> > +
>> > +Page level Cache/CLOS settings
>> > +-------------------------------
>> > +VM_BIND allows cache/CLOS settings per mapping instead of per BO.
>> > +
>> > +Shared Virtual Memory (SVM) support
>> > +------------------------------------
>> > +VM_BIND interface can be used to map system memory directly (without gem BO
>> > +abstraction) using the HMM interface.
>>
>> Userptr is absent here (and it's not the same as svm, at least on
>> discrete), and this is needed for the initial version since otherwise vk
>> can't use it because we're not at feature parity.
>>
>> Irc discussions by Maarten and Dave came up with the idea that maybe
>> userptr for vm_bind should work _without_ any gem bo as backing storage,
>> since that guarantees that people don't come up with funny ideas like
>> trying to share such a bo across processes or mmap it and other nonsense which
>> just doesn't work.
>>
>> > +
>> > +
>> > +Broader i915 cleanups
>> > +=====================
>> > +Supporting this whole new vm_bind mode of binding, which comes with its own
>> > +use cases and locking requirements, requires proper integration with the
>> > +existing i915 driver. This calls for some broader i915 driver
>> > +cleanups/simplifications for maintainability of the driver going forward.
>> > +Here are a few things that have been identified and are being looked into.
>> > +
>> > +- Make pagetable allocations evictable and manage them similar to VM_BIND
>> > +  mapped objects. Page table pages are similar to persistent mappings of a
>> > +  VM (the differences here are that the page table pages will not have an
>> > +  i915_vma structure and that, after swapping pages back in, the parent page
>> > +  link needs to be updated).
>>
>> See above, but I think this should be included as part of the initial
>> vm_bind push.
>>
>> > +- Remove the vma lookup cache (eb->gem_context->handles_vma). The VM_BIND
>> > +  feature does not use it, and the complexity it brings in is probably more
>> > +  than the performance advantage we get in the legacy execbuff case.
>> > +- Remove vma->open_count counting
>> > +- Remove i915_vma active reference tracking. Instead use underlying BO's
>> > +  dma-resv fence list to determine if a i915_vma is active or not.
>>
>> So this is a complete mess, and really should not exist. I think it needs
>> to be removed before we try to make i915_vma even more complex by adding
>> vm_bind.
>>
>> The other thing I've been pondering here is that vm_bind is really
>> completely different from legacy vm structures for a lot of reasons:
>> - no relocation or softpin handling, which means vm_bind has no reason to
>>   ever look at the i915_vma structure in execbuf code. Unfortunately
>>   execbuf has been rewritten to be vma instead of obj centric, so it's a
>>   100% mismatch
>>
>> - vm_bind never has to manage any vm lru. Legacy execbuf has to maintain
>>   that because the kernel manages the virtual address space fully. Again
>>   ideally that entire vma_move_to_active code and everything related to it
>>   would simply not exist.
>>
>> - similar on the eviction side, the rules are quite different: For vm_bind
>>   we never tear down the vma, instead it's just moved to the list of
>>   evicted vma. Legacy vm have no need for all these additional lists, so
>>   another huge confusion.
>>
>> - if the refcount is done correctly for vm_bind we wouldn't need the
>>   tricky code in the bo close paths. Unfortunately legacy vm with
>>   relocations and softpin require that vma are only a weak reference, so
>>   that cannot be removed.
>>
>> - there's also a ton of special cases for ggtt handling, like the
>>   different views (for display, partial views for mmap), but also the
>>   gen2/3 alignment and padding requirements which vm_bind never needs.
>>
>> I think the right thing here is to massively split the implementation
>> behind some solid vm/vma abstraction, with a base class for vm and vma
>> which _only_ has the pieces which both vm_bind and the legacy vm stuff
>> needs. But it's a bit tricky to get there. I think a workable path would
>> be:
>> - Add a new base class to both i915_address_space and i915_vma, which
>>   starts out empty.
>>
>> - As vm_bind code lands, move things that vm_bind code needs into these
>>   base classes
>>
>> - The goal should be that these base classes are a stand-alone library
>>   that other drivers could reuse. Like we've done with the buddy
>>   allocator, which first moved from i915-gem to i915-ttm, and which amd
>>   now moved to drm/ttm for reuse by amdgpu. Ideally other drivers
>>   interested in adding something like vm_bind should be involved from the
>>   start (or maybe the entire thing reused in amdgpu, they're looking at
>>   vk sparse binding support too or at least have perf issues I think).
>>
>> - Locking must be the same across all implementations, otherwise it's
>>   really not an abstraction. i915 screwed this up terribly by having
>>   different locking rules for ppgtt and ggtt, which is just nonsense.
>>
>> - The legacy specific code needs to be extracted as much as possible and
>>   shoved into separate files. In execbuf this means we need to get back to
>>   object centric flow, and the slowpaths need to become a lot simpler
>>   again (Maarten has cleaned up some of this, but there's still a silly
>>   amount of hacks in there with funny layering).
>>
>> - I think stuff like the vma eviction details (list movement and
>>   locking and refcounting of the underlying object) should end up in
>>   these shared base classes as well.
>>
>> > +
>> > +These can be worked upon after intitial vm_bind support is added.
>>
>> I don't think that works, given how badly i915-gem team screwed up in
>> other places. And those places had to be fixed by adopting shared code
>> like ttm. Plus there's already a huge unfulfilled promise pending with the
>> drm/sched conversion, i915-gem team is clearly deeply in the red here :-/
>>
>> Cheers, Daniel
>>
>> > +
>> > +
>> > +UAPI
>> > +=====
>> > +Uapi definition can be found here:
>> > +.. kernel-doc:: Documentation/gpu/rfc/i915_vm_bind.h
>> > diff --git a/Documentation/gpu/rfc/index.rst b/Documentation/gpu/rfc/index.rst
>> > index 91e93a705230..7d10c36b268d 100644
>> > --- a/Documentation/gpu/rfc/index.rst
>> > +++ b/Documentation/gpu/rfc/index.rst
>> > @@ -23,3 +23,7 @@ host such documentation:
>> >  .. toctree::
>> >
>> >      i915_scheduler.rst
>> > +
>> > +.. toctree::
>> > +
>> > +    i915_vm_bind.rst
>> > --
>> > 2.21.0.rc0.32.g243a4c7e27
>> >
>>
>> --
>> Daniel Vetter
>> Software Engineer, Intel Corporation
>> http://blog.ffwll.ch
>
>
>
>-- 
>Daniel Vetter
>Software Engineer, Intel Corporation
>http://blog.ffwll.ch


* Re: [Intel-gfx] [RFC v2 1/2] drm/doc/rfc: VM_BIND feature design document
@ 2022-04-20 22:50         ` Niranjana Vishwanathapura
  0 siblings, 0 replies; 31+ messages in thread
From: Niranjana Vishwanathapura @ 2022-04-20 22:50 UTC (permalink / raw)
  To: Daniel Vetter
  Cc: chris.p.wilson, dri-devel, Dave Airlie, intel-gfx, Daniel Stone,
	Ben Skeggs, daniel.vetter, thomas.hellstrom,
	Christian König

On Thu, Mar 31, 2022 at 01:37:08PM +0200, Daniel Vetter wrote:
>One thing I've forgotten, since it's only hinted at here: If/when we
>switch tlb flushing from the current dumb&synchronous implementation
>we now have in i915 in upstream to one with batching using dma_fence,
>then I think that should be something which is done with a small
>helper library of shared code too. The batching is somewhat tricky,
>and you need to make sure you put the fence into the right
>dma_resv_usage slot, and the trick of replacing the vm fence with a
>tlb flush fence is also a good reason to share the code so we only
>have it once.
>
>Christian's recent work also has some prep work for this already with
>the fence replacing trick.

Sure, but this optimization is not required for initial vm_bind support
to land, right? We can look at it soon after that. Is that ok?
I have made a reference to this TLB flush batching work in the rst file.

Niranjana

>-Daniel
>
>On Thu, 31 Mar 2022 at 10:28, Daniel Vetter <daniel@ffwll.ch> wrote:
>> Adding a pile of people who've expressed interest in vm_bind for their
>> drivers.
>>
>> Also note to the intel folks: This is largely written with me having my
>> subsystem co-maintainer hat on, i.e. what I think is the right thing to do
>> here for the subsystem at large. There is substantial rework involved
>> here, but it's not any different from i915 adopting ttm or i915 adpoting
>> drm/sched, and I do think this stuff needs to happen in one form or
>> another.
>>
>> On Mon, Mar 07, 2022 at 12:31:45PM -0800, Niranjana Vishwanathapura wrote:
>> > VM_BIND design document with description of intended use cases.
>> >
>> > Signed-off-by: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>
>> > ---
>> >  Documentation/gpu/rfc/i915_vm_bind.rst | 210 +++++++++++++++++++++++++
>> >  Documentation/gpu/rfc/index.rst        |   4 +
>> >  2 files changed, 214 insertions(+)
>> >  create mode 100644 Documentation/gpu/rfc/i915_vm_bind.rst
>> >
>> > diff --git a/Documentation/gpu/rfc/i915_vm_bind.rst b/Documentation/gpu/rfc/i915_vm_bind.rst
>> > new file mode 100644
>> > index 000000000000..cdc6bb25b942
>> > --- /dev/null
>> > +++ b/Documentation/gpu/rfc/i915_vm_bind.rst
>> > @@ -0,0 +1,210 @@
>> > +==========================================
>> > +I915 VM_BIND feature design and use cases
>> > +==========================================
>> > +
>> > +VM_BIND feature
>> > +================
>> > +DRM_I915_GEM_VM_BIND/UNBIND ioctls allow the UMD to bind/unbind GEM buffer
>> > +objects (BOs) or sections of a BO at specified GPU virtual addresses on
>> > +a specified address space (VM).
>> > +
>> > +These mappings (also referred to as persistent mappings) will be persistent
>> > +across multiple GPU submissions (execbuff) issued by the UMD, without the
>> > +user having to provide a list of all required mappings during each submission
>> > +(as required by the older execbuff mode).
>> > +
>> > +The VM_BIND ioctl defers binding the mappings until the next execbuff
>> > +submission where they are required, or binds immediately if the
>> > +I915_GEM_VM_BIND_IMMEDIATE flag is set (useful if a mapping is required
>> > +for an active context).
>>
>> So this is a screw-up I've done, and for upstream I think we need to fix
>> it: Implicit sync is bad, and it's also still a bad idea for vm_bind, and
>> I was wrong suggesting we should do this a few years back when we kicked
>> this off internally :-(
>>
>> What I think we need is just always VM_BIND_IMMEDIATE mode, and then a few
>> things on top:
>> - in and out fences, like with execbuf, to allow userspace to sync with
>>   execbuf as needed
>> - for compute-mode context this means userspace memory fences
>> - for legacy context this means a timeline syncobj in drm_syncobj
>>
>> No sync_file or anything else like this at all. This means a bunch of
>> work, but also it'll have benefits because it means we should be able to
>> use exactly the same code paths and logic for both compute and for legacy
>>   context, because drm_syncobj supports future fence semantics.
>>
>> Also on the implementation side we still need to install dma_fence to the
>> various dma_resv, and for this we need the new dma_resv_usage series from
>> Christian König first. vm_bind fences can then use the USAGE_BOOKKEEPING
>> flag to make sure they never result in an oversync issue with execbuf. I
>> don't think trying to land vm_bind without that prep work in
>> dma_resv_usage makes sense.
>>
>> Also as soon as dma_resv_usage has landed there's a few cleanups we should
>> do in i915:
>> - ttm bo moving code should probably simplify a bit (and maybe more of the
>>   code should be pushed as helpers into ttm)
>> - clflush code should be moved over to using USAGE_KERNEL and the various
>>   hacks and special cases should be ditched. See df94fd05e69e ("drm/i915:
>>   expand on the kernel-doc for cache_dirty") for a bit more context
>>
>> This is still not yet enough, since if a vm_bind races with an eviction we
>> might stall on the new buffers being readied first before the context can
>> continue. This needs some care to make sure that vma which aren't fully
>> bound yet are on a separate list, and vma which are marked for unbinding
>> are removed from the main working set list as soon as possible.
>>
>> All of these things are relevant for the uapi semantics, which means
>> - they need to be documented in the uapi kerneldoc, ideally with example
>>   flows
>> - umd need to ack this
>>
>> The other thing here is the async/nonblocking path. I think we still need
>> that one, but again it should not sync with anything going on in execbuf,
>> but simply execute the ioctl code in a kernel thread. The idea here is
>> that this works like a special gpu engine, so that compute and vk can
>> schedule bindings interleaved with rendering. This should be enough to get
>> a performant vk sparse binding/textures implementation.
>>
>> But I'm not entirely sure on this one, so this definitely needs acks from
>> umds.
>>
>> > +VM_BIND feature is advertised to user via I915_PARAM_HAS_VM_BIND.
>> > +User has to opt-in for VM_BIND mode of binding for an address space (VM)
>> > +during VM creation time via I915_VM_CREATE_FLAGS_USE_VM_BIND extension.
>> > +A VM in VM_BIND mode will not support older execbuff mode of binding.
>> > +
>> > +UMDs can still send BOs of these persistent mappings in execlist of execbuff
>> > +for specifying BO dependencies (implicit fencing) and to use BO as a batch,
>> > +but those BOs should be mapped ahead via vm_bind ioctl.
>>
>> should or must?
>>
>> Also I'm not really sure that's a great interface. The batchbuffer really
>> only needs to be an address, so maybe all we need is an extension to
>> supply an u64 batchbuffer address instead of trying to retrofit this into
>> an unfitting current uapi.
>>
>> And for implicit sync there's two things:
>> - for vk I think the right uapi is the dma-buf fence import/export ioctls
>>   from Jason Ekstrand. I think we should land that first instead of
>>   hacking funny concepts together
>> - for gl the dma-buf import/export might not be fast enough, since gl
>>   needs to do a _lot_ of implicit sync. There we might need to use the
>>   execbuffer buffer list, but then we should have extremely clear uapi
>>   rules which disallow _everything_ except setting the explicit sync uapi
>>
>> Again all this stuff needs to be documented in detail in the kerneldoc
>> uapi spec.
>>
>> > +VM_BIND features include,
>> > +- Multiple Virtual Address (VA) mappings can map to the same physical pages
>> > +  of an object (aliasing).
>> > +- VA mapping can map to a partial section of the BO (partial binding).
>> > +- Support capture of persistent mappings in the dump upon GPU error.
>> > +- TLB is flushed upon unbind completion. Batching of TLB flushes in some
>> > +  use cases will be helpful.
>> > +- Asynchronous vm_bind and vm_unbind support.
>> > +- VM_BIND uses user/memory fence mechanism for signaling bind completion
>> > +  and for signaling batch completion in long running contexts (explained
>> > +  below).
>>
>> This should all be in the kerneldoc.
>>
>> > +VM_PRIVATE objects
>> > +------------------
>> > +By default, BOs can be mapped on multiple VMs and can also be dma-buf
>> > +exported. Hence these BOs are referred to as Shared BOs.
>> > +During each execbuff submission, the request fence must be added to the
>> > +dma-resv fence list of all shared BOs mapped on the VM.
>> > +
>> > +The VM_BIND feature introduces an optimization where the user can create a BO
>> > +which is private to a specified VM via the I915_GEM_CREATE_EXT_VM_PRIVATE flag
>> > +during BO creation. Unlike Shared BOs, these VM private BOs can only be mapped on
>> > +the VM they are private to and can't be dma-buf exported.
>> > +All private BOs of a VM share the dma-resv object. Hence during each execbuff
>> > +submission, they need only one dma-resv fence list updated. Thus the fast
>> > +path (where required mappings are already bound) submission latency is O(1)
>> > +w.r.t the number of VM private BOs.
>>
>> Two things:
>>
>> - I think the above is required for initial vm_bind for vk, it kinda
>>   doesn't make much sense without that, and will allow us to match amdgpu
>>   and radeonsi
>>
>> - Christian König just landed ttm bulk lru helpers, and I think we need to
>>   use those. This means vm_bind will only work with the ttm backend, but
>>   that's what we have for the big dgpu where vm_bind helps more in terms
>>   of performance, and the igfx conversion to ttm is already going on.
>>
>> Furthermore the i915 shrinker lru has stopped being an lru, so I think
>> that should also be moved over to the ttm lru in some fashion to make sure
>> we once again have a reasonable and consistent memory aging and reclaim
>> architecture. The current code is just too much of a complete mess.
>>
>> And since this is all fairly integral to how the code arch works I don't
>> think merging a different version which isn't based on ttm bulk lru
>> helpers makes sense.
>>
>> Also I do think the page table lru handling needs to be included here,
>> because that's another complete hand-rolled separate world for not much
>> good reasons. I guess that can happen in parallel with the initial vm_bind
>> bring-up, but it needs to be completed by the time we add the features
>> beyond the initial support needed for vk.
>>
>> > +VM_BIND locking hierarchy
>> > +-------------------------
>> > +VM_BIND locking order is as below.
>> > +
>> > +1) A vm_bind mutex will protect vm_bind lists. This lock is taken in vm_bind/
>> > +   vm_unbind ioctl calls, in the execbuff path and while releasing the mapping.
>> > +
>> > +   In future, when GPU page faults are supported, we can potentially use a
>> > +   rwsem instead, so that multiple pagefault handlers can take the read side
>> > +   lock to lookup the mapping and hence can run in parallel.
>> > +
>> > +2) The BO's dma-resv lock will protect i915_vma state and needs to be held
>> > +   while binding a vma and while updating dma-resv fence list of a BO.
>> > +   The private BOs of a VM will all share a dma-resv object.
>> > +
>> > +   This lock is held in vm_bind call for immediate binding, during vm_unbind
>> > +   call for unbinding and during execbuff path for binding the mapping and
>> > +   updating the dma-resv fence list of the BO.
>> > +
>> > +3) Spinlock/s to protect some of the VM's lists.
>> > +
>> > +We will also need support for bulk LRU movement of persistent mappings to
>> > +avoid additional latencies in execbuff path.
>>
>> This needs more detail and explanation of how each level is required. Also
>> the shared dma_resv for VM_PRIVATE objects is kinda important to explain.
>>
>> Like "some of the VM's lists" explains pretty much nothing.
>>
>> > +
>> > +GPU page faults
>> > +----------------
>> > +Both older execbuff mode and the newer VM_BIND mode of binding will require
>> > +using dma-fence to ensure residency.
>> > +In future when GPU page faults are supported, no dma-fence usage is required
>> > +as residency is purely managed by installing and removing/invalidating ptes.
>>
>> This is a bit confusing. I think one part of this should be moved into the
>> section with future vm_bind use-cases (we're not going to support page
>> faults with legacy softpin or even worse, relocations). The locking
>> discussion should be part of the much longer list of uses cases that
>> motivate the locking design.
>>
>> > +
>> > +
>> > +User/Memory Fence
>> > +==================
>> > +The idea is to take a user specified virtual address and install an interrupt
>> > +handler to wake up the current task when the memory location passes the user
>> > +supplied filter.
>> > +
>> > +A User/Memory fence is an <address, value> pair. To signal the user fence,
>> > +the specified value will be written at the specified virtual address and
>> > +the waiting process woken up. User can wait on a user fence with the
>> > +gem_wait_user_fence ioctl.
>> > +
>> > +It also allows the user to emit their own MI_FLUSH/PIPE_CONTROL notify
>> > +interrupt within their batches after updating the value to have sub-batch
>> > +precision on the wakeup. Each batch can signal a user fence to indicate
>> > +the completion of the next-level batch. The completion of the very first-level batch
>> > +needs to be signaled by the command streamer. The user must provide the
>> > +user/memory fence for this via the DRM_I915_GEM_EXECBUFFER_EXT_USER_FENCE
>> > +extension of execbuff ioctl, so that KMD can setup the command streamer to
>> > +signal it.
>> > +
>> > +User/Memory fence can also be supplied to the kernel driver to signal/wake up
>> > +the user process after completion of an asynchronous operation.
>> > +
>> > +When the VM_BIND ioctl is provided with a user/memory fence via the
>> > +I915_VM_BIND_EXT_USER_FENCE extension, it will be signaled upon the completion
>> > +of binding of that mapping. All async binds/unbinds are serialized, hence
>> > +signaling of the user/memory fence also indicates the completion of all previous
>> > +binds/unbinds.
>> > +
>> > +This feature will be derived from the below original work:
>> > +https://patchwork.freedesktop.org/patch/349417/
>>
>> This is 1:1 tied to long running compute mode contexts (which in the uapi
>> doc must reference the endless amounts of bikeshed summary we have in the
>> docs about indefinite fences).
>>
>> I'd put this into a new section about compute and userspace memory fences
>> support, with this and the next chapter ...
>> > +
>> > +
>> > +VM_BIND use cases
>> > +==================
>>
>> ... and then make this section here focus entirely on additional vm_bind
>> use-cases that we'll be adding later on. Which doesn't need to go into any
>> details, it's just justification for why we want to build the world on top
>> of vm_bind.
>>
>> > +
>> > +Long running Compute contexts
>> > +------------------------------
>> > +Usage of dma-fences expects that they complete in a reasonable amount of time.
>> > +Compute on the other hand can be long running. Hence it is appropriate for
>> > +compute to use user/memory fence and dma-fence usage will be limited to
>> > +in-kernel consumption only. This requires an execbuff uapi extension to pass
>> > +in user fence. Compute must opt-in for this mechanism with
>> > +I915_CONTEXT_CREATE_FLAGS_LONG_RUNNING flag during context creation.
>> > +
>> > +The dma-fence based user interfaces like gem_wait ioctl, execbuff out fence
>> > +and implicit dependency setting are not allowed on long running contexts.
>> > +
>> > +Where GPU page faults are not available, kernel driver upon buffer invalidation
>> > +will initiate a suspend (preemption) of long running context with a dma-fence
>> > +attached to it. And upon completion of that suspend fence, finish the
>> > +invalidation, revalidate the BO and then resume the compute context. This is
>> > +done by having a per-context fence (called suspend fence) proxying as
>> > +i915_request fence. This suspend fence is enabled when there is a wait on it,
>> > +which triggers the context preemption.
>> > +
>> > +This is much easier to support with VM_BIND compared to the current heavier
>> > +execbuff path resource attachment.
>>
>> There's a bunch of tricky code around compute mode context support, like
>> the preempt ctx fence (or suspend fence or whatever you want to call it),
>> and the resume work. And I think that code should be shared across
>> drivers.
>>
>> I think the right place to put this is into drm/sched, somewhere attached
>> to the drm_sched_entity structure. I expect i915 folks to collaborate with
>> amd and ideally also get amdkfd to adopt the same thing if possible. At
>> least Christian has mentioned in the past that he's a bit unhappy about
>> how this works.
>>
>> Also drm/sched has dependency tracking, which will be needed to pipeline
>> context resume operations. That needs to be used instead of i915-gem
>> inventing yet another dependency tracking data structure (it already has 3
>> and that's roughly 3 too many).
>>
>> This means compute mode support and userspace memory fences are blocked on
>> the drm/sched conversion, but *eh* add it to the list of reasons for why
>> drm/sched needs to happen.
>>
>> Also since we only have support for compute mode ctx in our internal tree
>> with the guc scheduler backend anyway, and the first conversion target is
>> the guc backend, I don't think this actually holds up a lot of the code.
>>
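
As an aside, the suspend-fence semantics described in the quoted section above ("enabled when there is a wait on it, which triggers the context preemption") can be modelled roughly as follows. This is a single-threaded toy illustration only; all names are invented for this sketch and are not i915 or drm/sched code.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Toy model of the per-context "suspend fence": it proxies as the
 * request fence of a long-running context, but only the first waiter
 * arms it, and arming it is what kicks off preemption of the context.
 */
struct toy_suspend_fence {
	bool enabled;    /* armed by the first waiter */
	bool preempted;  /* context suspension was triggered */
};

/* Called when somebody starts waiting on the fence. */
static void suspend_fence_enable_signaling(struct toy_suspend_fence *f)
{
	if (!f->enabled) {
		f->enabled = true;
		f->preempted = true; /* model: preempt the context now */
	}
}
```

Until a wait arrives, the context keeps running; the invalidation work only proceeds once the suspend fence has been armed and the preemption completes.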
>> > +Low Latency Submission
>> > +-----------------------
>> > +Allows compute UMD to directly submit GPU jobs instead of through execbuff
>> > +ioctl. VM_BIND allows map/unmap of BOs required for directly submitted jobs.
>>
>> This is really just a special case of compute mode contexts, I think I'd
>> include that in there, but explain better what it requires (i.e. vm_bind
>> not being synchronized against execbuf).
>>
>> > +
>> > +Debugger
>> > +---------
>> > +With debug event interface user space process (debugger) is able to keep track
>> > +of and act upon resources created by another process (debuggee) and attached
>> > +to GPU via vm_bind interface.
>> > +
>> > +Mesa/Vulkan
>> > +------------
>> > +VM_BIND can potentially reduce the CPU-overhead in Mesa thus improving
>> > +performance. For Vulkan it should be straightforward to use VM_BIND.
>> > +For Iris implicit buffer tracking must be implemented before we can harness
>> > +VM_BIND benefits. With increasing GPU hardware performance reducing CPU
>> > +overhead becomes more important.
>>
>> Just to clarify, I don't think we can land vm_bind into upstream if it
>> doesn't work 100% for vk. There's a bit much "can" instead of "will in
>> this section".
>>
>> > +
>> > +Page level hints settings
>> > +--------------------------
>> > +VM_BIND allows any hints setting per mapping instead of per BO.
>> > +Possible hints include read-only, placement and atomicity.
>> > +Sub-BO level placement hint will be even more relevant with
>> > +upcoming GPU on-demand page fault support.
>> > +
>> > +Page level Cache/CLOS settings
>> > +-------------------------------
>> > +VM_BIND allows cache/CLOS settings per mapping instead of per BO.
>> > +
>> > +Shared Virtual Memory (SVM) support
>> > +------------------------------------
>> > +VM_BIND interface can be used to map system memory directly (without gem BO
>> > +abstraction) using the HMM interface.
>>
>> Userptr is absent here (and it's not the same as svm, at least on
>> discrete), and this is needed for the initial version since otherwise vk
>> can't use it because we're not at feature parity.
>>
>> Irc discussions by Maarten and Dave came up with the idea that maybe
>> userptr for vm_bind should work _without_ any gem bo as backing storage,
>> since that guarantees that people don't come up with funny ideas like
>> trying to share such bo across process or mmap it and other nonsense which
>> just doesn't work.
>>
>> > +
>> > +
>> > +Broader i915 cleanups
>> > +=====================
>> > +Supporting this whole new vm_bind mode of binding which comes with its own
>> > +usecases to support and the locking requirements requires proper integration
>> > +with the existing i915 driver. This calls for some broader i915 driver
>> > +cleanups/simplifications for maintainability of the driver going forward.
>> > +Here are a few things that have been identified and are being looked into.
>> > +
>> > +- Make pagetable allocations evictable and manage them similar to VM_BIND
>> > +  mapped objects. Page table pages are similar to persistent mappings of a
>> > +  VM (the difference here is that page table pages will not
>> > +  have an i915_vma structure, and after swapping pages back in, the parent page
>> > +  link needs to be updated).
>>
>> See above, but I think this should be included as part of the initial
>> vm_bind push.
>>
>> > +- Remove vma lookup cache (eb->gem_context->handles_vma). The VM_BIND feature
>> > +  does not use it, and the complexity it brings in is probably more than the
>> > +  performance advantage we get in the legacy execbuff case.
>> > +- Remove vma->open_count counting
>> > +- Remove i915_vma active reference tracking. Instead use underlying BO's
>> > +  dma-resv fence list to determine if a i915_vma is active or not.
>>
>> So this is a complete mess, and really should not exist. I think it needs
>> to be removed before we try to make i915_vma even more complex by adding
>> vm_bind.
>>
>> The other thing I've been pondering here is that vm_bind is really
>> completely different from legacy vm structures for a lot of reasons:
>> - no relocation or softpin handling, which means vm_bind has no reason to
>>   ever look at the i915_vma structure in execbuf code. Unfortunately
>>   execbuf has been rewritten to be vma instead of obj centric, so it's a
>>   100% mismatch
>>
>> - vm_bind never has to manage any vm lru. Legacy execbuf has to maintain
>>   that because the kernel manages the virtual address space fully. Again
>>   ideally that entire vma_move_to_active code and everything related to it
>>   would simply not exist.
>>
>> - similar on the eviction side, the rules are quite different: For vm_bind
>>   we never tear down the vma, instead it's just moved to the list of
>>   evicted vma. Legacy vm have no need for all these additional lists, so
>>   another huge confusion.
>>
>> - if the refcount is done correctly for vm_bind we wouldn't need the
>>   tricky code in the bo close paths. Unfortunately legacy vm with
>>   relocations and softpin require that vma are only a weak reference, so
>>   that cannot be removed.
>>
>> - there's also a ton of special cases for ggtt handling, like the
>>   different views (for display, partial views for mmap), but also the
>>   gen2/3 alignment and padding requirements which vm_bind never needs.
>>
>> I think the right thing here is to massively split the implementation
>> behind some solid vm/vma abstraction, with a base class for vm and vma
>> which _only_ has the pieces which both vm_bind and the legacy vm stuff
>> needs. But it's a bit tricky to get there. I think a workable path would
>> be:
>> - Add a new base class to both i915_address_space and i915_vma, which
>>   starts out empty.
>>
>> - As vm_bind code lands, move things that vm_bind code needs into these
>>   base classes
>>
>> - The goal should be that these base classes are a stand-alone library
>>   that other drivers could reuse. Like we've done with the buddy
>>   allocator, which first moved from i915-gem to i915-ttm, and which amd
>>   now moved to drm/ttm for reuse by amdgpu. Ideally other drivers
>>   interested in adding something like vm_bind should be involved from the
>>   start (or maybe the entire thing reused in amdgpu, they're looking at
>>   vk sparse binding support too or at least have perf issues I think).
>>
>> - Locking must be the same across all implementations, otherwise it's
>>   really not an abstraction. i915 screwed this up terribly by having
>>   different locking rules for ppgtt and ggtt, which is just nonsense.
>>
>> - The legacy specific code needs to be extracted as much as possible and
>>   shoved into separate files. In execbuf this means we need to get back to
>>   object centric flow, and the slowpaths need to become a lot simpler
>>   again (Maarten has cleaned up some of this, but there's still a silly
>>   amount of hacks in there with funny layering).
>>
>> - I think if stuff like the vma eviction details (list movement and
>>   locking and refcounting of the underlying object)
>>
>> > +
>> > +These can be worked upon after initial vm_bind support is added.
>>
>> I don't think that works, given how badly i915-gem team screwed up in
>> other places. And those places had to be fixed by adopting shared code
>> like ttm. Plus there's already a huge unfulfilled promise pending with the
>> drm/sched conversion, i915-gem team is clearly deeply in the red here :-/
>>
>> Cheers, Daniel
>>
>> > +
>> > +
>> > +UAPI
>> > +=====
>> > +Uapi definition can be found here:
>> > +.. kernel-doc:: Documentation/gpu/rfc/i915_vm_bind.h
>> > diff --git a/Documentation/gpu/rfc/index.rst b/Documentation/gpu/rfc/index.rst
>> > index 91e93a705230..7d10c36b268d 100644
>> > --- a/Documentation/gpu/rfc/index.rst
>> > +++ b/Documentation/gpu/rfc/index.rst
>> > @@ -23,3 +23,7 @@ host such documentation:
>> >  .. toctree::
>> >
>> >      i915_scheduler.rst
>> > +
>> > +.. toctree::
>> > +
>> > +    i915_vm_bind.rst
>> > --
>> > 2.21.0.rc0.32.g243a4c7e27
>> >
>>
>> --
>> Daniel Vetter
>> Software Engineer, Intel Corporation
>> http://blog.ffwll.ch
>
>
>
>-- 
>Daniel Vetter
>Software Engineer, Intel Corporation
>http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC v2 1/2] drm/doc/rfc: VM_BIND feature design document
  2022-03-09 15:58     ` [Intel-gfx] " Alex Deucher
@ 2022-04-21  2:08       ` Niranjana Vishwanathapura
  -1 siblings, 0 replies; 31+ messages in thread
From: Niranjana Vishwanathapura @ 2022-04-21  2:08 UTC (permalink / raw)
  To: Alex Deucher
  Cc: Daniel Vetter, Intel Graphics Development, Thomas Hellstrom,
	chris.p.wilson, Maling list - DRI developers

On Wed, Mar 09, 2022 at 10:58:09AM -0500, Alex Deucher wrote:
>On Mon, Mar 7, 2022 at 3:30 PM Niranjana Vishwanathapura
><niranjana.vishwanathapura@intel.com> wrote:
>>
>> VM_BIND design document with description of intended use cases.
>>
>> Signed-off-by: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>
>> ---
>>  Documentation/gpu/rfc/i915_vm_bind.rst | 210 +++++++++++++++++++++++++
>>  Documentation/gpu/rfc/index.rst        |   4 +
>>  2 files changed, 214 insertions(+)
>>  create mode 100644 Documentation/gpu/rfc/i915_vm_bind.rst
>>
>> diff --git a/Documentation/gpu/rfc/i915_vm_bind.rst b/Documentation/gpu/rfc/i915_vm_bind.rst
>> new file mode 100644
>> index 000000000000..cdc6bb25b942
>> --- /dev/null
>> +++ b/Documentation/gpu/rfc/i915_vm_bind.rst
>> @@ -0,0 +1,210 @@
>> +==========================================
>> +I915 VM_BIND feature design and use cases
>> +==========================================
>> +
>> +VM_BIND feature
>> +================
>> +DRM_I915_GEM_VM_BIND/UNBIND ioctls allow a UMD to bind/unbind GEM buffer
>> +objects (BOs) or sections of BOs at specified GPU virtual addresses on
>> +a specified address space (VM).
>> +
>> +These mappings (also referred to as persistent mappings) will be persistent
>> +across multiple GPU submissions (execbuff) issued by the UMD, without the user
>> +having to provide a list of all required mappings during each submission
>> +(as required by older execbuff mode).
>> +
>> +VM_BIND ioctl defers binding the mappings until the next execbuff submission
>> +where it will be required, or immediately if I915_GEM_VM_BIND_IMMEDIATE
>> +flag is set (useful if mapping is required for an active context).
>> +
>> +VM_BIND feature is advertised to user via I915_PARAM_HAS_VM_BIND.
>> +User has to opt-in for VM_BIND mode of binding for an address space (VM)
>> +during VM creation time via I915_VM_CREATE_FLAGS_USE_VM_BIND extension.
>> +A VM in VM_BIND mode will not support older execbuff mode of binding.
>> +
>> +UMDs can still send BOs of these persistent mappings in the execlist of execbuff
>> +for specifying BO dependencies (implicit fencing) and to use a BO as a batch,
>> +but those BOs should be mapped ahead via vm_bind ioctl.
>> +
>> +VM_BIND features include,
>> +- Multiple Virtual Address (VA) mappings can map to the same physical pages
>> +  of an object (aliasing).
>> +- VA mapping can map to a partial section of the BO (partial binding).
>> +- Support capture of persistent mappings in the dump upon GPU error.
>> +- TLB is flushed upon unbind completion. Batching of TLB flushes in some
>> +  usecases will be helpful.
>> +- Asynchronous vm_bind and vm_unbind support.
>> +- VM_BIND uses user/memory fence mechanism for signaling bind completion
>> +  and for signaling batch completion in long running contexts (explained
>> +  below).
>> +
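
To illustrate the partial-binding feature listed above (a VA mapping covering only a section of a BO), a minimal userspace-style sanity check of a requested (offset, length) range might look like this. The function name, the fixed 4 KiB page size, and the exact rules are assumptions for this sketch, not the proposed kernel implementation.

```c
#include <stdbool.h>
#include <stdint.h>

#define PAGE_SIZE 4096u

/*
 * A partial bind maps [offset, offset + length) of the BO, so the range
 * must be non-empty, page aligned, and fall entirely within the object.
 */
static bool vm_bind_range_valid(uint64_t bo_size, uint64_t offset,
				uint64_t length)
{
	if (!length || offset % PAGE_SIZE || length % PAGE_SIZE)
		return false;
	/* overflow-safe check that the range stays inside the BO */
	return offset <= bo_size && length <= bo_size - offset;
}
```

Aliasing then simply means two such valid ranges in different VAs resolving to the same physical pages of the object.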
>> +VM_PRIVATE objects
>> +------------------
>> +By default, BOs can be mapped on multiple VMs and can also be dma-buf
>> +exported. Hence these BOs are referred to as Shared BOs.
>> +During each execbuff submission, the request fence must be added to the
>> +dma-resv fence list of all shared BOs mapped on the VM.
>> +
>> +VM_BIND feature introduces an optimization where user can create BO which
>> +is private to a specified VM via I915_GEM_CREATE_EXT_VM_PRIVATE flag during
>> +BO creation. Unlike Shared BOs, these VM private BOs can only be mapped on
>> +the VM they are private to and can't be dma-buf exported.
>> +All private BOs of a VM share the dma-resv object. Hence during each execbuff
>> +submission, they need only one dma-resv fence list updated. Thus the fast
>> +path (where required mappings are already bound) submission latency is O(1)
>> +w.r.t the number of VM private BOs.
>> +
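
A toy model of why the shared dma-resv object described above makes the fast path O(1): if every VM-private BO points at the same reservation object, a submission only has to update one fence list however many private BOs are mapped. This is purely illustrative counting code with invented names, not actual i915 structures.

```c
#include <assert.h>

/* Stand-ins for a dma-resv object and a BO referencing one. */
struct toy_resv { unsigned int fence_updates; };
struct toy_bo { struct toy_resv *resv; };

/*
 * Add the request fence to each BO's reservation object, skipping
 * reservation objects already updated in this submission. Returns the
 * number of distinct fence-list updates performed.
 */
static unsigned int submit_update_resvs(struct toy_bo *bos, unsigned int n)
{
	unsigned int i, j, updates = 0;

	for (i = 0; i < n; i++) {
		int seen = 0;

		for (j = 0; j < i; j++)
			if (bos[j].resv == bos[i].resv)
				seen = 1;
		if (!seen) {
			bos[i].resv->fence_updates++;
			updates++;
		}
	}
	return updates;
}
```

With N shared BOs this performs N updates; with N private BOs sharing one resv it performs exactly one.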
>> +VM_BIND locking hierarchy
>> +-------------------------
>> +VM_BIND locking order is as below.
>> +
>> +1) A vm_bind mutex will protect vm_bind lists. This lock is taken in vm_bind/
>> +   vm_unbind ioctl calls, in the execbuff path and while releasing the mapping.
>> +
>> +   In future, when GPU page faults are supported, we can potentially use a
>> +   rwsem instead, so that multiple pagefault handlers can take the read side
>> +   lock to lookup the mapping and hence can run in parallel.
>> +
>> +2) The BO's dma-resv lock will protect i915_vma state and needs to be held
>> +   while binding a vma and while updating dma-resv fence list of a BO.
>> +   The private BOs of a VM will all share a dma-resv object.
>> +
>> +   This lock is held in vm_bind call for immediate binding, during vm_unbind
>> +   call for unbinding and during execbuff path for binding the mapping and
>> +   updating the dma-resv fence list of the BO.
>> +
>> +3) Spinlock/s to protect some of the VM's lists.
>> +
>> +We will also need support for bulk LRU movement of persistent mappings to
>> +avoid additional latencies in execbuff path.
>> +
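
The three-level lock order listed above (1: vm_bind mutex, 2: BO dma-resv lock, 3: VM list spinlock) can be sketched as a tiny single-threaded order checker. The helper names and the integer-level scheme are invented for illustration; real kernel code would rely on lockdep annotations instead.

```c
#include <assert.h>

enum bind_lock_level {
	LOCK_NONE = 0,
	LOCK_VM_BIND_MUTEX = 1,   /* 1) protects vm_bind lists */
	LOCK_BO_DMA_RESV = 2,     /* 2) protects i915_vma state, resv fences */
	LOCK_VM_LIST_SPINLOCK = 3 /* 3) protects some of the VM's lists */
};

static int current_level; /* deepest level held (toy, single thread) */

/* Locks must always be taken in strictly increasing level. */
static int take_lock(enum bind_lock_level level)
{
	if (level <= current_level)
		return -1; /* would violate the documented order */
	current_level = level;
	return 0;
}

static void drop_locks(void)
{
	current_level = LOCK_NONE;
}
```

Taking the dma-resv lock and then trying to take the vm_bind mutex would be flagged as an inversion, which is exactly what the documented order forbids.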
>> +GPU page faults
>> +----------------
>> +Both older execbuff mode and the newer VM_BIND mode of binding will require
>> +using dma-fence to ensure residency.
>> +In future when GPU page faults are supported, no dma-fence usage is required
>> +as residency is purely managed by installing and removing/invalidating ptes.
>> +
>> +
>> +User/Memory Fence
>> +==================
>> +The idea is to take a user specified virtual address and install an interrupt
>> +handler to wake up the current task when the memory location passes the user
>> +supplied filter.
>> +
>> +A User/Memory fence is an <address, value> pair. To signal the user fence,
>> +the specified value will be written at the specified virtual address and
>> +the waiting process woken up. User can wait on a user fence with the
>> +gem_wait_user_fence ioctl.
>> +
>> +It also allows the user to emit their own MI_FLUSH/PIPE_CONTROL notify
>> +interrupt within their batches after updating the value to have sub-batch
>> +precision on the wakeup. Each batch can signal a user fence to indicate
>> +the completion of the next-level batch. The completion of the very first-level batch
>> +needs to be signaled by the command streamer. The user must provide the
>> +user/memory fence for this via the DRM_I915_GEM_EXECBUFFER_EXT_USER_FENCE
>> +extension of execbuff ioctl, so that KMD can setup the command streamer to
>> +signal it.
>> +
>> +User/Memory fence can also be supplied to the kernel driver to signal/wake up
>> +the user process after completion of an asynchronous operation.
>> +
>> +When the VM_BIND ioctl is provided with a user/memory fence via the
>> +I915_VM_BIND_EXT_USER_FENCE extension, it will be signaled upon the completion
>> +of binding of that mapping. All async binds/unbinds are serialized, hence
>> +signaling of the user/memory fence also indicates the completion of all previous
>> +binds/unbinds.
>> +
>> +This feature will be derived from the below original work:
>> +https://patchwork.freedesktop.org/patch/349417/
>> +
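
A rough userspace-side model of the <address, value> fence semantics described above: signaling writes the value to the user virtual address, and the wait-side filter checks whether the location has reached it. The struct and helper names here are made up for the sketch and are not part of the proposed uapi; the kernel's interrupt handler is reduced to a plain memory check.

```c
#include <assert.h>
#include <stdint.h>

/* One user/memory fence: a monitored address and the signaling value. */
struct user_fence {
	uint64_t *addr;  /* user virtual address to monitor */
	uint64_t value;  /* value that signals the fence */
};

/* Signal: write the specified value at the specified address. */
static void user_fence_signal(struct user_fence *f)
{
	*f->addr = f->value;
}

/* Wait-side filter: has the memory location passed the filter? */
static int user_fence_signaled(const struct user_fence *f)
{
	return *f->addr == f->value;
}
```

A waiter (gem_wait_user_fence in the proposal) would block until user_fence_signaled() becomes true, whether the write came from the command streamer, a batch, or the kernel driver.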
>> +
>> +VM_BIND use cases
>> +==================
>> +
>> +Long running Compute contexts
>> +------------------------------
>> +Usage of dma-fences expects that they complete in a reasonable amount of time.
>> +Compute on the other hand can be long running. Hence it is appropriate for
>> +compute to use user/memory fence and dma-fence usage will be limited to
>> +in-kernel consumption only. This requires an execbuff uapi extension to pass
>> +in user fence. Compute must opt-in for this mechanism with
>> +I915_CONTEXT_CREATE_FLAGS_LONG_RUNNING flag during context creation.
>> +
>> +The dma-fence based user interfaces like gem_wait ioctl, execbuff out fence
>> +and implicit dependency setting are not allowed on long running contexts.
>> +
>> +Where GPU page faults are not available, kernel driver upon buffer invalidation
>> +will initiate a suspend (preemption) of long running context with a dma-fence
>> +attached to it. And upon completion of that suspend fence, finish the
>> +invalidation, revalidate the BO and then resume the compute context. This is
>> +done by having a per-context fence (called suspend fence) proxying as
>> +i915_request fence. This suspend fence is enabled when there is a wait on it,
>> +which triggers the context preemption.
>> +
>> +This is much easier to support with VM_BIND compared to the current heavier
>> +execbuff path resource attachment.
>> +
>> +Low Latency Submission
>> +-----------------------
>> +Allows compute UMD to directly submit GPU jobs instead of through execbuff
>> +ioctl. VM_BIND allows map/unmap of BOs required for directly submitted jobs.
>> +
>> +Debugger
>> +---------
>> +With debug event interface user space process (debugger) is able to keep track
>> +of and act upon resources created by another process (debuggee) and attached
>> +to GPU via vm_bind interface.
>> +
>> +Mesa/Valkun
>
>s/Valkun/Vulkan/

Thanks Alex,
Will fix.

Niranjana

>
>Alex
>
>> +------------
>> +VM_BIND can potentially reduce the CPU-overhead in Mesa thus improving
>> +performance. For Vulkan it should be straightforward to use VM_BIND.
>> +For Iris implicit buffer tracking must be implemented before we can harness
>> +VM_BIND benefits. With increasing GPU hardware performance reducing CPU
>> +overhead becomes more important.
>> +
>> +Page level hints settings
>> +--------------------------
>> +VM_BIND allows any hints setting per mapping instead of per BO.
>> +Possible hints include read-only, placement and atomicity.
>> +Sub-BO level placement hint will be even more relevant with
>> +upcoming GPU on-demand page fault support.
>> +
>> +Page level Cache/CLOS settings
>> +-------------------------------
>> +VM_BIND allows cache/CLOS settings per mapping instead of per BO.
>> +
>> +Shared Virtual Memory (SVM) support
>> +------------------------------------
>> +VM_BIND interface can be used to map system memory directly (without gem BO
>> +abstraction) using the HMM interface.
>> +
>> +
>> +Broader i915 cleanups
>> +=====================
>> +Supporting this whole new vm_bind mode of binding which comes with its own
>> +usecases to support and the locking requirements requires proper integration
>> +with the existing i915 driver. This calls for some broader i915 driver
>> +cleanups/simplifications for maintainability of the driver going forward.
>> +Here are a few things that have been identified and are being looked into.
>> +
>> +- Make pagetable allocations evictable and manage them similar to VM_BIND
>> +  mapped objects. Page table pages are similar to persistent mappings of a
>> +  VM (the difference here is that page table pages will not
>> +  have an i915_vma structure, and after swapping pages back in, the parent page
>> +  link needs to be updated).
>> +- Remove vma lookup cache (eb->gem_context->handles_vma). The VM_BIND feature
>> +  does not use it, and the complexity it brings in is probably more than the
>> +  performance advantage we get in the legacy execbuff case.
>> +- Remove vma->open_count counting
>> +- Remove i915_vma active reference tracking. Instead use underlying BO's
>> +  dma-resv fence list to determine if a i915_vma is active or not.
>> +
>> +These can be worked upon after initial vm_bind support is added.
>> +
>> +
>> +UAPI
>> +=====
>> +Uapi definition can be found here:
>> +.. kernel-doc:: Documentation/gpu/rfc/i915_vm_bind.h
>> diff --git a/Documentation/gpu/rfc/index.rst b/Documentation/gpu/rfc/index.rst
>> index 91e93a705230..7d10c36b268d 100644
>> --- a/Documentation/gpu/rfc/index.rst
>> +++ b/Documentation/gpu/rfc/index.rst
>> @@ -23,3 +23,7 @@ host such documentation:
>>  .. toctree::
>>
>>      i915_scheduler.rst
>> +
>> +.. toctree::
>> +
>> +    i915_vm_bind.rst
>> --
>> 2.21.0.rc0.32.g243a4c7e27
>>

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [Intel-gfx] [RFC v2 1/2] drm/doc/rfc: VM_BIND feature design document
@ 2022-04-21  2:08       ` Niranjana Vishwanathapura
  0 siblings, 0 replies; 31+ messages in thread
From: Niranjana Vishwanathapura @ 2022-04-21  2:08 UTC (permalink / raw)
  To: Alex Deucher
  Cc: Daniel Vetter, Intel Graphics Development, Thomas Hellstrom,
	chris.p.wilson, Maling list - DRI developers

On Wed, Mar 09, 2022 at 10:58:09AM -0500, Alex Deucher wrote:
>On Mon, Mar 7, 2022 at 3:30 PM Niranjana Vishwanathapura
><niranjana.vishwanathapura@intel.com> wrote:
>>
>> VM_BIND design document with description of intended use cases.
>>
>> Signed-off-by: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>
>> ---
>>  Documentation/gpu/rfc/i915_vm_bind.rst | 210 +++++++++++++++++++++++++
>>  Documentation/gpu/rfc/index.rst        |   4 +
>>  2 files changed, 214 insertions(+)
>>  create mode 100644 Documentation/gpu/rfc/i915_vm_bind.rst
>>
>> diff --git a/Documentation/gpu/rfc/i915_vm_bind.rst b/Documentation/gpu/rfc/i915_vm_bind.rst
>> new file mode 100644
>> index 000000000000..cdc6bb25b942
>> --- /dev/null
>> +++ b/Documentation/gpu/rfc/i915_vm_bind.rst
>> @@ -0,0 +1,210 @@
>> +==========================================
>> +I915 VM_BIND feature design and use cases
>> +==========================================
>> +
>> +VM_BIND feature
>> +================
>> +DRM_I915_GEM_VM_BIND/UNBIND ioctls allows UMD to bind/unbind GEM buffer
>> +objects (BOs) or sections of a BOs at specified GPU virtual addresses on
>> +a specified address space (VM).
>> +
>> +These mappings (also referred to as persistent mappings) will be persistent
>> +across multiple GPU submissions (execbuff) issued by the UMD, without user
>> +having to provide a list of all required mappings during each submission
>> +(as required by older execbuff mode).
>> +
>> +VM_BIND ioctl deferes binding the mappings until next execbuff submission
>> +where it will be required, or immediately if I915_GEM_VM_BIND_IMMEDIATE
>> +flag is set (useful if mapping is required for an active context).
>> +
>> +VM_BIND feature is advertised to user via I915_PARAM_HAS_VM_BIND.
>> +User has to opt-in for VM_BIND mode of binding for an address space (VM)
>> +during VM creation time via I915_VM_CREATE_FLAGS_USE_VM_BIND extension.
>> +A VM in VM_BIND mode will not support older execbuff mode of binding.
>> +
>> +UMDs can still send BOs of these persistent mappings in execlist of execbuff
>> +for specifying BO dependencies (implicit fencing) and to use BO as a batch,
>> +but those BOs should be mapped ahead via vm_bind ioctl.
>> +
>> +VM_BIND features include,
>> +- Multiple Virtual Address (VA) mappings can map to the same physical pages
>> +  of an object (aliasing).
>> +- VA mapping can map to a partial section of the BO (partial binding).
>> +- Support capture of persistent mappings in the dump upon GPU error.
>> +- TLB is flushed upon unbind completion. Batching of TLB flushes in some
>> +  usecases will be helpful.
>> +- Asynchronous vm_bind and vm_unbind support.
>> +- VM_BIND uses user/memory fence mechanism for signaling bind completion
>> +  and for signaling batch completion in long running contexts (explained
>> +  below).
>> +
>> +VM_PRIVATE objects
>> +------------------
>> +By default, BOs can be mapped on multiple VMs and can also be dma-buf
>> +exported. Hence these BOs are referred to as Shared BOs.
>> +During each execbuff submission, the request fence must be added to the
>> +dma-resv fence list of all shared BOs mapped on the VM.
>> +
>> +VM_BIND feature introduces an optimization where user can create BO which
>> +is private to a specified VM via I915_GEM_CREATE_EXT_VM_PRIVATE flag during
>> +BO creation. Unlike Shared BOs, these VM private BOs can only be mapped on
>> +the VM they are private to and can't be dma-buf exported.
>> +All private BOs of a VM share the dma-resv object. Hence during each execbuff
>> +submission, they need only one dma-resv fence list updated. Thus the fast
>> +path (where required mappings are already bound) submission latency is O(1)
>> +w.r.t the number of VM private BOs.
>> +
>> +VM_BIND locking hirarchy
>> +-------------------------
>> +VM_BIND locking order is as below.
>> +
>> +1) A vm_bind mutex will protect vm_bind lists. This lock is taken in vm_bind/
>> +   vm_unbind ioctl calls, in the execbuff path and while releasing the mapping.
>> +
>> +   In future, when GPU page faults are supported, we can potentially use a
>> +   rwsem instead, so that multiple pagefault handlers can take the read side
>> +   lock to lookup the mapping and hence can run in parallel.
>> +
>> +2) The BO's dma-resv lock will protect i915_vma state and needs to be held
>> +   while binding a vma and while updating dma-resv fence list of a BO.
>> +   The private BOs of a VM will all share a dma-resv object.
>> +
>> +   This lock is held in vm_bind call for immediate binding, during vm_unbind
>> +   call for unbinding and during execbuff path for binding the mapping and
>> +   updating the dma-resv fence list of the BO.
>> +
>> +3) Spinlock/s to protect some of the VM's lists.
>> +
>> +We will also need support for bluk LRU movement of persistent mapping to
>> +avoid additional latencies in execbuff path.
>> +
>> +GPU page faults
>> +----------------
>> +Both older execbuff mode and the newer VM_BIND mode of binding will require
>> +using dma-fence to ensure residency.
>> +In future when GPU page faults are supported, no dma-fence usage is required
>> +as residency is purely managed by installing and removing/invalidating ptes.
>> +
>> +
>> +User/Memory Fence
>> +==================
>> +The idea is to take a user specified virtual address and install an interrupt
>> +handler to wake up the current task when the memory location passes the user
>> +supplied filter.
>> +
>> +A User/Memory fence is an <address, value> pair. To signal the user fence,
>> +the specified value is written at the specified virtual address and the
>> +waiting process is woken up. User space can wait on a user fence with the
>> +gem_wait_user_fence ioctl.
>> +
>> +It also allows the user to emit their own MI_FLUSH/PIPE_CONTROL notify
>> +interrupt within their batches after updating the value, to have sub-batch
>> +precision on the wakeup. Each batch can signal a user fence to indicate
>> +the completion of the next level batch. The completion of the very first
>> +level batch needs to be signaled by the command streamer. The user must
>> +provide the user/memory fence for this via the
>> +DRM_I915_GEM_EXECBUFFER_EXT_USER_FENCE extension of the execbuff ioctl,
>> +so that the KMD can set up the command streamer to signal it.
>> +
>> +User/Memory fence can also be supplied to the kernel driver to signal/wake up
>> +the user process after completion of an asynchronous operation.
>> +
>> +When the VM_BIND ioctl is provided with a user/memory fence via the
>> +I915_VM_BIND_EXT_USER_FENCE extension, it will be signaled upon the completion
>> +of binding of that mapping. All async binds/unbinds are serialized, hence
>> +signaling of the user/memory fence also indicates the completion of all
>> +previous binds/unbinds.
>> +
>> +This feature will be derived from the below original work:
>> +https://patchwork.freedesktop.org/patch/349417/
>> +
>> +
>> +VM_BIND use cases
>> +==================
>> +
>> +Long running Compute contexts
>> +------------------------------
>> +Usage of dma-fence expects that it completes in a reasonable amount of time.
>> +Compute on the other hand can be long running. Hence it is appropriate for
>> +compute to use user/memory fences, and dma-fence usage will be limited to
>> +in-kernel consumption only. This requires an execbuff uapi extension to pass
>> +in a user fence. Compute must opt in to this mechanism with the
>> +I915_CONTEXT_CREATE_FLAGS_LONG_RUNNING flag during context creation.
>> +
>> +The dma-fence based user interfaces like the gem_wait ioctl, execbuff out
>> +fence and implicit dependency setting are not allowed on long running
>> +contexts.
>> +
>> +Where GPU page faults are not available, the kernel driver, upon buffer
>> +invalidation, will initiate a suspend (preemption) of the long running
>> +context with a dma-fence attached to it. Upon completion of that suspend
>> +fence, it finishes the invalidation, revalidates the BO and then resumes
>> +the compute context. This is done by having a per-context fence (called a
>> +suspend fence) proxying as the i915_request fence. This suspend fence is
>> +enabled when there is a wait on it, which triggers the context preemption.
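[Editorial sketch] The invalidation flow described above can be summarized as kernel-side pseudocode. The function names are illustrative assumptions, not actual i915 symbols:

    /* Illustrative pseudocode only -- not actual i915 code. */
    static void invalidate_bo_on_lr_context(struct bo *bo,
                                            struct lr_context *ctx)
    {
            struct dma_fence *suspend = ctx->suspend_fence;

            /* Waiting on the suspend fence triggers context preemption. */
            dma_fence_enable_sw_signaling(suspend);
            dma_fence_wait(suspend, false);

            finish_invalidation(bo);  /* ptes removed, backing released  */
            revalidate(bo);           /* pages back in, mapping rebound  */
            resume_context(ctx);      /* fresh suspend fence, run again  */
    }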
>> +
>> +This is much easier to support with VM_BIND compared to the current heavier
>> +execbuff path resource attachment.
>> +
>> +Low Latency Submission
>> +-----------------------
>> +Allows compute UMD to directly submit GPU jobs instead of through execbuff
>> +ioctl. VM_BIND allows map/unmap of BOs required for directly submitted jobs.
>> +
>> +Debugger
>> +---------
>> +With the debug event interface, a user space process (debugger) is able to
>> +keep track of and act upon resources created by another process (debuggee)
>> +and attached to the GPU via the vm_bind interface.
>> +
>> +Mesa/Valkun
>
>s/Valkun/Vulkan/

Thanks Alex,
Will fix.

Niranjana

>
>Alex
>
>> +------------
>> +VM_BIND can potentially reduce the CPU overhead in Mesa, thus improving
>> +performance. For Vulkan it should be straightforward to use VM_BIND.
>> +For Iris, implicit buffer tracking must be implemented before we can harness
>> +VM_BIND benefits. With increasing GPU hardware performance, reducing CPU
>> +overhead becomes more important.
>> +
>> +Page level hints settings
>> +--------------------------
>> +VM_BIND allows any hints setting per mapping instead of per BO.
>> +Possible hints include read-only, placement and atomicity.
>> +Sub-BO level placement hint will be even more relevant with
>> +upcoming GPU on-demand page fault support.
>> +
>> +Page level Cache/CLOS settings
>> +-------------------------------
>> +VM_BIND allows cache/CLOS settings per mapping instead of per BO.
>> +
>> +Shared Virtual Memory (SVM) support
>> +------------------------------------
>> +VM_BIND interface can be used to map system memory directly (without gem BO
>> +abstraction) using the HMM interface.
>> +
>> +
>> +Broader i915 cleanups
>> +=====================
>> +Supporting this whole new vm_bind mode of binding, which comes with its own
>> +usecases to support and its locking requirements, requires proper integration
>> +with the existing i915 driver. This calls for some broader i915 driver
>> +cleanups/simplifications for maintainability of the driver going forward.
>> +Here are a few things identified that are being looked into.
>> +
>> +- Make pagetable allocations evictable and manage them similar to VM_BIND
>> +  mapped objects. Page table pages are similar to persistent mappings of a
>> +  VM (the difference here is that the page table pages will not
>> +  have an i915_vma structure, and after swapping pages back in, the parent
>> +  page link needs to be updated).
>> +- Remove the vma lookup cache (eb->gem_context->handles_vma). The VM_BIND
>> +  feature does not use it, and the complexity it brings in is probably more
>> +  than the performance advantage we get in the legacy execbuff case.
>> +- Remove vma->open_count counting
>> +- Remove i915_vma active reference tracking. Instead use the underlying BO's
>> +  dma-resv fence list to determine if an i915_vma is active or not.
>> +
>> +These can be worked upon after initial vm_bind support is added.
>> +
>> +
>> +UAPI
>> +=====
>> +Uapi definition can be found here:
>> +.. kernel-doc:: Documentation/gpu/rfc/i915_vm_bind.h
>> diff --git a/Documentation/gpu/rfc/index.rst b/Documentation/gpu/rfc/index.rst
>> index 91e93a705230..7d10c36b268d 100644
>> --- a/Documentation/gpu/rfc/index.rst
>> +++ b/Documentation/gpu/rfc/index.rst
>> @@ -23,3 +23,7 @@ host such documentation:
>>  .. toctree::
>>
>>      i915_scheduler.rst
>> +
>> +.. toctree::
>> +
>> +    i915_vm_bind.rst
>> --
>> 2.21.0.rc0.32.g243a4c7e27
>>


* Re: [RFC v2 1/2] drm/doc/rfc: VM_BIND feature design document
  2022-04-20 22:50         ` [Intel-gfx] " Niranjana Vishwanathapura
@ 2022-04-27 13:53           ` Daniel Vetter
  -1 siblings, 0 replies; 31+ messages in thread
From: Daniel Vetter @ 2022-04-27 13:53 UTC (permalink / raw)
  To: Niranjana Vishwanathapura
  Cc: dri-devel, chris.p.wilson, Dave Airlie, intel-gfx, Bloomfield,
	Jon, Ben Skeggs, Jason Ekstrand, daniel.vetter, thomas.hellstrom,
	Christian König

On Wed, Apr 20, 2022 at 03:50:00PM -0700, Niranjana Vishwanathapura wrote:
> On Thu, Mar 31, 2022 at 01:37:08PM +0200, Daniel Vetter wrote:
> > One thing I've forgotten, since it's only hinted at here: If/when we
> > switch tlb flushing from the current dumb&synchronous implementation
> > we now have in i915 in upstream to one with batching using dma_fence,
> > then I think that should be something which is done with a small
> > helper library of shared code too. The batching is somewhat tricky,
> > and you need to make sure you put the fence into the right
> > dma_resv_usage slot, and the trick of replacing the vm fence with a
> > tlb flush fence is also a good reason to share the code so we only
> > have it once.
> > 
> > Christian's recent work also has some prep work for this already with
> > the fence replacing trick.
> 
> Sure, but this optimization is not required for initial vm_bind support
> to land right? We can look at it soon after that. Is that ok?
> I have made a reference to this TLB flush batching work in the rst file.

Yeah for now we can just rely on the tlb flush we do on vma unbinding,
which also means there's no need for any separate tlb flushing in vm_bind
related code. This was just a thought I dropped on here to make sure we
have a complete picture.
-Daniel


> 
> Niranjana
> 
> > -Daniel
> > 
> > On Thu, 31 Mar 2022 at 10:28, Daniel Vetter <daniel@ffwll.ch> wrote:
> > > Adding a pile of people who've expressed interest in vm_bind for their
> > > drivers.
> > > 
> > > Also note to the intel folks: This is largely written with me having my
> > > subsystem co-maintainer hat on, i.e. what I think is the right thing to do
> > > here for the subsystem at large. There is substantial rework involved
> > > here, but it's not any different from i915 adopting ttm or i915 adopting
> > > drm/sched, and I do think this stuff needs to happen in one form or
> > > another.
> > > 
> > > On Mon, Mar 07, 2022 at 12:31:45PM -0800, Niranjana Vishwanathapura wrote:
> > > > VM_BIND design document with description of intended use cases.
> > > >
> > > > Signed-off-by: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>
> > > > ---
> > > >  Documentation/gpu/rfc/i915_vm_bind.rst | 210 +++++++++++++++++++++++++
> > > >  Documentation/gpu/rfc/index.rst        |   4 +
> > > >  2 files changed, 214 insertions(+)
> > > >  create mode 100644 Documentation/gpu/rfc/i915_vm_bind.rst
> > > >
> > > > diff --git a/Documentation/gpu/rfc/i915_vm_bind.rst b/Documentation/gpu/rfc/i915_vm_bind.rst
> > > > new file mode 100644
> > > > index 000000000000..cdc6bb25b942
> > > > --- /dev/null
> > > > +++ b/Documentation/gpu/rfc/i915_vm_bind.rst
> > > > @@ -0,0 +1,210 @@
> > > > +==========================================
> > > > +I915 VM_BIND feature design and use cases
> > > > +==========================================
> > > > +
> > > > +VM_BIND feature
> > > > +================
> > > > +DRM_I915_GEM_VM_BIND/UNBIND ioctls allows UMD to bind/unbind GEM buffer
> > > > +objects (BOs) or sections of a BOs at specified GPU virtual addresses on
> > > > +a specified address space (VM).
> > > > +
> > > > +These mappings (also referred to as persistent mappings) will be persistent
> > > > +across multiple GPU submissions (execbuff) issued by the UMD, without user
> > > > +having to provide a list of all required mappings during each submission
> > > > +(as required by older execbuff mode).
> > > > +
> > > > +The VM_BIND ioctl defers binding the mappings until the next execbuff
> > > > +where it will be required, or immediately if I915_GEM_VM_BIND_IMMEDIATE
> > > > +flag is set (useful if mapping is required for an active context).
> > > 
> > > So this is a screw-up I've done, and for upstream I think we need to fix
> > > it: Implicit sync is bad, and it's also still a bad idea for vm_bind, and
> > > I was wrong suggesting we should do this a few years back when we kicked
> > > this off internally :-(
> > > 
> > > What I think we need is just always VM_BIND_IMMEDIATE mode, and then a few
> > > things on top:
> > > - in and out fences, like with execbuf, to allow userspace to sync with
> > >   execbuf as needed
> > > - for compute-mode context this means userspace memory fences
> > > - for legacy context this means a timeline syncobj in drm_syncobj
> > > 
> > > No sync_file or anything else like this at all. This means a bunch of
> > > work, but also it'll have benefits because it means we should be able to
> > > use exactly the same code paths and logic for both compute and for legacy
> > > context, because drm_syncobj supports future fence semantics.
> > > 
> > > Also on the implementation side we still need to install dma_fence to the
> > > various dma_resv, and for this we need the new dma_resv_usage series from
> > > Christian König first. vm_bind fences can then use the USAGE_BOOKKEEPING
> > > flag to make sure they never result in an oversync issue with execbuf. I
> > > don't think trying to land vm_bind without that prep work in
> > > dma_resv_usage makes sense.
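[Editorial sketch] Installing a vm_bind fence without causing oversync with execbuf would then look roughly like the following kernel-side pseudocode. dma_resv_lock()/dma_resv_unlock() are the existing reservation API; dma_resv_add_fence() with a usage argument and DMA_RESV_USAGE_BOOKKEEP are from Christian König's dma_resv_usage series; the surrounding names are assumptions:

    /* Pseudocode sketch: track a bind in the BO's reservation object
     * without making implicit sync (execbuf) wait on it. */
    ret = dma_resv_lock(bo->resv, NULL);
    if (ret)
            return ret;
    ret = dma_resv_reserve_fences(bo->resv, 1);
    if (!ret)
            dma_resv_add_fence(bo->resv, bind_fence,
                               DMA_RESV_USAGE_BOOKKEEP);
    dma_resv_unlock(bo->resv);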
> > > 
> > > Also as soon as dma_resv_usage has landed there's a few cleanups we should
> > > do in i915:
> > > - ttm bo moving code should probably simplify a bit (and maybe more of the
> > >   code should be pushed as helpers into ttm)
> > > - clflush code should be moved over to using USAGE_KERNEL and the various
> > >   hacks and special cases should be ditched. See df94fd05e69e ("drm/i915:
> > >   expand on the kernel-doc for cache_dirty") for a bit more context
> > > 
> > > This is still not yet enough, since if a vm_bind races with an eviction we
> > > might stall on the new buffers being readied first before the context can
> > > continue. This needs some care to make sure that vma which aren't fully
> > > bound yet are on a separate list, and vma which are marked for unbinding
> > > are removed from the main working set list as soon as possible.
> > > 
> > > All of these things are relevant for the uapi semantics, which means
> > > - they need to be documented in the uapi kerneldoc, ideally with example
> > >   flows
> > > - umd need to ack this
> > > 
> > > The other thing here is the async/nonblocking path. I think we still need
> > > that one, but again it should not sync with anything going on in execbuf,
> > > but simply execute the ioctl code in a kernel thread. The idea here is
> > > that this works like a special gpu engine, so that compute and vk can
> > > schedule bindings interleaved with rendering. This should be enough to get
> > > a performant vk sparse binding/textures implementation.
> > > 
> > > But I'm not entirely sure on this one, so this definitely needs acks from
> > > umds.
> > > 
> > > > +VM_BIND feature is advertised to user via I915_PARAM_HAS_VM_BIND.
> > > > +User has to opt-in for VM_BIND mode of binding for an address space (VM)
> > > > +during VM creation time via I915_VM_CREATE_FLAGS_USE_VM_BIND extension.
> > > > +A VM in VM_BIND mode will not support older execbuff mode of binding.
> > > > +
> > > > +UMDs can still send BOs of these persistent mappings in the execlist of
> > > > +execbuff for specifying BO dependencies (implicit fencing) and to use a BO
> > > > +as a batch, but those BOs should be mapped ahead via the vm_bind ioctl.
> > > 
> > > should or must?
> > > 
> > > Also I'm not really sure that's a great interface. The batchbuffer really
> > > only needs to be an address, so maybe all we need is an extension to
> > > supply an u64 batchbuffer address instead of trying to retrofit this into
> > > an unfitting current uapi.
> > > 
> > > And for implicit sync there's two things:
> > > - for vk I think the right uapi is the dma-buf fence import/export ioctls
> > >   from Jason Ekstrand. I think we should land that first instead of
> > >   hacking funny concepts together
> > > - for gl the dma-buf import/export might not be fast enough, since gl
> > >   needs to do a _lot_ of implicit sync. There we might need to use the
> > >   execbuffer buffer list, but then we should have extremely clear uapi
> > >   rules which disallow _everything_ except setting the explicit sync uapi
> > > 
> > > Again all this stuff needs to be documented in detail in the kerneldoc
> > > uapi spec.
> > > 
> > > > +VM_BIND features include,
> > > > +- Multiple Virtual Address (VA) mappings can map to the same physical pages
> > > > +  of an object (aliasing).
> > > > +- VA mapping can map to a partial section of the BO (partial binding).
> > > > +- Support capture of persistent mappings in the dump upon GPU error.
> > > > +- TLB is flushed upon unbind completion. Batching of TLB flushes in some
> > > > +  usecases will be helpful.
> > > > +- Asynchronous vm_bind and vm_unbind support.
> > > > +- VM_BIND uses user/memory fence mechanism for signaling bind completion
> > > > +  and for signaling batch completion in long running contexts (explained
> > > > +  below).
> > > 
> > > This should all be in the kerneldoc.
> > > 
> > > > +VM_PRIVATE objects
> > > > +------------------
> > > > +By default, BOs can be mapped on multiple VMs and can also be dma-buf
> > > > +exported. Hence these BOs are referred to as Shared BOs.
> > > > +During each execbuff submission, the request fence must be added to the
> > > > +dma-resv fence list of all shared BOs mapped on the VM.
> > > > +
> > > > +The VM_BIND feature introduces an optimization where the user can create a BO
> > > > +which is private to a specified VM via the I915_GEM_CREATE_EXT_VM_PRIVATE flag
> > > > +during BO creation. Unlike shared BOs, these VM private BOs can only be mapped
> > > > +on the VM they are private to and can't be dma-buf exported.
> > > > +All private BOs of a VM share the dma-resv object. Hence during each execbuff
> > > > +submission, they need only one dma-resv fence list updated. Thus the fast
> > > > +path (where required mappings are already bound) submission latency is O(1)
> > > > +w.r.t the number of VM private BOs.
> > > 
> > > Two things:
> > > 
> > > - I think the above is required for initial vm_bind for vk, it kinda
> > >   doesn't make much sense without that, and will allow us to match amdgpu
> > >   and radeonsi
> > > 
> > > - Christian König just landed ttm bulk lru helpers, and I think we need to
> > >   use those. This means vm_bind will only work with the ttm backend, but
> > >   that's what we have for the big dgpu where vm_bind helps more in terms
> > >   of performance, and the igfx conversion to ttm is already going on.
> > > 
> > > Furthermore the i915 shrinker lru has stopped being an lru, so I think
> > > that should also be moved over to the ttm lru in some fashion to make sure
> > > we once again have a reasonable and consistent memory aging and reclaim
> > > architecture. The current code is just too much of a complete mess.
> > > 
> > > And since this is all fairly integral to how the code arch works I don't
> > > think merging a different version which isn't based on ttm bulk lru
> > > helpers makes sense.
> > > 
> > > Also I do think the page table lru handling needs to be included here,
> > > because that's another complete hand-rolled separate world for not much
> > > good reasons. I guess that can happen in parallel with the initial vm_bind
> > > bring-up, but it needs to be completed by the time we add the features
> > > beyond the initial support needed for vk.
> > > 
> > > > +VM_BIND locking hierarchy
> > > > +-------------------------
> > > > +VM_BIND locking order is as below.
> > > > +
> > > > +1) A vm_bind mutex will protect vm_bind lists. This lock is taken in vm_bind/
> > > > +   vm_unbind ioctl calls, in the execbuff path and while releasing the mapping.
> > > > +
> > > > +   In the future, when GPU page faults are supported, we can potentially use a
> > > > +   rwsem instead, so that multiple pagefault handlers can take the read side
> > > > +   lock to lookup the mapping and hence can run in parallel.
> > > > +
> > > > +2) The BO's dma-resv lock will protect i915_vma state and needs to be held
> > > > +   while binding a vma and while updating dma-resv fence list of a BO.
> > > > +   The private BOs of a VM will all share a dma-resv object.
> > > > +
> > > > +   This lock is held in vm_bind call for immediate binding, during vm_unbind
> > > > +   call for unbinding and during execbuff path for binding the mapping and
> > > > +   updating the dma-resv fence list of the BO.
> > > > +
> > > > +3) Spinlock/s to protect some of the VM's lists.
> > > > +
> > > > +We will also need support for bulk LRU movement of persistent mappings to
> > > > +avoid additional latencies in the execbuff path.
> > > 
> > > This needs more detail and explanation of how each level is required. Also
> > > the shared dma_resv for VM_PRIVATE objects is kinda important to explain.
> > > 
> > > Like "some of the VM's lists" explains pretty much nothing.
> > > 
> > > > +
> > > > +GPU page faults
> > > > +----------------
> > > > +Both the older execbuff mode and the newer VM_BIND mode of binding will
> > > > +require using dma-fence to ensure residency.
> > > > +In the future, when GPU page faults are supported, no dma-fence usage is
> > > > +required as residency is purely managed by installing and
> > > > +removing/invalidating ptes.
> > > 
> > > This is a bit confusing. I think one part of this should be moved into the
> > > section with future vm_bind use-cases (we're not going to support page
> > > faults with legacy softpin or even worse, relocations). The locking
> > > discussion should be part of the much longer list of use cases that
> > > motivate the locking design.
> > > 
> > > > +
> > > > +
> > > > +User/Memory Fence
> > > > +==================
> > > > +The idea is to take a user specified virtual address and install an interrupt
> > > > +handler to wake up the current task when the memory location passes the user
> > > > +supplied filter.
> > > > +
> > > > +A User/Memory fence is an <address, value> pair. To signal the user fence,
> > > > +the specified value is written at the specified virtual address and the
> > > > +waiting process is woken up. User space can wait on a user fence with the
> > > > +gem_wait_user_fence ioctl.
> > > > +
> > > > +It also allows the user to emit their own MI_FLUSH/PIPE_CONTROL notify
> > > > +interrupt within their batches after updating the value, to have sub-batch
> > > > +precision on the wakeup. Each batch can signal a user fence to indicate
> > > > +the completion of the next level batch. The completion of the very first
> > > > +level batch needs to be signaled by the command streamer. The user must
> > > > +provide the user/memory fence for this via the
> > > > +DRM_I915_GEM_EXECBUFFER_EXT_USER_FENCE extension of the execbuff ioctl,
> > > > +so that the KMD can set up the command streamer to signal it.
> > > > +
> > > > +User/Memory fence can also be supplied to the kernel driver to signal/wake up
> > > > +the user process after completion of an asynchronous operation.
> > > > +
> > > > +When the VM_BIND ioctl is provided with a user/memory fence via the
> > > > +I915_VM_BIND_EXT_USER_FENCE extension, it will be signaled upon the completion
> > > > +of binding of that mapping. All async binds/unbinds are serialized, hence
> > > > +signaling of the user/memory fence also indicates the completion of all
> > > > +previous binds/unbinds.
> > > > +
> > > > +This feature will be derived from the below original work:
> > > > +https://patchwork.freedesktop.org/patch/349417/
> > > 
> > > This is 1:1 tied to long running compute mode contexts (which in the uapi
> > > doc must reference the endless amounts of bikeshed summary we have in the
> > > docs about indefinite fences).
> > > 
> > > I'd put this into a new section about compute and userspace memory fences
> > > support, with this and the next chapter ...
> > > > +
> > > > +
> > > > +VM_BIND use cases
> > > > +==================
> > > 
> > > ... and then make this section here focus entirely on additional vm_bind
> > > use-cases that we'll be adding later on. Which doesn't need to go into any
> > > details, it's just justification for why we want to build the world on top
> > > of vm_bind.
> > > 
> > > > +
> > > > +Long running Compute contexts
> > > > +------------------------------
> > > > +Usage of dma-fence expects that it completes in a reasonable amount of time.
> > > > +Compute on the other hand can be long running. Hence it is appropriate for
> > > > +compute to use user/memory fences, and dma-fence usage will be limited to
> > > > +in-kernel consumption only. This requires an execbuff uapi extension to pass
> > > > +in a user fence. Compute must opt in to this mechanism with the
> > > > +I915_CONTEXT_CREATE_FLAGS_LONG_RUNNING flag during context creation.
> > > > +
> > > > +The dma-fence based user interfaces like the gem_wait ioctl, execbuff out
> > > > +fence and implicit dependency setting are not allowed on long running
> > > > +contexts.
> > > > +
> > > > +Where GPU page faults are not available, the kernel driver, upon buffer
> > > > +invalidation, will initiate a suspend (preemption) of the long running
> > > > +context with a dma-fence attached to it. Upon completion of that suspend
> > > > +fence, it finishes the invalidation, revalidates the BO and then resumes
> > > > +the compute context. This is done by having a per-context fence (called a
> > > > +suspend fence) proxying as the i915_request fence. This suspend fence is
> > > > +enabled when there is a wait on it, which triggers the context preemption.
> > > > +
> > > > +This is much easier to support with VM_BIND compared to the current heavier
> > > > +execbuff path resource attachment.
> > > 
> > > There's a bunch of tricky code around compute mode context support, like
> > > the preempt ctx fence (or suspend fence or whatever you want to call it),
> > > and the resume work. And I think that code should be shared across
> > > drivers.
> > > 
> > > I think the right place to put this is into drm/sched, somewhere attached
> > > to the drm_sched_entity structure. I expect i915 folks to collaborate with
> > > amd and ideally also get amdkfd to adopt the same thing if possible. At
> > > least Christian has mentioned in the past that he's a bit unhappy about
> > > how this works.
> > > 
> > > Also drm/sched has dependency tracking, which will be needed to pipeline
> > > context resume operations. That needs to be used instead of i915-gem
> > > inventing yet another dependency tracking data structure (it already has 3
> > > and that's roughly 3 too many).
> > > 
> > > This means compute mode support and userspace memory fences are blocked on
> > > the drm/sched conversion, but *eh* add it to the list of reasons for why
> > > drm/sched needs to happen.
> > > 
> > > Also since we only have support for compute mode ctx in our internal tree
> > > with the guc scheduler backend anyway, and the first conversion target is
> > > the guc backend, I don't think this actually holds up a lot of the code.
> > > 
> > > > +Low Latency Submission
> > > > +-----------------------
> > > > +Allows compute UMD to directly submit GPU jobs instead of through execbuff
> > > > +ioctl. VM_BIND allows map/unmap of BOs required for directly submitted jobs.
> > > 
> > > This is really just a special case of compute mode contexts, I think I'd
> > > include that in there, but explain better what it requires (i.e. vm_bind
> > > not being synchronized against execbuf).
> > > 
> > > > +
> > > > +Debugger
> > > > +---------
> > > > +With the debug event interface, a user space process (debugger) is able to
> > > > +keep track of and act upon resources created by another process (debuggee)
> > > > +and attached to the GPU via the vm_bind interface.
> > > > +
> > > > +Mesa/Vulkan
> > > > +------------
> > > > +VM_BIND can potentially reduce the CPU overhead in Mesa, thus improving
> > > > +performance. For Vulkan it should be straightforward to use VM_BIND.
> > > > +For Iris, implicit buffer tracking must be implemented before we can harness
> > > > +VM_BIND benefits. With increasing GPU hardware performance, reducing CPU
> > > > +overhead becomes more important.
> > > 
> > > Just to clarify, I don't think we can land vm_bind into upstream if it
> > > doesn't work 100% for vk. There's a bit much "can" instead of "will in
> > > this section".
> > > 
> > > > +
> > > > +Page level hints settings
> > > > +--------------------------
> > > > +VM_BIND allows any hints setting per mapping instead of per BO.
> > > > +Possible hints include read-only, placement and atomicity.
> > > > +Sub-BO level placement hint will be even more relevant with
> > > > +upcoming GPU on-demand page fault support.
> > > > +
> > > > +Page level Cache/CLOS settings
> > > > +-------------------------------
> > > > +VM_BIND allows cache/CLOS settings per mapping instead of per BO.
> > > > +
> > > > +Shared Virtual Memory (SVM) support
> > > > +------------------------------------
> > > > +VM_BIND interface can be used to map system memory directly (without gem BO
> > > > +abstraction) using the HMM interface.
> > > 
> > > Userptr is absent here (and it's not the same as svm, at least on
> > > discrete), and this is needed for the initial version since otherwise vk
> > > can't use it because we're not at feature parity.
> > > 
> > > Irc discussions by Maarten and Dave came up with the idea that maybe
> > > userptr for vm_bind should work _without_ any gem bo as backing storage,
> > > since that guarantees that people don't come up with funny ideas like
> > > trying to share such bo across process or mmap it and other nonsense which
> > > just doesn't work.
> > > 
> > > > +
> > > > +
> > > > +Broader i915 cleanups
> > > > +=====================
> > > > +Supporting this whole new vm_bind mode of binding, which comes with its own
> > > > +usecases to support and its locking requirements, requires proper integration
> > > > +with the existing i915 driver. This calls for some broader i915 driver
> > > > +cleanups/simplifications for maintainability of the driver going forward.
> > > > +Here are a few things identified that are being looked into.
> > > > +
> > > > +- Make pagetable allocations evictable and manage them similar to VM_BIND
> > > > +  mapped objects. Page table pages are similar to persistent mappings of a
> > > > +  VM (the difference here is that the page table pages will not
> > > > +  have an i915_vma structure, and after swapping pages back in, the parent
> > > > +  page link needs to be updated).
> > > 
> > > See above, but I think this should be included as part of the initial
> > > vm_bind push.
> > > 
> > > > +- Remove the vma lookup cache (eb->gem_context->handles_vma). The VM_BIND
> > > > +  feature does not use it, and the complexity it brings in is probably more
> > > > +  than the performance advantage we get in the legacy execbuff case.
> > > > +- Remove vma->open_count counting
> > > > +- Remove i915_vma active reference tracking. Instead use the underlying BO's
> > > > +  dma-resv fence list to determine if an i915_vma is active or not.
> > > 
> > > So this is a complete mess, and really should not exist. I think it needs
> > > to be removed before we try to make i915_vma even more complex by adding
> > > vm_bind.
> > > 
> > > The other thing I've been pondering here is that vm_bind is really
> > > completely different from legacy vm structures for a lot of reasons:
> > > - no relocation or softpin handling, which means vm_bind has no reason to
> > >   ever look at the i915_vma structure in execbuf code. Unfortunately
> > >   execbuf has been rewritten to be vma instead of obj centric, so it's a
> > >   100% mismatch
> > > 
> > > - vm_bind never has to manage any vm lru. Legacy execbuf has to maintain
> > >   that because the kernel manages the virtual address space fully. Again
> > >   ideally that entire vma_move_to_active code and everything related to it
> > >   would simply not exist.
> > > 
> > > - similar on the eviction side, the rules are quite different: For vm_bind
> > >   we never tear down the vma, instead it's just moved to the list of
> > >   evicted vma. Legacy vm have no need for all these additional lists, so
> > >   another huge confusion.
> > > 
> > > - if the refcount is done correctly for vm_bind we wouldn't need the
> > >   tricky code in the bo close paths. Unfortunately legacy vm with
> > >   relocations and softpin require that vma are only a weak reference, so
> > >   that cannot be removed.
> > > 
> > > - there's also a ton of special cases for ggtt handling, like the
> > >   different views (for display, partial views for mmap), but also the
> > >   gen2/3 alignment and padding requirements which vm_bind never needs.
> > > 
> > > I think the right thing here is to massively split the implementation
> > > behind some solid vm/vma abstraction, with a base class for vm and vma
> > > which _only_ has the pieces which both vm_bind and the legacy vm stuff
> > > needs. But it's a bit tricky to get there. I think a workable path would
> > > be:
> > > - Add a new base class to both i915_address_space and i915_vma, which
> > >   starts out empty.
> > > 
> > > - As vm_bind code lands, move things that vm_bind code needs into these
> > >   base classes
> > > 
> > > - The goal should be that these base classes are a stand-alone library
> > >   that other drivers could reuse. Like we've done with the buddy
> > >   allocator, which first moved from i915-gem to i915-ttm, and which amd
> > >   now moved to drm/ttm for reuse by amdgpu. Ideally other drivers
> > >   interested in adding something like vm_bind should be involved from the
> > >   start (or maybe the entire thing reused in amdgpu, they're looking at
> > >   vk sparse binding support too or at least have perf issues I think).
> > > 
> > > - Locking must be the same across all implemntations, otherwise it's
> > >   really not an abstract. i915 screwed this up terribly by having
> > >   different locking rules for ppgtt and ggtt, which is just nonsense.
> > > 
> > > - The legacy specific code needs to be extracted as much as possible and
> > >   shoved into separate files. In execbuf this means we need to get back to
> > >   object centric flow, and the slowpaths need to become a lot simpler
> > >   again (Maarten has cleaned up some of this, but there's still a silly
> > >   amount of hacks in there with funny layering).
> > > 
> > > - I think if stuff like the vma eviction details (list movement and
> > >   locking and refcounting of the underlying object)
> > > 
> > > > +
> > > > +These can be worked upon after intitial vm_bind support is added.
> > > 
> > > I don't think that works, given how badly i915-gem team screwed up in
> > > other places. And those places had to be fixed by adopting shared code
> > > like ttm. Plus there's already a huge unfulffiled promise pending with the
> > > drm/sched conversion, i915-gem team is clearly deeply in the red here :-/
> > > 
> > > Cheers, Daniel
> > > 
> > > > +
> > > > +
> > > > +UAPI
> > > > +=====
> > > > +Uapi definiton can be found here:
> > > > +.. kernel-doc:: Documentation/gpu/rfc/i915_vm_bind.h
> > > > diff --git a/Documentation/gpu/rfc/index.rst b/Documentation/gpu/rfc/index.rst
> > > > index 91e93a705230..7d10c36b268d 100644
> > > > --- a/Documentation/gpu/rfc/index.rst
> > > > +++ b/Documentation/gpu/rfc/index.rst
> > > > @@ -23,3 +23,7 @@ host such documentation:
> > > >  .. toctree::
> > > >
> > > >      i915_scheduler.rst
> > > > +
> > > > +.. toctree::
> > > > +
> > > > +    i915_vm_bind.rst
> > > > --
> > > > 2.21.0.rc0.32.g243a4c7e27
> > > >
> > > 
> > > --
> > > Daniel Vetter
> > > Software Engineer, Intel Corporation
> > > http://blog.ffwll.ch
> > 
> > 
> > 
> > -- 
> > Daniel Vetter
> > Software Engineer, Intel Corporation
> > http://blog.ffwll.ch

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [Intel-gfx] [RFC v2 1/2] drm/doc/rfc: VM_BIND feature design document
@ 2022-04-27 13:53           ` Daniel Vetter
  0 siblings, 0 replies; 31+ messages in thread
From: Daniel Vetter @ 2022-04-27 13:53 UTC (permalink / raw)
  To: Niranjana Vishwanathapura
  Cc: dri-devel, chris.p.wilson, Dave Airlie, intel-gfx, Daniel Stone,
	Ben Skeggs, daniel.vetter, thomas.hellstrom,
	Christian König

On Wed, Apr 20, 2022 at 03:50:00PM -0700, Niranjana Vishwanathapura wrote:
> On Thu, Mar 31, 2022 at 01:37:08PM +0200, Daniel Vetter wrote:
> > One thing I've forgotten, since it's only hinted at here: If/when we
> > switch tlb flushing from the current dumb&synchronous implementation
> > we now have in i915 in upstream to one with batching using dma_fence,
> > then I think that should be something which is done with a small
> > helper library of shared code too. The batching is somewhat tricky,
> > and you need to make sure you put the fence into the right
> > dma_resv_usage slot, and the trick of replacing the vm fence with a
> > tlb flush fence is also a good reason to share the code so we only
> > have it in one place.
> > 
> > Christian's recent work also has some prep work for this already with
> > the fence replacing trick.
> 
> Sure, but this optimization is not required for initial vm_bind support
> to land right? We can look at it soon after that. Is that ok?
> I have made a reference to this TLB flush batching work in the rst file.

Yeah for now we can just rely on the tlb flush we do on vma unbinding,
which also means there's no need for any separate tlb flushing in vm_bind
related code. This was just a thought I dropped on here to make sure we
have a complete picture.
-Daniel


> 
> Niranjana
> 
> > -Daniel
> > 
> > On Thu, 31 Mar 2022 at 10:28, Daniel Vetter <daniel@ffwll.ch> wrote:
> > > Adding a pile of people who've expressed interest in vm_bind for their
> > > drivers.
> > > 
> > > Also note to the intel folks: This is largely written with me having my
> > > subsystem co-maintainer hat on, i.e. what I think is the right thing to do
> > > here for the subsystem at large. There is substantial rework involved
> > > here, but it's not any different from i915 adopting ttm or i915 adopting
> > > drm/sched, and I do think this stuff needs to happen in one form or
> > > another.
> > > 
> > > On Mon, Mar 07, 2022 at 12:31:45PM -0800, Niranjana Vishwanathapura wrote:
> > > > VM_BIND design document with description of intended use cases.
> > > >
> > > > Signed-off-by: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>
> > > > ---
> > > >  Documentation/gpu/rfc/i915_vm_bind.rst | 210 +++++++++++++++++++++++++
> > > >  Documentation/gpu/rfc/index.rst        |   4 +
> > > >  2 files changed, 214 insertions(+)
> > > >  create mode 100644 Documentation/gpu/rfc/i915_vm_bind.rst
> > > >
> > > > diff --git a/Documentation/gpu/rfc/i915_vm_bind.rst b/Documentation/gpu/rfc/i915_vm_bind.rst
> > > > new file mode 100644
> > > > index 000000000000..cdc6bb25b942
> > > > --- /dev/null
> > > > +++ b/Documentation/gpu/rfc/i915_vm_bind.rst
> > > > @@ -0,0 +1,210 @@
> > > > +==========================================
> > > > +I915 VM_BIND feature design and use cases
> > > > +==========================================
> > > > +
> > > > +VM_BIND feature
> > > > +================
> > > > +DRM_I915_GEM_VM_BIND/UNBIND ioctls allow the UMD to bind/unbind GEM buffer
> > > > +objects (BOs) or sections of BOs at specified GPU virtual addresses on
> > > > +a specified address space (VM).
> > > > +
> > > > +These mappings (also referred to as persistent mappings) will be persistent
> > > > +across multiple GPU submissions (execbuff) issued by the UMD, without user
> > > > +having to provide a list of all required mappings during each submission
> > > > +(as required by older execbuff mode).
> > > > +
> > > > +The VM_BIND ioctl defers binding the mappings until the next execbuff submission
> > > > +where it will be required, or immediately if I915_GEM_VM_BIND_IMMEDIATE
> > > > +flag is set (useful if mapping is required for an active context).
> > > 
> > > So this is a screw-up I've done, and for upstream I think we need to fix
> > > it: Implicit sync is bad, and it's also still a bad idea for vm_bind, and
> > > I was wrong suggesting we should do this a few years back when we kicked
> > > this off internally :-(
> > > 
> > > What I think we need is just always VM_BIND_IMMEDIATE mode, and then a few
> > > things on top:
> > > - in and out fences, like with execbuf, to allow userspace to sync with
> > >   execbuf as needed
> > > - for compute-mode context this means userspace memory fences
> > > - for legacy context this means a timeline syncobj in drm_syncobj
> > > 
> > > No sync_file or anything else like this at all. This means a bunch of
> > > work, but also it'll have benefits because it means we should be able to
> > > use exactly the same code paths and logic for both compute and for legacy
> > > context, because drm_syncobj supports future fence semantics.
> > > 
> > > Also on the implementation side we still need to install dma_fence to the
> > > various dma_resv, and for this we need the new dma_resv_usage series from
> > > Christian König first. vm_bind fences can then use the USAGE_BOOKKEEPING
> > > flag to make sure they never result in an oversync issue with execbuf. I
> > > don't think trying to land vm_bind without that prep work in
> > > dma_resv_usage makes sense.
> > > 
> > > Also as soon as dma_resv_usage has landed there's a few cleanups we should
> > > do in i915:
> > > - ttm bo moving code should probably simplify a bit (and maybe more of the
> > >   code should be pushed as helpers into ttm)
> > > - clflush code should be moved over to using USAGE_KERNEL and the various
> > >   hacks and special cases should be ditched. See df94fd05e69e ("drm/i915:
> > >   expand on the kernel-doc for cache_dirty") for a bit more context
> > > 
> > > This is still not yet enough, since if a vm_bind races with an eviction we
> > > might stall on the new buffers being readied first before the context can
> > > continue. This needs some care to make sure that vma which aren't fully
> > > bound yet are on a separate list, and vma which are marked for unbinding
> > > are removed from the main working set list as soon as possible.
> > > 
> > > All of these things are relevant for the uapi semantics, which means
> > > - they need to be documented in the uapi kerneldoc, ideally with example
> > >   flows
> > > - umd need to ack this
> > > 
> > > The other thing here is the async/nonblocking path. I think we still need
> > > that one, but again it should not sync with anything going on in execbuf,
> > > but simply execute the ioctl code in a kernel thread. The idea here is
> > > that this works like a special gpu engine, so that compute and vk can
> > > schedule bindings interleaved with rendering. This should be enough to get
> > > a performant vk sparse binding/textures implementation.
> > > 
> > > But I'm not entirely sure on this one, so this definitely needs acks from
> > > umds.
> > > 
> > > > +VM_BIND feature is advertised to user via I915_PARAM_HAS_VM_BIND.
> > > > +User has to opt-in for VM_BIND mode of binding for an address space (VM)
> > > > +during VM creation time via I915_VM_CREATE_FLAGS_USE_VM_BIND extension.
> > > > +A VM in VM_BIND mode will not support older execbuff mode of binding.
> > > > +
> > > > +UMDs can still send BOs of these persistent mappings in execlist of execbuff
> > > > +for specifying BO dependencies (implicit fencing) and to use BO as a batch,
> > > > +but those BOs should be mapped ahead via vm_bind ioctl.
> > > 
> > > should or must?
> > > 
> > > Also I'm not really sure that's a great interface. The batchbuffer really
> > > only needs to be an address, so maybe all we need is an extension to
> > > supply an u64 batchbuffer address instead of trying to retrofit this into
> > > an unfitting current uapi.
> > > 
> > > And for implicit sync there's two things:
> > > - for vk I think the right uapi is the dma-buf fence import/export ioctls
> > >   from Jason Ekstrand. I think we should land that first instead of
> > >   hacking funny concepts together
> > > - for gl the dma-buf import/export might not be fast enough, since gl
> > >   needs to do a _lot_ of implicit sync. There we might need to use the
> > >   execbuffer buffer list, but then we should have extremely clear uapi
> > >   rules which disallow _everything_ except setting the explicit sync uapi
> > > 
> > > Again all this stuff needs to be documented in detail in the kerneldoc
> > > uapi spec.
> > > 
> > > > +VM_BIND features include:
> > > > +- Multiple Virtual Address (VA) mappings can map to the same physical pages
> > > > +  of an object (aliasing).
> > > > +- VA mapping can map to a partial section of the BO (partial binding).
> > > > +- Support capture of persistent mappings in the dump upon GPU error.
> > > > +- TLB is flushed upon unbind completion. Batching of TLB flushes in some
> > > > +  usecases will be helpful.
> > > > +- Asynchronous vm_bind and vm_unbind support.
> > > > +- VM_BIND uses user/memory fence mechanism for signaling bind completion
> > > > +  and for signaling batch completion in long running contexts (explained
> > > > +  below).
> > > 
> > > This should all be in the kerneldoc.
> > > 
> > > > +VM_PRIVATE objects
> > > > +------------------
> > > > +By default, BOs can be mapped on multiple VMs and can also be dma-buf
> > > > +exported. Hence these BOs are referred to as Shared BOs.
> > > > +During each execbuff submission, the request fence must be added to the
> > > > +dma-resv fence list of all shared BOs mapped on the VM.
> > > > +
> > > > +VM_BIND feature introduces an optimization where user can create BO which
> > > > +is private to a specified VM via I915_GEM_CREATE_EXT_VM_PRIVATE flag during
> > > > +BO creation. Unlike Shared BOs, these VM private BOs can only be mapped on
> > > > +the VM they are private to and can't be dma-buf exported.
> > > > +All private BOs of a VM share the dma-resv object. Hence during each execbuff
> > > > +submission, they need only one dma-resv fence list updated. Thus the fast
> > > > +path (where required mappings are already bound) submission latency is O(1)
> > > > +w.r.t the number of VM private BOs.
> > > 
> > > Two things:
> > > 
> > > - I think the above is required to for initial vm_bind for vk, it kinda
> > >   doesn't make much sense without that, and will allow us to match amdgpu
> > >   and radeonsi
> > > 
> > > - Christian König just landed ttm bulk lru helpers, and I think we need to
> > >   use those. This means vm_bind will only work with the ttm backend, but
> > >   that's what we have for the big dgpu where vm_bind helps more in terms
> > >   of performance, and the igfx conversion to ttm is already going on.
> > > 
> > > Furthermore the i915 shrinker lru has stopped being an lru, so I think
> > > that should also be moved over to the ttm lru in some fashion to make sure
> > > we once again have a reasonable and consistent memory aging and reclaim
> > > architecture. The current code is just too much of a complete mess.
> > > 
> > > And since this is all fairly integral to how the code arch works I don't
> > > think merging a different version which isn't based on ttm bulk lru
> > > helpers makes sense.
> > > 
> > > Also I do think the page table lru handling needs to be included here,
> > > because that's another complete hand-rolled separate world for not much
> > > good reasons. I guess that can happen in parallel with the initial vm_bind
> > > bring-up, but it needs to be completed by the time we add the features
> > > beyond the initial support needed for vk.
> > > 
> > > > +VM_BIND locking hierarchy
> > > > +-------------------------
> > > > +VM_BIND locking order is as below.
> > > > +
> > > > +1) A vm_bind mutex will protect vm_bind lists. This lock is taken in vm_bind/
> > > > +   vm_unbind ioctl calls, in the execbuff path and while releasing the mapping.
> > > > +
> > > > +   In future, when GPU page faults are supported, we can potentially use a
> > > > +   rwsem instead, so that multiple pagefault handlers can take the read side
> > > > +   lock to lookup the mapping and hence can run in parallel.
> > > > +
> > > > +2) The BO's dma-resv lock will protect i915_vma state and needs to be held
> > > > +   while binding a vma and while updating dma-resv fence list of a BO.
> > > > +   The private BOs of a VM will all share a dma-resv object.
> > > > +
> > > > +   This lock is held in vm_bind call for immediate binding, during vm_unbind
> > > > +   call for unbinding and during execbuff path for binding the mapping and
> > > > +   updating the dma-resv fence list of the BO.
> > > > +
> > > > +3) Spinlock/s to protect some of the VM's lists.
> > > > +
> > > > +We will also need support for bulk LRU movement of persistent mappings to
> > > > +avoid additional latencies in the execbuff path.
> > > 
> > > This needs more detail and explanation of how each level is required. Also
> > > the shared dma_resv for VM_PRIVATE objects is kinda important to explain.
> > > 
> > > Like "some of the VM's lists" explains pretty much nothing.
> > > 
> > > > +
> > > > +GPU page faults
> > > > +----------------
> > > > +Both older execbuff mode and the newer VM_BIND mode of binding will require
> > > > +using dma-fence to ensure residency.
> > > > +In future when GPU page faults are supported, no dma-fence usage is required
> > > > +as residency is purely managed by installing and removing/invalidating ptes.
> > > 
> > > This is a bit confusing. I think one part of this should be moved into the
> > > section with future vm_bind use-cases (we're not going to support page
> > > faults with legacy softpin or even worse, relocations). The locking
> > > discussion should be part of the much longer list of uses cases that
> > > motivate the locking design.
> > > 
> > > > +
> > > > +
> > > > +User/Memory Fence
> > > > +==================
> > > > +The idea is to take a user-specified virtual address and install an interrupt
> > > > +handler to wake up the current task when the memory location passes the
> > > > +user-supplied filter.
> > > > +
> > > > +A user/memory fence is an <address, value> pair. To signal the user fence,
> > > > +the specified value is written at the specified virtual address and the
> > > > +waiting process is woken up. The user can wait on a user fence with the
> > > > +gem_wait_user_fence ioctl.
> > > > +
> > > > +It also allows the user to emit their own MI_FLUSH/PIPE_CONTROL notify
> > > > +interrupt within their batches after updating the value to have sub-batch
> > > > +precision on the wakeup. Each batch can signal a user fence to indicate
> > > > +the completion of next level batch. The completion of very first level batch
> > > > +needs to be signaled by the command streamer. The user must provide the
> > > > +user/memory fence for this via the DRM_I915_GEM_EXECBUFFER_EXT_USER_FENCE
> > > > +extension of execbuff ioctl, so that KMD can setup the command streamer to
> > > > +signal it.
> > > > +
> > > > +User/Memory fence can also be supplied to the kernel driver to signal/wake up
> > > > +the user process after completion of an asynchronous operation.
> > > > +
> > > > +When the VM_BIND ioctl is provided with a user/memory fence via the
> > > > +I915_VM_BIND_EXT_USER_FENCE extension, it will be signaled upon the completion
> > > > +of binding of that mapping. All async binds/unbinds are serialized, hence
> > > > +signaling of a user/memory fence also indicates the completion of all previous
> > > > +binds/unbinds.
> > > > +
> > > > +This feature will be derived from the below original work:
> > > > +https://patchwork.freedesktop.org/patch/349417/
> > > 
> > > This is 1:1 tied to long running compute mode contexts (which in the uapi
> > > doc must reference the endless amounts of bikeshed summary we have in the
> > > docs about indefinite fences).
> > > 
> > > I'd put this into a new section about compute and userspace memory fences
> > > support, with this and the next chapter ...
> > > > +
> > > > +
> > > > +VM_BIND use cases
> > > > +==================
> > > 
> > > ... and then make this section here focus entirely on additional vm_bind
> > > use-cases that we'll be adding later on. Which doesn't need to go into any
> > > details, it's just justification for why we want to build the world on top
> > > of vm_bind.
> > > 
> > > > +
> > > > +Long running Compute contexts
> > > > +------------------------------
> > > > +dma-fence usage expects that fences complete in a reasonable amount of time.
> > > > +Compute on the other hand can be long running. Hence it is appropriate for
> > > > +compute to use user/memory fence and dma-fence usage will be limited to
> > > > +in-kernel consumption only. This requires an execbuff uapi extension to pass
> > > > +in user fence. Compute must opt-in for this mechanism with
> > > > +I915_CONTEXT_CREATE_FLAGS_LONG_RUNNING flag during context creation.
> > > > +
> > > > +The dma-fence based user interfaces like gem_wait ioctl, execbuff out fence
> > > > +and implicit dependency setting is not allowed on long running contexts.
> > > > +
> > > > +Where GPU page faults are not available, the kernel driver, upon buffer
> > > > +invalidation, will initiate a suspend (preemption) of the long running
> > > > +context with a dma-fence
> > > > +attached to it. And upon completion of that suspend fence, finish the
> > > > +invalidation, revalidate the BO and then resume the compute context. This is
> > > > +done by having a per-context fence (called suspend fence) proxying as
> > > > +i915_request fence. This suspend fence is enabled when there is a wait on it,
> > > > +which triggers the context preemption.
> > > > +
> > > > +This is much easier to support with VM_BIND compared to the current heavier
> > > > +execbuff path resource attachment.
> > > 
> > > There's a bunch of tricky code around compute mode context support, like
> > > the preempt ctx fence (or suspend fence or whatever you want to call it),
> > > and the resume work. And I think that code should be shared across
> > > drivers.
> > > 
> > > I think the right place to put this is into drm/sched, somewhere attached
> > > to the drm_sched_entity structure. I expect i915 folks to collaborate with
> > > amd and ideally also get amdkfd to adopt the same thing if possible. At
> > > least Christian has mentioned in the past that he's a bit unhappy about
> > > how this works.
> > > 
> > > Also drm/sched has dependency tracking, which will be needed to pipeline
> > > context resume operations. That needs to be used instead of i915-gem
> > > inventing yet another dependency tracking data structure (it already has 3
> > > and that's roughly 3 too many).
> > > 
> > > This means compute mode support and userspace memory fences are blocked on
> > > the drm/sched conversion, but *eh* add it to the list of reasons for why
> > > drm/sched needs to happen.
> > > 
> > > Also since we only have support for compute mode ctx in our internal tree
> > > with the guc scheduler backend anyway, and the first conversion target is
> > > the guc backend, I don't think this actually holds up a lot of the code.
> > > 
> > > > +Low Latency Submission
> > > > +-----------------------
> > > > +Allows compute UMD to directly submit GPU jobs instead of through execbuff
> > > > +ioctl. VM_BIND allows map/unmap of BOs required for directly submitted jobs.
> > > 
> > > This is really just a special case of compute mode contexts, I think I'd
> > > include that in there, but explain better what it requires (i.e. vm_bind
> > > not being synchronized against execbuf).
> > > 
> > > > +
> > > > +Debugger
> > > > +---------
> > > > +With the debug event interface, a user space process (debugger) is able to
> > > > +keep track of and act upon resources created by another process (debuggee)
> > > > +and attached to the GPU via the vm_bind interface.
> > > > +
> > > > +Mesa/Vulkan
> > > > +------------
> > > > +VM_BIND can potentially reduce the CPU overhead in Mesa, thus improving
> > > > +performance. For Vulkan it should be straightforward to use VM_BIND.
> > > > +For Iris, implicit buffer tracking must be implemented before we can harness
> > > > +VM_BIND benefits. With increasing GPU hardware performance reducing CPU
> > > > +overhead becomes more important.
> > > 
> > > Just to clarify, I don't think we can land vm_bind into upstream if it
> > > doesn't work 100% for vk. There's a bit much "can" instead of "will in
> > > this section".
> > > 
> > > > +
> > > > +Page level hints settings
> > > > +--------------------------
> > > > +VM_BIND allows any hints setting per mapping instead of per BO.
> > > > +Possible hints include read-only, placement and atomicity.
> > > > +Sub-BO level placement hint will be even more relevant with
> > > > +upcoming GPU on-demand page fault support.
> > > > +
> > > > +Page level Cache/CLOS settings
> > > > +-------------------------------
> > > > +VM_BIND allows cache/CLOS settings per mapping instead of per BO.
> > > > +
> > > > +Shared Virtual Memory (SVM) support
> > > > +------------------------------------
> > > > +VM_BIND interface can be used to map system memory directly (without gem BO
> > > > +abstraction) using the HMM interface.
> > > 
> > > Userptr is absent here (and it's not the same as svm, at least on
> > > discrete), and this is needed for the initial version since otherwise vk
> > > can't use it because we're not at feature parity.
> > > 
> > > Irc discussions by Maarten and Dave came up with the idea that maybe
> > > userptr for vm_bind should work _without_ any gem bo as backing storage,
> > > since that guarantees that people don't come up with funny ideas like
> > > trying to share such bo across process or mmap it and other nonsense which
> > > just doesn't work.
> > > 
> > > > +
> > > > +
> > > > +Broader i915 cleanups
> > > > +=====================
> > > > +Supporting this whole new vm_bind mode of binding which comes with its own
> > > > +usecases to support and the locking requirements requires proper integration
> > > > +with the existing i915 driver. This calls for some broader i915 driver
> > > > +cleanups/simplifications for maintainability of the driver going forward.
> > > > +Here are few things identified and are being looked into.
> > > > +
> > > > +- Make pagetable allocations evictable and manage them similar to VM_BIND
> > > > +  mapped objects. Page table pages are similar to persistent mappings of a
> > > > +  VM (the differences here are that the page table pages will not
> > > > +  have an i915_vma structure and, after swapping pages back in, the parent
> > > > +  page link needs to be updated).
> > > 
> > > See above, but I think this should be included as part of the initial
> > > vm_bind push.
> > > 
> > > > +- Remove the vma lookup cache (eb->gem_context->handles_vma). The VM_BIND
> > > > +  feature does not use it, and the complexity it brings in is probably more
> > > > +  than the performance advantage we get in the legacy execbuff case.
> > > > +- Remove vma->open_count counting
> > > > +- Remove i915_vma active reference tracking. Instead use underlying BO's
> > > > +  dma-resv fence list to determine if a i915_vma is active or not.
> > > 
> > > So this is a complete mess, and really should not exist. I think it needs
> > > to be removed before we try to make i915_vma even more complex by adding
> > > vm_bind.
> > > 
> > > The other thing I've been pondering here is that vm_bind is really
> > > completely different from legacy vm structures for a lot of reasons:
> > > - no relocation or softpin handling, which means vm_bind has no reason to
> > >   ever look at the i915_vma structure in execbuf code. Unfortunately
> > >   execbuf has been rewritten to be vma instead of obj centric, so it's a
> > >   100% mismatch
> > > 
> > > - vm_bind never has to manage any vm lru. Legacy execbuf has to maintain
> > >   that because the kernel manages the virtual address space fully. Again
> > >   ideally that entire vma_move_to_active code and everything related to it
> > >   would simply not exist.
> > > 
> > > - similar on the eviction side, the rules are quite different: For vm_bind
> > >   we never tear down the vma, instead it's just moved to the list of
> > >   evicted vma. Legacy vm have no need for all these additional lists, so
> > >   another huge confusion.
> > > 
> > > - if the refcount is done correctly for vm_bind we wouldn't need the
> > >   tricky code in the bo close paths. Unfortunately legacy vm with
> > >   relocations and softpin require that vma are only a weak reference, so
> > >   that cannot be removed.
> > > 
> > > - there's also a ton of special cases for ggtt handling, like the
> > >   different views (for display, partial views for mmap), but also the
> > >   gen2/3 alignment and padding requirements which vm_bind never needs.
> > > 
> > > I think the right thing here is to massively split the implementation
> > > behind some solid vm/vma abstraction, with a base class for vm and vma
> > > which _only_ has the pieces which both vm_bind and the legacy vm stuff
> > > needs. But it's a bit tricky to get there. I think a workable path would
> > > be:
> > > - Add a new base class to both i915_address_space and i915_vma, which
> > >   starts out empty.
> > > 
> > > - As vm_bind code lands, move things that vm_bind code needs into these
> > >   base classes
> > > 
> > > - The goal should be that these base classes are a stand-alone library
> > >   that other drivers could reuse. Like we've done with the buddy
> > >   allocator, which first moved from i915-gem to i915-ttm, and which amd
> > >   now moved to drm/ttm for reuse by amdgpu. Ideally other drivers
> > >   interested in adding something like vm_bind should be involved from the
> > >   start (or maybe the entire thing reused in amdgpu, they're looking at
> > >   vk sparse binding support too or at least have perf issues I think).
> > > 
> > > - Locking must be the same across all implementations, otherwise it's
> > >   really not an abstraction. i915 screwed this up terribly by having
> > >   different locking rules for ppgtt and ggtt, which is just nonsense.
> > > 
> > > - The legacy specific code needs to be extracted as much as possible and
> > >   shoved into separate files. In execbuf this means we need to get back to
> > >   object centric flow, and the slowpaths need to become a lot simpler
> > >   again (Maarten has cleaned up some of this, but there's still a silly
> > >   amount of hacks in there with funny layering).
> > > 
> > > - I think if stuff like the vma eviction details (list movement and
> > >   locking and refcounting of the underlying object)
> > > 
> > > > +
> > > > +These can be worked upon after initial vm_bind support is added.
> > > 
> > > I don't think that works, given how badly i915-gem team screwed up in
> > > other places. And those places had to be fixed by adopting shared code
> > > like ttm. Plus there's already a huge unfulfilled promise pending with the
> > > drm/sched conversion, i915-gem team is clearly deeply in the red here :-/
> > > 
> > > Cheers, Daniel
> > > 
> > > > +
> > > > +
> > > > +UAPI
> > > > +=====
> > > > +UAPI definition can be found here:
> > > > +.. kernel-doc:: Documentation/gpu/rfc/i915_vm_bind.h
> > > > diff --git a/Documentation/gpu/rfc/index.rst b/Documentation/gpu/rfc/index.rst
> > > > index 91e93a705230..7d10c36b268d 100644
> > > > --- a/Documentation/gpu/rfc/index.rst
> > > > +++ b/Documentation/gpu/rfc/index.rst
> > > > @@ -23,3 +23,7 @@ host such documentation:
> > > >  .. toctree::
> > > >
> > > >      i915_scheduler.rst
> > > > +
> > > > +.. toctree::
> > > > +
> > > > +    i915_vm_bind.rst
> > > > --
> > > > 2.21.0.rc0.32.g243a4c7e27
> > > >
> > > 
> > > --
> > > Daniel Vetter
> > > Software Engineer, Intel Corporation
> > > http://blog.ffwll.ch
> > 
> > 
> > 
> > -- 
> > Daniel Vetter
> > Software Engineer, Intel Corporation
> > http://blog.ffwll.ch

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC v2 1/2] drm/doc/rfc: VM_BIND feature design document
  2022-04-20 22:45       ` [Intel-gfx] " Niranjana Vishwanathapura
@ 2022-04-27 15:41         ` Niranjana Vishwanathapura
  -1 siblings, 0 replies; 31+ messages in thread
From: Niranjana Vishwanathapura @ 2022-04-27 15:41 UTC (permalink / raw)
  To: Daniel Vetter
  Cc: Dave Airlie, intel-gfx, dri-devel, Bloomfield, Jon,
	chris.p.wilson, Jason Ekstrand, daniel.vetter, thomas.hellstrom,
	Christian König, Ben Skeggs

On Wed, Apr 20, 2022 at 03:45:25PM -0700, Niranjana Vishwanathapura wrote:
>On Thu, Mar 31, 2022 at 10:28:48AM +0200, Daniel Vetter wrote:
>>Adding a pile of people who've expressed interest in vm_bind for their
>>drivers.
>>
>>Also note to the intel folks: This is largely written with me having my
>>subsystem co-maintainer hat on, i.e. what I think is the right thing to do
>>here for the subsystem at large. There is substantial rework involved
>>here, but it's not any different from i915 adopting ttm or i915 adopting
>>drm/sched, and I do think this stuff needs to happen in one form or
>>another.
>>
>>On Mon, Mar 07, 2022 at 12:31:45PM -0800, Niranjana Vishwanathapura wrote:
>>>VM_BIND design document with description of intended use cases.
>>>
>>>Signed-off-by: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>
>>>---
>>> Documentation/gpu/rfc/i915_vm_bind.rst | 210 +++++++++++++++++++++++++
>>> Documentation/gpu/rfc/index.rst        |   4 +
>>> 2 files changed, 214 insertions(+)
>>> create mode 100644 Documentation/gpu/rfc/i915_vm_bind.rst
>>>
>>>diff --git a/Documentation/gpu/rfc/i915_vm_bind.rst b/Documentation/gpu/rfc/i915_vm_bind.rst
>>>new file mode 100644
>>>index 000000000000..cdc6bb25b942
>>>--- /dev/null
>>>+++ b/Documentation/gpu/rfc/i915_vm_bind.rst
>>>@@ -0,0 +1,210 @@
>>>+==========================================
>>>+I915 VM_BIND feature design and use cases
>>>+==========================================
>>>+
>>>+VM_BIND feature
>>>+================
>>>+DRM_I915_GEM_VM_BIND/UNBIND ioctls allow UMDs to bind/unbind GEM buffer
>>>+objects (BOs) or sections of BOs at specified GPU virtual addresses on
>>>+a specified address space (VM).
>>>+
>>>+These mappings (also referred to as persistent mappings) will be persistent
>>>+across multiple GPU submissions (execbuff) issued by the UMD, without the
>>>+user having to provide a list of all required mappings during each
>>>+submission (as required by the older execbuff mode).
>>>+
>>>+The VM_BIND ioctl defers binding the mappings until the next execbuff
>>>+submission where they are required, or binds immediately if the
>>>+I915_GEM_VM_BIND_IMMEDIATE flag is set (useful if a mapping is required
>>>+for an active context).
>>
>>So this is a screw-up I've done, and for upstream I think we need to fix
>>it: Implicit sync is bad, and it's also still a bad idea for vm_bind, and
>>I was wrong suggesting we should do this a few years back when we kicked
>>this off internally :-(
>>
>>What I think we need is just always VM_BIND_IMMEDIATE mode, and then a few
>>things on top:
>>- in and out fences, like with execbuf, to allow userspace to sync with
>> execbuf as needed
>>- for compute-mode context this means userspace memory fences
>>- for legacy context this means a timeline syncobj in drm_syncobj
>>
>>No sync_file or anything else like this at all. This means a bunch of
>>work, but also it'll have benefits because it means we should be able to
>>use exactly the same code paths and logic for both compute and for legacy
>>context, because drm_syncobj supports future fence semantics.
>>
>
>Thanks Daniel,
>Ok, will update
>

I had a long conversation with Daniel on some of the points discussed here.
Thanks to Daniel for clarifying many points here.

Here is the summary of the discussion.

1) A prep patch is needed to update documentation of some existing uapi and this
   new VM_BIND uapi can update/refer to that.
   I will include this prep patch in the next revision of this RFC series.
   Will also include the uapi header file in the rst file so that it gets rendered.

2) Will update documentation here with proper use of dma_resv_usage while adding
   fences to vm_bind objects. It is going to be DMA_RESV_USAGE_BOOKKEEP by
   default, unless overridden with an execlist in the execbuff path.

3) Add extension to execbuff ioctl to specify batch buffer as GPU virtual address
   instead of having to pass it as a BO handle in execlist. This will also make the
   execlist usage solely for implicit sync setting which is further discussed below.

4) Need to look into when Jason's dma-buf fence import/export ioctl support will
   land and whether it will be used both for vl and gl. Need to sync with Jason on this.
   Probably the better option here would be to not support execlist in execbuff path in
   vm_bind mode for initial vm_bind support (hoping Jason's dma-buf fence import/export
   ioctl will be enough). We can add support for execlist in execbuff for vm_bind mode
   later if required (say for gl).

5) There are a lot of things in the execbuff path that don't apply in VM_BIND mode (like
   relocations, implicit sync, etc.). Separate them out by using function pointers wherever
   the functionality differs between current design and the newer VM_BIND design.

6) Separate out i915_vma active reference counting in execbuff path and do not use it in
   VM_BIND mode. Instead use dma-resv fence checking for VM_BIND mode. This should be easier
   to get it working with the current TTM backend (which initial VM_BIND support will use).
   And remove i915_vma active reference counting fully while supporting TTM backend for igfx.

7) As we support compute mode contexts only with GuC scheduler backend and compute mode requires
   support for suspend and resume of contexts, it will have a dependency on i915 drm scheduler
   conversion.

Will revise this series accordingly.

Thanks,
Niranjana

>>Also on the implementation side we still need to install dma_fence to the
>>various dma_resv, and for this we need the new dma_resv_usage series from
>>Christian König first. vm_bind fences can then use the USAGE_BOOKKEEPING
>>flag to make sure they never result in an oversync issue with execbuf. I
>>don't think trying to land vm_bind without that prep work in
>>dma_resv_usage makes sense.
>>
>
>Ok, but that is not a dependency for this VM_BIND design RFC patch right?
>I will add this to the documentation here.
>
>>Also as soon as dma_resv_usage has landed there's a few cleanups we should
>>do in i915:
>>- ttm bo moving code should probably simplify a bit (and maybe more of the
>> code should be pushed as helpers into ttm)
>>- clflush code should be moved over to using USAGE_KERNEL and the various
>> hacks and special cases should be ditched. See df94fd05e69e ("drm/i915:
>> expand on the kernel-doc for cache_dirty") for a bit more context
>>
>>This is still not yet enough, since if a vm_bind races with an eviction we
>>might stall on the new buffers being readied first before the context can
>>continue. This needs some care to make sure that vma which aren't fully
>>bound yet are on a separate list, and vma which are marked for unbinding
>>are removed from the main working set list as soon as possible.
>>
>>All of these things are relevant for the uapi semantics, which means
>>- they need to be documented in the uapi kerneldoc, ideally with example
>> flows
>>- umd need to ack this
>>
>
>Ok
>
>>The other thing here is the async/nonblocking path. I think we still need
>>that one, but again it should not sync with anything going on in execbuf,
>>but simply execute the ioctl code in a kernel thread. The idea here is
>>that this works like a special gpu engine, so that compute and vk can
>>schedule bindings interleaved with rendering. This should be enough to get
>>a performant vk sparse binding/textures implementation.
>>
>>But I'm not entirely sure on this one, so this definitely needs acks from
>>umds.
>>
>>>+VM_BIND feature is advertised to user via I915_PARAM_HAS_VM_BIND.
>>>+User has to opt-in for VM_BIND mode of binding for an address space (VM)
>>>+during VM creation time via I915_VM_CREATE_FLAGS_USE_VM_BIND extension.
>>>+A VM in VM_BIND mode will not support older execbuff mode of binding.
>>>+
>>>+UMDs can still send BOs of these persistent mappings in the execlist of execbuff
>>>+for specifying BO dependencies (implicit fencing) and to use a BO as a batch,
>>>+but those BOs should be mapped ahead via the vm_bind ioctl.
>>
>>should or must?
>>
>
>Must, will fix.
>
>>Also I'm not really sure that's a great interface. The batchbuffer really
>>only needs to be an address, so maybe all we need is an extension to
>> supply a u64 batchbuffer address instead of trying to retrofit this into
>>an unfitting current uapi.
>>
>
>Yeah, this was considered, but it was decided to do it as a later optimization.
>But if we were to remove execlist entries completely (ie., no implicit
>sync also), then we need to do this from the beginning.
>
>>And for implicit sync there's two things:
>>- for vk I think the right uapi is the dma-buf fence import/export ioctls
>> from Jason Ekstrand. I think we should land that first instead of
>> hacking funny concepts together
>
>I did not understand fully, can you point to it?
>
>>- for gl the dma-buf import/export might not be fast enough, since gl
>> needs to do a _lot_ of implicit sync. There we might need to use the
>> execbuffer buffer list, but then we should have extremely clear uapi
>> rules which disallow _everything_ except setting the explicit sync uapi
>>
>
>Ok, so then, we still need to support implicit sync in vm_bind mode. Right?
>
>>Again all this stuff needs to be documented in detail in the kerneldoc
>>uapi spec.
>>
>
>ok
>
>>>+VM_BIND features include:
>>>+- Multiple Virtual Address (VA) mappings can map to the same physical pages
>>>+  of an object (aliasing).
>>>+- VA mapping can map to a partial section of the BO (partial binding).
>>>+- Support capture of persistent mappings in the dump upon GPU error.
>>>+- TLB is flushed upon unbind completion. Batching of TLB flushes in some
>>>+  usecases will be helpful.
>>>+- Asynchronous vm_bind and vm_unbind support.
>>>+- VM_BIND uses user/memory fence mechanism for signaling bind completion
>>>+  and for signaling batch completion in long running contexts (explained
>>>+  below).
>>
>>This should all be in the kerneldoc.
>>
>
>ok
>
>>>+VM_PRIVATE objects
>>>+------------------
>>>+By default, BOs can be mapped on multiple VMs and can also be dma-buf
>>>+exported. Hence these BOs are referred to as Shared BOs.
>>>+During each execbuff submission, the request fence must be added to the
>>>+dma-resv fence list of all shared BOs mapped on the VM.
>>>+
>>>+VM_BIND feature introduces an optimization where user can create BO which
>>>+is private to a specified VM via I915_GEM_CREATE_EXT_VM_PRIVATE flag during
>>>+BO creation. Unlike Shared BOs, these VM private BOs can only be mapped on
>>>+the VM they are private to and can't be dma-buf exported.
>>>+All private BOs of a VM share the dma-resv object. Hence during each execbuff
>>>+submission, they need only one dma-resv fence list updated. Thus the fast
>>>+path (where required mappings are already bound) submission latency is O(1)
>>>+w.r.t the number of VM private BOs.
>>
>>Two things:
>>
>>- I think the above is required to for initial vm_bind for vk, it kinda
>> doesn't make much sense without that, and will allow us to match amdgpu
>> and radeonsi
>>
>>- Christian König just landed ttm bulk lru helpers, and I think we need to
>> use those. This means vm_bind will only work with the ttm backend, but
>> that's what we have for the big dgpu where vm_bind helps more in terms
>> of performance, and the igfx conversion to ttm is already going on.
>>
>
>ok
>
>>Furthermore the i915 shrinker lru has stopped being an lru, so I think
>>that should also be moved over to the ttm lru in some fashion to make sure
>>we once again have a reasonable and consistent memory aging and reclaim
>>architecture. The current code is just too much of a complete mess.
>>
>>And since this is all fairly integral to how the code arch works I don't
>>think merging a different version which isn't based on ttm bulk lru
>>helpers makes sense.
>>
>>Also I do think the page table lru handling needs to be included here,
>>because that's another complete hand-rolled separate world for not much
>>good reasons. I guess that can happen in parallel with the initial vm_bind
>>bring-up, but it needs to be completed by the time we add the features
>>beyond the initial support needed for vk.
>>
>
>Ok
>
>>>+VM_BIND locking hierarchy
>>>+-------------------------
>>>+VM_BIND locking order is as below.
>>>+
>>>+1) A vm_bind mutex will protect vm_bind lists. This lock is taken in vm_bind/
>>>+   vm_unbind ioctl calls, in the execbuff path and while releasing the mapping.
>>>+
>>>+   In future, when GPU page faults are supported, we can potentially use a
>>>+   rwsem instead, so that multiple pagefault handlers can take the read side
>>>+   lock to lookup the mapping and hence can run in parallel.
>>>+
>>>+2) The BO's dma-resv lock will protect i915_vma state and needs to be held
>>>+   while binding a vma and while updating dma-resv fence list of a BO.
>>>+   The private BOs of a VM will all share a dma-resv object.
>>>+
>>>+   This lock is held in vm_bind call for immediate binding, during vm_unbind
>>>+   call for unbinding and during execbuff path for binding the mapping and
>>>+   updating the dma-resv fence list of the BO.
>>>+
>>>+3) Spinlock/s to protect some of the VM's lists.
>>>+
>>>+We will also need support for bulk LRU movement of persistent mappings to
>>>+avoid additional latencies in the execbuff path.
>>
>>This needs more detail and explanation of how each level is required. Also
>>the shared dma_resv for VM_PRIVATE objects is kinda important to explain.
>>
>>Like "some of the VM's lists" explains pretty much nothing.
>>
>
>Ok, will explain.
>
>>>+
>>>+GPU page faults
>>>+----------------
>>>+Both older execbuff mode and the newer VM_BIND mode of binding will require
>>>+using dma-fence to ensure residency.
>>>+In future when GPU page faults are supported, no dma-fence usage is required
>>>+as residency is purely managed by installing and removing/invalidating ptes.
>>
>>This is a bit confusing. I think one part of this should be moved into the
>>section with future vm_bind use-cases (we're not going to support page
>>faults with legacy softpin or even worse, relocations). The locking
>>discussion should be part of the much longer list of uses cases that
>>motivate the locking design.
>>
>
>Ok, will move.
>
>>>+
>>>+
>>>+User/Memory Fence
>>>+==================
>>>+The idea is to take a user specified virtual address and install an interrupt
>>>+handler to wake up the current task when the memory location passes the user
>>>+supplied filter.
>>>+
>>>+A User/Memory fence is an <address, value> pair. To signal the user fence,
>>>+the specified value will be written at the specified virtual address and
>>>+the waiting process will be woken up. The user can wait on a user fence with
>>>+the gem_wait_user_fence ioctl.
>>>+
>>>+It also allows the user to emit their own MI_FLUSH/PIPE_CONTROL notify
>>>+interrupt within their batches after updating the value to have sub-batch
>>>+precision on the wakeup. Each batch can signal a user fence to indicate
>>>+the completion of the next level batch. The completion of the very first level batch
>>>+needs to be signaled by the command streamer. The user must provide the
>>>+user/memory fence for this via the DRM_I915_GEM_EXECBUFFER_EXT_USER_FENCE
>>>+extension of execbuff ioctl, so that KMD can setup the command streamer to
>>>+signal it.
>>>+
>>>+User/Memory fence can also be supplied to the kernel driver to signal/wake up
>>>+the user process after completion of an asynchronous operation.
>>>+
>>>+When VM_BIND ioctl was provided with a user/memory fence via the
>>>+I915_VM_BIND_EXT_USER_FENCE extension, it will be signaled upon the completion
>>>+of binding of that mapping. All async binds/unbinds are serialized, hence
>>>+signaling of the user/memory fence also indicates the completion of all
>>>+previous binds/unbinds.
>>>+
>>>+This feature will be derived from the below original work:
>>>+https://patchwork.freedesktop.org/patch/349417/
>>
>>This is 1:1 tied to long running compute mode contexts (which in the uapi
>>doc must reference the endless amounts of bikeshed summary we have in the
>>docs about indefinite fences).
>>
>
>Ok, will check and add reference.
>
>>I'd put this into a new section about compute and userspace memory fences
>>support, with this and the next chapter ...
>
>ok
>
>>>+
>>>+
>>>+VM_BIND use cases
>>>+==================
>>
>>... and then make this section here focus entirely on additional vm_bind
>>use-cases that we'll be adding later on. Which doesn't need to go into any
>>details, it's just justification for why we want to build the world on top
>>of vm_bind.
>>
>
>ok
>
>>>+
>>>+Long running Compute contexts
>>>+------------------------------
>>>+Usage of dma-fence expects that it completes in a reasonable amount of time.
>>>+Compute, on the other hand, can be long running. Hence it is appropriate for
>>>+compute to use user/memory fence and dma-fence usage will be limited to
>>>+in-kernel consumption only. This requires an execbuff uapi extension to pass
>>>+in user fence. Compute must opt-in for this mechanism with
>>>+I915_CONTEXT_CREATE_FLAGS_LONG_RUNNING flag during context creation.
>>>+
>>>+The dma-fence based user interfaces like gem_wait ioctl, execbuff out fence
>>>+and implicit dependency setting is not allowed on long running contexts.
>>>+
>>>+Where GPU page faults are not available, the kernel driver, upon buffer
>>>+invalidation, will initiate a suspend (preemption) of the long running context with a dma-fence
>>>+attached to it. And upon completion of that suspend fence, finish the
>>>+invalidation, revalidate the BO and then resume the compute context. This is
>>>+done by having a per-context fence (called suspend fence) proxying as
>>>+i915_request fence. This suspend fence is enabled when there is a wait on it,
>>>+which triggers the context preemption.
>>>+
>>>+This is much easier to support with VM_BIND compared to the current heavier
>>>+execbuff path resource attachment.
>>
>>There's a bunch of tricky code around compute mode context support, like
>>the preempt ctx fence (or suspend fence or whatever you want to call it),
>>and the resume work. And I think that code should be shared across
>>drivers.
>>
>>I think the right place to put this is into drm/sched, somewhere attached
>>to the drm_sched_entity structure. I expect i915 folks to collaborate with
>>amd and ideally also get amdkfd to adopt the same thing if possible. At
>>least Christian has mentioned in the past that he's a bit unhappy about
>>how this works.
>>
>>Also drm/sched has dependency tracking, which will be needed to pipeline
>>context resume operations. That needs to be used instead of i915-gem
>>inventing yet another dependency tracking data structure (it already has 3
>>and that's roughly 3 too many).
>>
>>This means compute mode support and userspace memory fences are blocked on
>>the drm/sched conversion, but *eh* add it to the list of reasons for why
>>drm/sched needs to happen.
>>
>>Also since we only have support for compute mode ctx in our internal tree
>>with the guc scheduler backend anyway, and the first conversion target is
>>the guc backend, I don't think this actually holds up a lot of the code.
>>
>
>Hmm...ok. Currently, the context suspend and resume operations in our
>internal tree are through an orthogonal guc interface (not through the scheduler).
>So, I need to look more into this part.
>
>>>+Low Latency Submission
>>>+-----------------------
>>>+Allows compute UMD to directly submit GPU jobs instead of through execbuff
>>>+ioctl. VM_BIND allows map/unmap of BOs required for directly submitted jobs.
>>
>>This is really just a special case of compute mode contexts, I think I'd
>>include that in there, but explain better what it requires (i.e. vm_bind
>>not being synchronized against execbuf).
>>
>
>ok
>
>>>+
>>>+Debugger
>>>+---------
>>>+With debug event interface user space process (debugger) is able to keep track
>>>+of and act upon resources created by another process (debuggee) and attached
>>>+to GPU via vm_bind interface.
>>>+
>>>+Mesa/Vulkan
>>>+------------
>>>+VM_BIND can potentially reduce the CPU overhead in Mesa, thus improving
>>>+performance. For Vulkan it should be straightforward to use VM_BIND.
>>>+For Iris implicit buffer tracking must be implemented before we can harness
>>>+VM_BIND benefits. With increasing GPU hardware performance reducing CPU
>>>+overhead becomes more important.
>>
>>Just to clarify, I don't think we can land vm_bind into upstream if it
>>doesn't work 100% for vk. There's a bit much "can" instead of "will in
>>this section".
>>
>
>ok, will explain better.
>
>>>+
>>>+Page level hints settings
>>>+--------------------------
>>>+VM_BIND allows any hints setting per mapping instead of per BO.
>>>+Possible hints include read-only, placement and atomicity.
>>>+Sub-BO level placement hint will be even more relevant with
>>>+upcoming GPU on-demand page fault support.
>>>+
>>>+Page level Cache/CLOS settings
>>>+-------------------------------
>>>+VM_BIND allows cache/CLOS settings per mapping instead of per BO.
>>>+
>>>+Shared Virtual Memory (SVM) support
>>>+------------------------------------
>>>+VM_BIND interface can be used to map system memory directly (without gem BO
>>>+abstraction) using the HMM interface.
>>
>>Userptr is absent here (and it's not the same as svm, at least on
>>discrete), and this is needed for the initial version since otherwise vk
>>can't use it because we're not at feature parity.
>>
>
>userptr gem objects are supported in initial version (and yes it is not
>same as SVM). I did not add it here as there is no additional uapi
>change required to support that.
>
>>Irc discussions by Maarten and Dave came up with the idea that maybe
>>userptr for vm_bind should work _without_ any gem bo as backing storage,
>>since that guarantees that people don't come up with funny ideas like
>>trying to share such bo across process or mmap it and other nonsense which
>>just doesn't work.
>>
>
>Hmm...there is no plan to support userptr _without_ a gem bo, at least not in
>the initial vm_bind support. Is it Ok to put it in the 'futures' section?
>
>>>+
>>>+
>>>+Broader i915 cleanups
>>>+=====================
>>>+Supporting this whole new vm_bind mode of binding, which comes with its own
>>>+use cases and locking requirements, requires proper integration
>>>+with the existing i915 driver. This calls for some broader i915 driver
>>>+cleanups/simplifications for maintainability of the driver going forward.
>>>+Here are a few things identified that are being looked into.
>>>+
>>>+- Make pagetable allocations evictable and manage them similar to VM_BIND
>>>+  mapped objects. Page table pages are similar to persistent mappings of a
>>>+  VM (the differences here are that the page table pages will not
>>>+  have an i915_vma structure and, after swapping pages back in, the parent
>>>+  page link needs to be updated).
>>
>>See above, but I think this should be included as part of the initial
>>vm_bind push.
>>
>
>Ok, as you mentioned above, we can do it soon after initial vm_bind support
>lands, but before we add any new vm_bind features.
>
>>>+- Remove the vma lookup cache (eb->gem_context->handles_vma). The VM_BIND
>>>+  feature does not use it, and the complexity it brings in is probably more
>>>+  than the performance advantage we get in the legacy execbuff case.
>>>+- Remove vma->open_count counting
>>>+- Remove i915_vma active reference tracking. Instead use underlying BO's
>>>+  dma-resv fence list to determine if a i915_vma is active or not.
>>
>>So this is a complete mess, and really should not exist. I think it needs
>>to be removed before we try to make i915_vma even more complex by adding
>>vm_bind.
>>
>
>Hmm...Need to look into this. I am not sure how much of an effort it is going
>to be to remove i915_vma active reference tracking and instead use dma_resv
>fences for activeness tracking.
>
>>The other thing I've been pondering here is that vm_bind is really
>>completely different from legacy vm structures for a lot of reasons:
>>- no relocation or softpin handling, which means vm_bind has no reason to
>> ever look at the i915_vma structure in execbuf code. Unfortunately
>> execbuf has been rewritten to be vma instead of obj centric, so it's a
>> 100% mismatch
>>
>>- vm_bind never has to manage any vm lru. Legacy execbuf has to maintain
>> that because the kernel manages the virtual address space fully. Again
>> ideally that entire vma_move_to_active code and everything related to it
>> would simply not exist.
>>
>>- similar on the eviction side, the rules are quite different: For vm_bind
>> we never tear down the vma, instead it's just moved to the list of
>> evicted vma. Legacy vm have no need for all these additional lists, so
>> another huge confusion.
>>
>>- if the refcount is done correctly for vm_bind we wouldn't need the
>> tricky code in the bo close paths. Unfortunately legacy vm with
>> relocations and softpin require that vma are only a weak reference, so
>> that cannot be removed.
>>
>>- there's also a ton of special cases for ggtt handling, like the
>> different views (for display, partial views for mmap), but also the
>> gen2/3 alignment and padding requirements which vm_bind never needs.
>>
>>I think the right thing here is to massively split the implementation
>>behind some solid vm/vma abstraction, with a base class for vm and vma
>>which _only_ has the pieces which both vm_bind and the legacy vm stuff
>>needs. But it's a bit tricky to get there. I think a workable path would
>>be:
>>- Add a new base class to both i915_address_space and i915_vma, which
>> starts out empty.
>>
>>- As vm_bind code lands, move things that vm_bind code needs into these
>> base classes
>>
>
>Ok
>
>>- The goal should be that these base classes are a stand-alone library
>> that other drivers could reuse. Like we've done with the buddy
>> allocator, which first moved from i915-gem to i915-ttm, and which amd
>> now moved to drm/ttm for reuse by amdgpu. Ideally other drivers
>> interested in adding something like vm_bind should be involved from the
>> start (or maybe the entire thing reused in amdgpu, they're looking at
>> vk sparse binding support too or at least have perf issues I think).
>>
>>- Locking must be the same across all implementations, otherwise it's
>> really not an abstraction. i915 screwed this up terribly by having
>> different locking rules for ppgtt and ggtt, which is just nonsense.
>>
>>- The legacy specific code needs to be extracted as much as possible and
>> shoved into separate files. In execbuf this means we need to get back to
>> object centric flow, and the slowpaths need to become a lot simpler
>> again (Maarten has cleaned up some of this, but there's still a silly
>> amount of hacks in there with funny layering).
>>
>
>This also, we can do soon after vm_bind code lands right?
>
>>- I think if stuff like the vma eviction details (list movement and
>> locking and refcounting of the underlying object)
>>
>>>+
>>>+These can be worked upon after initial vm_bind support is added.
>>
>>I don't think that works, given how badly i915-gem team screwed up in
>>other places. And those places had to be fixed by adopting shared code
>>like ttm. Plus there's already a huge unfulfilled promise pending with the
>>drm/sched conversion, i915-gem team is clearly deeply in the red here :-/
>>
>
>Hmmm ok. As I mentioned above, I need to look into how to remove i915_vma
>active reference tracking code from i915 driver. Wonder if there is any
>middle ground here like not using that in vm_bind mode?
>
>Niranjana
>
>>Cheers, Daniel
>>
>>>+
>>>+
>>>+UAPI
>>>+=====
>>>+Uapi definition can be found here:
>>>+.. kernel-doc:: Documentation/gpu/rfc/i915_vm_bind.h
>>>diff --git a/Documentation/gpu/rfc/index.rst b/Documentation/gpu/rfc/index.rst
>>>index 91e93a705230..7d10c36b268d 100644
>>>--- a/Documentation/gpu/rfc/index.rst
>>>+++ b/Documentation/gpu/rfc/index.rst
>>>@@ -23,3 +23,7 @@ host such documentation:
>>> .. toctree::
>>>
>>>     i915_scheduler.rst
>>>+
>>>+.. toctree::
>>>+
>>>+    i915_vm_bind.rst
>>>--
>>>2.21.0.rc0.32.g243a4c7e27
>>>
>>
>>--
>>Daniel Vetter
>>Software Engineer, Intel Corporation
>>http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [Intel-gfx] [RFC v2 1/2] drm/doc/rfc: VM_BIND feature design document
@ 2022-04-27 15:41         ` Niranjana Vishwanathapura
  0 siblings, 0 replies; 31+ messages in thread
From: Niranjana Vishwanathapura @ 2022-04-27 15:41 UTC (permalink / raw)
  To: Daniel Vetter
  Cc: Dave Airlie, intel-gfx, dri-devel, chris.p.wilson, daniel.vetter,
	thomas.hellstrom, Christian König, Ben Skeggs

On Wed, Apr 20, 2022 at 03:45:25PM -0700, Niranjana Vishwanathapura wrote:
>On Thu, Mar 31, 2022 at 10:28:48AM +0200, Daniel Vetter wrote:
>>Adding a pile of people who've expressed interest in vm_bind for their
>>drivers.
>>
>>Also note to the intel folks: This is largely written with me having my
>>subsystem co-maintainer hat on, i.e. what I think is the right thing to do
>>here for the subsystem at large. There is substantial rework involved
>>here, but it's not any different from i915 adopting ttm or i915 adpoting
>>drm/sched, and I do think this stuff needs to happen in one form or
>>another.
>>
>>On Mon, Mar 07, 2022 at 12:31:45PM -0800, Niranjana Vishwanathapura wrote:
>>>VM_BIND design document with description of intended use cases.
>>>
>>>Signed-off-by: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>
>>>---
>>> Documentation/gpu/rfc/i915_vm_bind.rst | 210 +++++++++++++++++++++++++
>>> Documentation/gpu/rfc/index.rst        |   4 +
>>> 2 files changed, 214 insertions(+)
>>> create mode 100644 Documentation/gpu/rfc/i915_vm_bind.rst
>>>
>>>diff --git a/Documentation/gpu/rfc/i915_vm_bind.rst b/Documentation/gpu/rfc/i915_vm_bind.rst
>>>new file mode 100644
>>>index 000000000000..cdc6bb25b942
>>>--- /dev/null
>>>+++ b/Documentation/gpu/rfc/i915_vm_bind.rst
>>>@@ -0,0 +1,210 @@
>>>+==========================================
>>>+I915 VM_BIND feature design and use cases
>>>+==========================================
>>>+
>>>+VM_BIND feature
>>>+================
>>>+DRM_I915_GEM_VM_BIND/UNBIND ioctls allows UMD to bind/unbind GEM buffer
>>>+objects (BOs) or sections of a BOs at specified GPU virtual addresses on
>>>+a specified address space (VM).
>>>+
>>>+These mappings (also referred to as persistent mappings) will be persistent
>>>+across multiple GPU submissions (execbuff) issued by the UMD, without user
>>>+having to provide a list of all required mappings during each submission
>>>+(as required by older execbuff mode).
>>>+
>>>+VM_BIND ioctl deferes binding the mappings until next execbuff submission
>>>+where it will be required, or immediately if I915_GEM_VM_BIND_IMMEDIATE
>>>+flag is set (useful if mapping is required for an active context).
>>
>>So this is a screw-up I've done, and for upstream I think we need to fix
>>it: Implicit sync is bad, and it's also still a bad idea for vm_bind, and
>>I was wrong suggesting we should do this a few years back when we kicked
>>this off internally :-(
>>
>>What I think we need is just always VM_BIND_IMMEDIATE mode, and then a few
>>things on top:
>>- in and out fences, like with execbuf, to allow userspace to sync with
>> execbuf as needed
>>- for compute-mode context this means userspace memory fences
>>- for legacy context this means a timeline syncobj in drm_syncobj
>>
>>No sync_file or anything else like this at all. This means a bunch of
>>work, but also it'll have benefits because it means we should be able to
>>use exactly the same code paths and logic for both compute and for legacy
>>context, because drm_syncobj support future fence semantics.
>>
>
>Thanks Daniel,
>Ok, will update
>

I had a long conversation with Daniel on some of the points discussed here.
Thanks to Daniel for clarifying many points here.

Here is the summary of the discussion.

1) A prep patch is needed to update documentation of some existing uapi and this
   new VM_BIND uapi can update/refer to that.
   I will include this prep patch in the next revision of this RFC series.
   Will also include the uapi header file in the rst file so that it gets rendered.

2) Will update documentation here with proper use of dma_resv_usage while adding
   fences to vm_bind objects. It is going to be, DMA_RESV_USAGE_BOOKKEEP by default
   if not override with execlist in execbuff path.

3) Add extension to execbuff ioctl to specify batch buffer as GPU virtual address
   instead of having to pass it as a BO handle in execlist. This will also make the
   execlist usage solely for implicit sync setting which is further discussed below.

4) Need to look into when will Jason's dma-buf fence import/export ioctl support will
   land and whether it will be used both for vl and gl. Need to sync with Jason on this.
   Probably the better option here would be to not support execlist in execbuff path in
   vm_bind mode for initial vm_bind support (hoping Jason's dma-buf fence import/export
   ioctl will be enough). We can add support for execlist in execbuff for vm_bind mode
   later if required (say for gl).

5) There are a lot of things in the execbuff path that don't apply in VM_BIND mode (like
   relocations, implicit sync etc). Separate them out by using function pointers wherever
   the functionality differs between current design and the newer VM_BIND design.

6) Separate out i915_vma active reference counting in execbuff path and do not use it in
   VM_BIND mode. Instead use dma-resv fence checking for VM_BIND mode. This should be easier
   to get it working with the current TTM backend (which initial VM_BIND support will use).
   And remove i915_vma active reference counting fully while supporting TTM backend for igfx.

7) As we support compute mode contexts only with the GuC scheduler backend, and compute
   mode requires support for suspend and resume of contexts, it will have a dependency
   on the i915 drm scheduler conversion.

Will revise this series accordingly.

Thanks,
Niranjana

>>Also on the implementation side we still need to install dma_fence to the
>>various dma_resv, and for this we need the new dma_resv_usage series from
>>Christian König first. vm_bind fences can then use the USAGE_BOOKKEEPING
>>flag to make sure they never result in an oversync issue with execbuf. I
>>don't think trying to land vm_bind without that prep work in
>>dma_resv_usage makes sense.
>>
>
>Ok, but that is not a dependency for this VM_BIND design RFC patch right?
>I will add this to the documentation here.
>
>>Also as soon as dma_resv_usage has landed there's a few cleanups we should
>>do in i915:
>>- ttm bo moving code should probably simplify a bit (and maybe more of the
>> code should be pushed as helpers into ttm)
>>- clflush code should be moved over to using USAGE_KERNEL and the various
>> hacks and special cases should be ditched. See df94fd05e69e ("drm/i915:
>> expand on the kernel-doc for cache_dirty") for a bit more context
>>
>>This is still not yet enough, since if a vm_bind races with an eviction we
>>might stall on the new buffers being readied first before the context can
>>continue. This needs some care to make sure that vma which aren't fully
>>bound yet are on a separate list, and vma which are marked for unbinding
>>are removed from the main working set list as soon as possible.
>>
>>All of these things are relevant for the uapi semantics, which means
>>- they need to be documented in the uapi kerneldoc, ideally with example
>> flows
>>- umd need to ack this
>>
>
>Ok
>
>>The other thing here is the async/nonblocking path. I think we still need
>>that one, but again it should not sync with anything going on in execbuf,
>>but simply execute the ioctl code in a kernel thread. The idea here is
>>that this works like a special gpu engine, so that compute and vk can
>>schedule bindings interleaved with rendering. This should be enough to get
>>a performant vk sparse binding/textures implementation.
>>
>>But I'm not entirely sure on this one, so this definitely needs acks from
>>umds.
>>
>>>+VM_BIND feature is advertised to user via I915_PARAM_HAS_VM_BIND.
>>>+User has to opt-in for VM_BIND mode of binding for an address space (VM)
>>>+during VM creation time via I915_VM_CREATE_FLAGS_USE_VM_BIND extension.
>>>+A VM in VM_BIND mode will not support older execbuff mode of binding.
>>>+
>>>+UMDs can still send BOs of these persistent mappings in execlist of execbuff
>>>+for specifying BO dependencies (implicit fencing) and to use BO as a batch,
>>>+but those BOs should be mapped ahead via vm_bind ioctl.
>>
>>should or must?
>>
>
>Must, will fix.
>
>>Also I'm not really sure that's a great interface. The batchbuffer really
>>only needs to be an address, so maybe all we need is an extension to
>>supply an u64 batchbuffer address instead of trying to retrofit this into
>>an unfitting current uapi.
>>
>
>Yah, this was considered, but it was decided to do it as a later optimization.
>But if we were to remove execlist entries completely (i.e., no implicit
>sync either), then we need to do this from the beginning.
>
>>And for implicit sync there's two things:
>>- for vk I think the right uapi is the dma-buf fence import/export ioctls
>> from Jason Ekstrand. I think we should land that first instead of
>> hacking funny concepts together
>
>I did not understand fully, can you point to it?
>
>>- for gl the dma-buf import/export might not be fast enough, since gl
>> needs to do a _lot_ of implicit sync. There we might need to use the
>> execbuffer buffer list, but then we should have extremely clear uapi
>> rules which disallow _everything_ except setting the explicit sync uapi
>>
>
>Ok, so then, we still need to support implicit sync in vm_bind mode. Right?
>
>>Again all this stuff needs to be documented in detail in the kerneldoc
>>uapi spec.
>>
>
>ok
>
>>>+VM_BIND features include,
>>>+- Multiple Virtual Address (VA) mappings can map to the same physical pages
>>>+  of an object (aliasing).
>>>+- VA mapping can map to a partial section of the BO (partial binding).
>>>+- Support capture of persistent mappings in the dump upon GPU error.
>>>+- TLB is flushed upon unbind completion. Batching of TLB flushes in some
>>>+  usecases will be helpful.
>>>+- Asynchronous vm_bind and vm_unbind support.
>>>+- VM_BIND uses user/memory fence mechanism for signaling bind completion
>>>+  and for signaling batch completion in long running contexts (explained
>>>+  below).
>>
>>This should all be in the kerneldoc.
>>
>
>ok
>
>>>+VM_PRIVATE objects
>>>+------------------
>>>+By default, BOs can be mapped on multiple VMs and can also be dma-buf
>>>+exported. Hence these BOs are referred to as Shared BOs.
>>>+During each execbuff submission, the request fence must be added to the
>>>+dma-resv fence list of all shared BOs mapped on the VM.
>>>+
>>>+VM_BIND feature introduces an optimization where user can create BO which
>>>+is private to a specified VM via I915_GEM_CREATE_EXT_VM_PRIVATE flag during
>>>+BO creation. Unlike Shared BOs, these VM private BOs can only be mapped on
>>>+the VM they are private to and can't be dma-buf exported.
>>>+All private BOs of a VM share the dma-resv object. Hence during each execbuff
>>>+submission, they need only one dma-resv fence list updated. Thus the fast
>>>+path (where required mappings are already bound) submission latency is O(1)
>>>+w.r.t the number of VM private BOs.
>>
>>Two things:
>>
>>- I think the above is required to for initial vm_bind for vk, it kinda
>> doesn't make much sense without that, and will allow us to match amdgpu
>> and radeonsi
>>
>>- Christian König just landed ttm bulk lru helpers, and I think we need to
>> use those. This means vm_bind will only work with the ttm backend, but
>> that's what we have for the big dgpu where vm_bind helps more in terms
>> of performance, and the igfx conversion to ttm is already going on.
>>
>
>ok
>
>>Furthermore the i915 shrinker lru has stopped being an lru, so I think
>>that should also be moved over to the ttm lru in some fashion to make sure
>>we once again have a reasonable and consistent memory aging and reclaim
>>architecture. The current code is just too much of a complete mess.
>>
>>And since this is all fairly integral to how the code arch works I don't
>>think merging a different version which isn't based on ttm bulk lru
>>helpers makes sense.
>>
>>Also I do think the page table lru handling needs to be included here,
>>because that's another complete hand-rolled separate world for not much
>>good reasons. I guess that can happen in parallel with the initial vm_bind
>>bring-up, but it needs to be completed by the time we add the features
>>beyond the initial support needed for vk.
>>
>
>Ok
>
>>>+VM_BIND locking hierarchy
>>>+-------------------------
>>>+VM_BIND locking order is as below.
>>>+
>>>+1) A vm_bind mutex will protect vm_bind lists. This lock is taken in vm_bind/
>>>+   vm_unbind ioctl calls, in the execbuff path and while releasing the mapping.
>>>+
>>>+   In future, when GPU page faults are supported, we can potentially use a
>>>+   rwsem instead, so that multiple pagefault handlers can take the read side
>>>+   lock to lookup the mapping and hence can run in parallel.
>>>+
>>>+2) The BO's dma-resv lock will protect i915_vma state and needs to be held
>>>+   while binding a vma and while updating dma-resv fence list of a BO.
>>>+   The private BOs of a VM will all share a dma-resv object.
>>>+
>>>+   This lock is held in vm_bind call for immediate binding, during vm_unbind
>>>+   call for unbinding and during execbuff path for binding the mapping and
>>>+   updating the dma-resv fence list of the BO.
>>>+
>>>+3) Spinlock/s to protect some of the VM's lists.
>>>+
>>>+We will also need support for bulk LRU movement of persistent mappings to
>>>+avoid additional latencies in the execbuff path.
>>
>>This needs more detail and explanation of how each level is required. Also
>>the shared dma_resv for VM_PRIVATE objects is kinda important to explain.
>>
>>Like "some of the VM's lists" explains pretty much nothing.
>>
>
>Ok, will explain.
>
>>>+
>>>+GPU page faults
>>>+----------------
>>>+Both older execbuff mode and the newer VM_BIND mode of binding will require
>>>+using dma-fence to ensure residency.
>>>+In future when GPU page faults are supported, no dma-fence usage is required
>>>+as residency is purely managed by installing and removing/invalidating ptes.
>>
>>This is a bit confusing. I think one part of this should be moved into the
>>section with future vm_bind use-cases (we're not going to support page
>>faults with legacy softpin or even worse, relocations). The locking
>>discussion should be part of the much longer list of uses cases that
>>motivate the locking design.
>>
>
>Ok, will move.
>
>>>+
>>>+
>>>+User/Memory Fence
>>>+==================
>>>+The idea is to take a user specified virtual address and install an interrupt
>>>+handler to wake up the current task when the memory location passes the user
>>>+supplied filter.
>>>+
>>>+A user/memory fence is an <address, value> pair. To signal the user fence,
>>>+the specified value is written at the specified virtual address and the
>>>+waiting process is woken up. User can wait on a user fence with the
>>>+gem_wait_user_fence ioctl.
>>>+
>>>+It also allows the user to emit their own MI_FLUSH/PIPE_CONTROL notify
>>>+interrupt within their batches after updating the value to have sub-batch
>>>+precision on the wakeup. Each batch can signal a user fence to indicate
>>>+the completion of the next-level batch. The completion of the very first level
>>>+batch needs to be signaled by the command streamer. The user must provide the
>>>+user/memory fence for this via the DRM_I915_GEM_EXECBUFFER_EXT_USER_FENCE
>>>+extension of execbuff ioctl, so that KMD can setup the command streamer to
>>>+signal it.
>>>+
>>>+User/Memory fence can also be supplied to the kernel driver to signal/wake up
>>>+the user process after completion of an asynchronous operation.
>>>+
>>>+When the VM_BIND ioctl is provided with a user/memory fence via the
>>>+I915_VM_BIND_EXT_USER_FENCE extension, it will be signaled upon the completion
>>>+of binding of that mapping. All async binds/unbinds are serialized, hence
>>>+signaling of the user/memory fence also indicates the completion of all
>>>+previous binds/unbinds.
>>>+
>>>+This feature will be derived from the below original work:
>>>+https://patchwork.freedesktop.org/patch/349417/
>>
>>This is 1:1 tied to long running compute mode contexts (which in the uapi
>>doc must reference the endless amounts of bikeshed summary we have in the
>>docs about indefinite fences).
>>
>
>Ok, will check and add reference.
>
>>I'd put this into a new section about compute and userspace memory fences
>>support, with this and the next chapter ...
>
>ok
>
>>>+
>>>+
>>>+VM_BIND use cases
>>>+==================
>>
>>... and then make this section here focus entirely on additional vm_bind
>>use-cases that we'll be adding later on. Which doesn't need to go into any
>>details, it's just justification for why we want to build the world on top
>>of vm_bind.
>>
>
>ok
>
>>>+
>>>+Long running Compute contexts
>>>+------------------------------
>>>+Usage of dma-fence expects that they complete in a reasonable amount of time.
>>>+Compute on the other hand can be long running. Hence it is appropriate for
>>>+compute to use user/memory fence and dma-fence usage will be limited to
>>>+in-kernel consumption only. This requires an execbuff uapi extension to pass
>>>+in user fence. Compute must opt-in for this mechanism with
>>>+I915_CONTEXT_CREATE_FLAGS_LONG_RUNNING flag during context creation.
>>>+
>>>+The dma-fence based user interfaces like the gem_wait ioctl, execbuff out fence
>>>+and implicit dependency setting are not allowed on long running contexts.
>>>+
>>>+Where GPU page faults are not available, kernel driver upon buffer invalidation
>>>+will initiate a suspend (preemption) of long running context with a dma-fence
>>>+attached to it. And upon completion of that suspend fence, finish the
>>>+invalidation, revalidate the BO and then resume the compute context. This is
>>>+done by having a per-context fence (called suspend fence) proxying as
>>>+i915_request fence. This suspend fence is enabled when there is a wait on it,
>>>+which triggers the context preemption.
>>>+
>>>+This is much easier to support with VM_BIND compared to the current heavier
>>>+execbuff path resource attachment.
>>
>>There's a bunch of tricky code around compute mode context support, like
>>the preempt ctx fence (or suspend fence or whatever you want to call it),
>>and the resume work. And I think that code should be shared across
>>drivers.
>>
>>I think the right place to put this is into drm/sched, somewhere attached
>>to the drm_sched_entity structure. I expect i915 folks to collaborate with
>>amd and ideally also get amdkfd to adopt the same thing if possible. At
>>least Christian has mentioned in the past that he's a bit unhappy about
>>how this works.
>>
>>Also drm/sched has dependency tracking, which will be needed to pipeline
>>context resume operations. That needs to be used instead of i915-gem
>>inventing yet another dependency tracking data structure (it already has 3
>>and that's roughly 3 too many).
>>
>>This means compute mode support and userspace memory fences are blocked on
>>the drm/sched conversion, but *eh* add it to the list of reasons for why
>>drm/sched needs to happen.
>>
>>Also since we only have support for compute mode ctx in our internal tree
>>with the guc scheduler backend anyway, and the first conversion target is
>>the guc backend, I don't think this actually holds up a lot of the code.
>>
>
>Hmm...ok. Currently, the context suspend and resume operations in our
>internal tree are through an orthogonal GuC interface (not through the scheduler).
>So, I need to look more into this part.
>
>>>+Low Latency Submission
>>>+-----------------------
>>>+Allows compute UMD to directly submit GPU jobs instead of through execbuff
>>>+ioctl. VM_BIND allows map/unmap of BOs required for directly submitted jobs.
>>
>>This is really just a special case of compute mode contexts, I think I'd
>>include that in there, but explain better what it requires (i.e. vm_bind
>>not being synchronized against execbuf).
>>
>
>ok
>
>>>+
>>>+Debugger
>>>+---------
>>>+With debug event interface user space process (debugger) is able to keep track
>>>+of and act upon resources created by another process (debuggee) and attached
>>>+to GPU via vm_bind interface.
>>>+
>>>+Mesa/Vulkan
>>>+------------
>>>+VM_BIND can potentially reduce the CPU-overhead in Mesa thus improving
>>>+performance. For Vulkan it should be straightforward to use VM_BIND.
>>>+For Iris implicit buffer tracking must be implemented before we can harness
>>>+VM_BIND benefits. With increasing GPU hardware performance reducing CPU
>>>+overhead becomes more important.
>>
>>Just to clarify, I don't think we can land vm_bind into upstream if it
>>doesn't work 100% for vk. There's a bit much "can" instead of "will in
>>this section".
>>
>
>ok, will explain better.
>
>>>+
>>>+Page level hints settings
>>>+--------------------------
>>>+VM_BIND allows any hints setting per mapping instead of per BO.
>>>+Possible hints include read-only, placement and atomicity.
>>>+Sub-BO level placement hint will be even more relevant with
>>>+upcoming GPU on-demand page fault support.
>>>+
>>>+Page level Cache/CLOS settings
>>>+-------------------------------
>>>+VM_BIND allows cache/CLOS settings per mapping instead of per BO.
>>>+
>>>+Shared Virtual Memory (SVM) support
>>>+------------------------------------
>>>+VM_BIND interface can be used to map system memory directly (without gem BO
>>>+abstraction) using the HMM interface.
>>
>>Userptr is absent here (and it's not the same as svm, at least on
>>discrete), and this is needed for the initial version since otherwise vk
>>can't use it because we're not at feature parity.
>>
>
>userptr gem objects are supported in initial version (and yes it is not
>same as SVM). I did not add it here as there is no additional uapi
>change required to support that.
>
>>Irc discussions by Maarten and Dave came up with the idea that maybe
>>userptr for vm_bind should work _without_ any gem bo as backing storage,
>>since that guarantees that people don't come up with funny ideas like
>>trying to share such bo across process or mmap it and other nonsense which
>>just doesn't work.
>>
>
>Hmm...there is no plan to support userptr _without_ a gem bo, at least not in
>the initial vm_bind support. Is it Ok to put it in the 'futures' section?
>
>>>+
>>>+
>>>+Broader i915 cleanups
>>>+=====================
>>>+Supporting this whole new vm_bind mode of binding which comes with its own
>>>+usecases to support and the locking requirements requires proper integration
>>>+with the existing i915 driver. This calls for some broader i915 driver
>>>+cleanups/simplifications for maintainability of the driver going forward.
>>>+Here are few things identified and are being looked into.
>>>+
>>>+- Make pagetable allocations evictable and manage them similar to VM_BIND
>>>+  mapped objects. Page table pages are similar to persistent mappings of a
>>>+  VM (the differences here are that the page table pages will not
>>>+  have an i915_vma structure, and after swapping pages back in, the parent
>>>+  page link needs to be updated).
>>
>>See above, but I think this should be included as part of the initial
>>vm_bind push.
>>
>
>Ok, as you mentioned above, we can do it soon after initial vm_bind support
>lands, but before we add any new vm_bind features.
>
>>>+- Remove the vma lookup cache (eb->gem_context->handles_vma). The VM_BIND feature
>>>+  does not use it, and the complexity it brings in is probably more than the
>>>+  performance advantage we get in the legacy execbuff case.
>>>+- Remove vma->open_count counting
>>>+- Remove i915_vma active reference tracking. Instead use underlying BO's
>>>+  dma-resv fence list to determine if a i915_vma is active or not.
>>
>>So this is a complete mess, and really should not exist. I think it needs
>>to be removed before we try to make i915_vma even more complex by adding
>>vm_bind.
>>
>
>Hmm...Need to look into this. I am not sure how much of an effort it is going
>to be to remove i915_vma active reference tracking and instead use dma_resv
>fences for activeness tracking.
>
>>The other thing I've been pondering here is that vm_bind is really
>>completely different from legacy vm structures for a lot of reasons:
>>- no relocation or softpin handling, which means vm_bind has no reason to
>> ever look at the i915_vma structure in execbuf code. Unfortunately
>> execbuf has been rewritten to be vma instead of obj centric, so it's a
>> 100% mismatch
>>
>>- vm_bind never has to manage any vm lru. Legacy execbuf has to maintain
>> that because the kernel manages the virtual address space fully. Again
>> ideally that entire vma_move_to_active code and everything related to it
>> would simply not exist.
>>
>>- similar on the eviction side, the rules are quite different: For vm_bind
>> we never tear down the vma, instead it's just moved to the list of
>> evicted vma. Legacy vm have no need for all these additional lists, so
>> another huge confusion.
>>
>>- if the refcount is done correctly for vm_bind we wouldn't need the
>> tricky code in the bo close paths. Unfortunately legacy vm with
>> relocations and softpin require that vma are only a weak reference, so
>> that cannot be removed.
>>
>>- there's also a ton of special cases for ggtt handling, like the
>> different views (for display, partial views for mmap), but also the
>> gen2/3 alignment and padding requirements which vm_bind never needs.
>>
>>I think the right thing here is to massively split the implementation
>>behind some solid vm/vma abstraction, with a base clase for vm and vma
>>which _only_ has the pieces which both vm_bind and the legacy vm stuff
>>needs. But it's a bit tricky to get there. I think a workable path would
>>be:
>>- Add a new base class to both i915_address_space and i915_vma, which
>> starts out empty.
>>
>>- As vm_bind code lands, move things that vm_bind code needs into these
>> base classes
>>
>
>Ok
>
>>- The goal should be that these base classes are a stand-alone library
>> that other drivers could reuse. Like we've done with the buddy
>> allocator, which first moved from i915-gem to i915-ttm, and which amd
>> now moved to drm/ttm for reuse by amdgpu. Ideally other drivers
>> interested in adding something like vm_bind should be involved from the
>> start (or maybe the entire thing reused in amdgpu, they're looking at
>> vk sparse binding support too or at least have perf issues I think).
>>
>>- Locking must be the same across all implementations, otherwise it's
>> really not an abstraction. i915 screwed this up terribly by having
>> different locking rules for ppgtt and ggtt, which is just nonsense.
>>
>>- The legacy specific code needs to be extracted as much as possible and
>> shoved into separate files. In execbuf this means we need to get back to
>> object centric flow, and the slowpaths need to become a lot simpler
>> again (Maarten has cleaned up some of this, but there's still a silly
>> amount of hacks in there with funny layering).
>>
>
>This also, we can do soon after vm_bind code lands right?
>
>>- I think if stuff like the vma eviction details (list movement and
>> locking and refcounting of the underlying object)
>>
>>>+
>>>+These can be worked upon after intitial vm_bind support is added.
>>
>>I don't think that works, given how badly i915-gem team screwed up in
>>other places. And those places had to be fixed by adopting shared code
>>like ttm. Plus there's already a huge unfulfilled promise pending with the
>>drm/sched conversion, i915-gem team is clearly deeply in the red here :-/
>>
>
>Hmmm ok. As I mentioned above, I need to look into how to remove i915_vma
>active reference tracking code from i915 driver. Wonder if there is any
>middle ground here like not using that in vm_bind mode?
>
>Niranjana
>
>>Cheers, Daniel
>>
>>>+
>>>+
>>>+UAPI
>>>+=====
>>>+UAPI definition can be found here:
>>>+.. kernel-doc:: Documentation/gpu/rfc/i915_vm_bind.h
>>>diff --git a/Documentation/gpu/rfc/index.rst b/Documentation/gpu/rfc/index.rst
>>>index 91e93a705230..7d10c36b268d 100644
>>>--- a/Documentation/gpu/rfc/index.rst
>>>+++ b/Documentation/gpu/rfc/index.rst
>>>@@ -23,3 +23,7 @@ host such documentation:
>>> .. toctree::
>>>
>>>     i915_scheduler.rst
>>>+
>>>+.. toctree::
>>>+
>>>+    i915_vm_bind.rst
>>>--
>>>2.21.0.rc0.32.g243a4c7e27
>>>
>>
>>--
>>Daniel Vetter
>>Software Engineer, Intel Corporation
>>http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC v2 1/2] drm/doc/rfc: VM_BIND feature design document
  2022-04-27 15:41         ` [Intel-gfx] " Niranjana Vishwanathapura
@ 2022-04-28 12:29           ` Daniel Vetter
  -1 siblings, 0 replies; 31+ messages in thread
From: Daniel Vetter @ 2022-04-28 12:29 UTC (permalink / raw)
  To: Niranjana Vishwanathapura
  Cc: Jason Ekstrand, Dave Airlie, intel-gfx, dri-devel, Bloomfield,
	Jon, chris.p.wilson, daniel.vetter, thomas.hellstrom,
	Christian König, Ben Skeggs

On Wed, Apr 27, 2022 at 08:41:35AM -0700, Niranjana Vishwanathapura wrote:
> On Wed, Apr 20, 2022 at 03:45:25PM -0700, Niranjana Vishwanathapura wrote:
> > On Thu, Mar 31, 2022 at 10:28:48AM +0200, Daniel Vetter wrote:
> > > Adding a pile of people who've expressed interest in vm_bind for their
> > > drivers.
> > > 
> > > Also note to the intel folks: This is largely written with me having my
> > > subsystem co-maintainer hat on, i.e. what I think is the right thing to do
> > > here for the subsystem at large. There is substantial rework involved
> > > here, but it's not any different from i915 adopting ttm or i915 adopting
> > > drm/sched, and I do think this stuff needs to happen in one form or
> > > another.
> > > 
> > > On Mon, Mar 07, 2022 at 12:31:45PM -0800, Niranjana Vishwanathapura wrote:
> > > > VM_BIND design document with description of intended use cases.
> > > > 
> > > > Signed-off-by: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>
> > > > ---
> > > > Documentation/gpu/rfc/i915_vm_bind.rst | 210 +++++++++++++++++++++++++
> > > > Documentation/gpu/rfc/index.rst        |   4 +
> > > > 2 files changed, 214 insertions(+)
> > > > create mode 100644 Documentation/gpu/rfc/i915_vm_bind.rst
> > > > 
> > > > diff --git a/Documentation/gpu/rfc/i915_vm_bind.rst b/Documentation/gpu/rfc/i915_vm_bind.rst
> > > > new file mode 100644
> > > > index 000000000000..cdc6bb25b942
> > > > --- /dev/null
> > > > +++ b/Documentation/gpu/rfc/i915_vm_bind.rst
> > > > @@ -0,0 +1,210 @@
> > > > +==========================================
> > > > +I915 VM_BIND feature design and use cases
> > > > +==========================================
> > > > +
> > > > +VM_BIND feature
> > > > +================
> > > > +DRM_I915_GEM_VM_BIND/UNBIND ioctls allows UMD to bind/unbind GEM buffer
> > > > +objects (BOs) or sections of a BOs at specified GPU virtual addresses on
> > > > +a specified address space (VM).
> > > > +
> > > > +These mappings (also referred to as persistent mappings) will be persistent
> > > > +across multiple GPU submissions (execbuff) issued by the UMD, without user
> > > > +having to provide a list of all required mappings during each submission
> > > > +(as required by older execbuff mode).
> > > > +
> > > > +VM_BIND ioctl defers binding the mappings until the next execbuff submission
> > > > +where it will be required, or immediately if I915_GEM_VM_BIND_IMMEDIATE
> > > > +flag is set (useful if mapping is required for an active context).
> > > 
> > > So this is a screw-up I've done, and for upstream I think we need to fix
> > > it: Implicit sync is bad, and it's also still a bad idea for vm_bind, and
> > > I was wrong suggesting we should do this a few years back when we kicked
> > > this off internally :-(
> > > 
> > > What I think we need is just always VM_BIND_IMMEDIATE mode, and then a few
> > > things on top:
> > > - in and out fences, like with execbuf, to allow userspace to sync with
> > > execbuf as needed
> > > - for compute-mode context this means userspace memory fences
> > > - for legacy context this means a timeline syncobj in drm_syncobj
> > > 
> > > No sync_file or anything else like this at all. This means a bunch of
> > > work, but also it'll have benefits because it means we should be able to
> > > use exactly the same code paths and logic for both compute and for legacy
> > > context, because drm_syncobj support future fence semantics.
> > > 
> > 
> > Thanks Daniel,
> > Ok, will update
> > 
> 
> I had a long conversation with Daniel on some of the points discussed here.
> Thanks to Daniel for clarifying many points here.
> 
> Here is the summary of the discussion.
> 
> 1) A prep patch is needed to update documentation of some existing uapi and this
>   new VM_BIND uapi can update/refer to that.
>   I will include this prep patch in the next revision of this RFC series.
>   Will also include the uapi header file in the rst file so that it gets rendered.
> 
> 2) Will update documentation here with proper use of dma_resv_usage while adding
>   fences to vm_bind objects. It is going to be, DMA_RESV_USAGE_BOOKKEEP by default
>   if not override with execlist in execbuff path.
> 
> 3) Add extension to execbuff ioctl to specify batch buffer as GPU virtual address
>   instead of having to pass it as a BO handle in execlist. This will also make the
>   execlist usage solely for implicit sync setting which is further discussed below.
> 
> 4) Need to look into when Jason's dma-buf fence import/export ioctl support will
>   land and whether it will be used for both vl and gl. Need to sync with Jason on this.
>   Probably the better option here would be to not support execlist in execbuff path in
>   vm_bind mode for initial vm_bind support (hoping Jason's dma-buf fence import/export
>   ioctl will be enough). We can add support for execlist in execbuff for vm_bind mode
>   later if required (say for gl).

So I'm again less sure whether the import/export ioctl is the right thing
for gl, but I still think we should try. The reason is that we really
need to set the implicit sync set per execbuf, otherwise there are oversync
issues. So one of the ideas we've discussed, where the implicit sync set
would be controlled through vm_bind doesn't work for gl.
-Daniel

> 
> 5) There are a lot of things in the execbuff path that don't apply in VM_BIND mode (like
>   relocations, implicit sync etc). Separate them out by using function pointers wherever
>   the functionality differs between current design and the newer VM_BIND design.
> 
> 6) Separate out i915_vma active reference counting in execbuff path and do not use it in
>   VM_BIND mode. Instead use dma-resv fence checking for VM_BIND mode. This should be easier
>   to get it working with the current TTM backend (which initial VM_BIND support will use).
>   And remove i915_vma active reference counting fully while supporting TTM backend for igfx.
> 
> 7) As we support compute mode contexts only with the GuC scheduler backend, and compute
>   mode requires support for suspend and resume of contexts, it will have a dependency
>   on the i915 drm scheduler conversion.
> 
> Will revise this series accordingly.
> 
> Thanks,
> Niranjana
> 
> > > Also on the implementation side we still need to install dma_fence to the
> > > various dma_resv, and for this we need the new dma_resv_usage series from
> > > Christian König first. vm_bind fences can then use the USAGE_BOOKKEEPING
> > > flag to make sure they never result in an oversync issue with execbuf. I
> > > don't think trying to land vm_bind without that prep work in
> > > dma_resv_usage makes sense.
> > > 
> > 
> > Ok, but that is not a dependency for this VM_BIND design RFC patch right?
> > I will add this to the documentation here.
> > 
> > > Also as soon as dma_resv_usage has landed there's a few cleanups we should
> > > do in i915:
> > > - ttm bo moving code should probably simplify a bit (and maybe more of the
> > > code should be pushed as helpers into ttm)
> > > - clflush code should be moved over to using USAGE_KERNEL and the various
> > > hacks and special cases should be ditched. See df94fd05e69e ("drm/i915:
> > > expand on the kernel-doc for cache_dirty") for a bit more context
> > > 
> > > This is still not yet enough, since if a vm_bind races with an eviction we
> > > might stall on the new buffers being readied first before the context can
> > > continue. This needs some care to make sure that vma which aren't fully
> > > bound yet are on a separate list, and vma which are marked for unbinding
> > > are removed from the main working set list as soon as possible.
> > > 
> > > All of these things are relevant for the uapi semantics, which means
> > > - they need to be documented in the uapi kerneldoc, ideally with example
> > > flows
> > > - umd need to ack this
> > > 
> > 
> > Ok
> > 
> > > The other thing here is the async/nonblocking path. I think we still need
> > > that one, but again it should not sync with anything going on in execbuf,
> > > but simply execute the ioctl code in a kernel thread. The idea here is
> > > that this works like a special gpu engine, so that compute and vk can
> > > schedule bindings interleaved with rendering. This should be enough to get
> > > a performant vk sparse binding/textures implementation.
> > > 
> > > But I'm not entirely sure on this one, so this definitely needs acks from
> > > umds.
> > > 
> > > > +VM_BIND feature is advertised to user via I915_PARAM_HAS_VM_BIND.
> > > > +User has to opt-in for VM_BIND mode of binding for an address space (VM)
> > > > +during VM creation time via I915_VM_CREATE_FLAGS_USE_VM_BIND extension.
> > > > +A VM in VM_BIND mode will not support older execbuff mode of binding.
> > > > +
> > > > +UMDs can still send BOs of these persistent mappings in execlist of execbuff
> > > > +for specifying BO dependencies (implicit fencing) and to use BO as a batch,
> > > > +but those BOs should be mapped ahead via vm_bind ioctl.
> > > 
> > > should or must?
> > > 
> > 
> > Must, will fix.
> > 
> > > Also I'm not really sure that's a great interface. The batchbuffer really
> > > only needs to be an address, so maybe all we need is an extension to
> > > supply a u64 batchbuffer address instead of trying to retrofit this into
> > > an unfitting current uapi.
> > > 
> > 
> > Yah, this was considered, but was decided to do it as later optimization.
> > But if we were to remove execlist entries completely (i.e., no implicit
> > sync also), then we need to do this from the beginning.
> > 
> > > And for implicit sync there's two things:
> > > - for vk I think the right uapi is the dma-buf fence import/export ioctls
> > > from Jason Ekstrand. I think we should land that first instead of
> > > hacking funny concepts together
> > 
> > I did not understand fully, can you point to it?
> > 
> > > - for gl the dma-buf import/export might not be fast enough, since gl
> > > needs to do a _lot_ of implicit sync. There we might need to use the
> > > execbuffer buffer list, but then we should have extremely clear uapi
> > > rules which disallow _everything_ except setting the explicit sync uapi
> > > 
> > 
> > Ok, so then, we still need to support implicit sync in vm_bind mode. Right?
> > 
> > > Again all this stuff needs to be documented in detail in the kerneldoc
> > > uapi spec.
> > > 
> > 
> > ok
> > 
> > > > +VM_BIND features include:
> > > > +- Multiple Virtual Address (VA) mappings can map to the same physical pages
> > > > +  of an object (aliasing).
> > > > +- VA mapping can map to a partial section of the BO (partial binding).
> > > > +- Support capture of persistent mappings in the dump upon GPU error.
> > > > +- TLB is flushed upon unbind completion. Batching of TLB flushes in some
> > > > +  use cases will be helpful.
> > > > +- Asynchronous vm_bind and vm_unbind support.
> > > > +- VM_BIND uses user/memory fence mechanism for signaling bind completion
> > > > +  and for signaling batch completion in long running contexts (explained
> > > > +  below).
> > > 
> > > This should all be in the kerneldoc.
> > > 
> > 
> > ok
> > 
> > > > +VM_PRIVATE objects
> > > > +------------------
> > > > +By default, BOs can be mapped on multiple VMs and can also be dma-buf
> > > > +exported. Hence these BOs are referred to as Shared BOs.
> > > > +During each execbuff submission, the request fence must be added to the
> > > > +dma-resv fence list of all shared BOs mapped on the VM.
> > > > +
> > > > +VM_BIND feature introduces an optimization where the user can create a BO which
> > > > +is private to a specified VM via I915_GEM_CREATE_EXT_VM_PRIVATE flag during
> > > > +BO creation. Unlike Shared BOs, these VM private BOs can only be mapped on
> > > > +the VM they are private to and can't be dma-buf exported.
> > > > +All private BOs of a VM share the dma-resv object. Hence during each execbuff
> > > > +submission, they need only one dma-resv fence list updated. Thus the fast
> > > > +path (where required mappings are already bound) submission latency is O(1)
> > > > +w.r.t the number of VM private BOs.
> > > 
> > > Two things:
> > > 
> > > - I think the above is required to for initial vm_bind for vk, it kinda
> > > doesn't make much sense without that, and will allow us to match amdgpu
> > > and radeonsi
> > > 
> > > - Christian König just landed ttm bulk lru helpers, and I think we need to
> > > use those. This means vm_bind will only work with the ttm backend, but
> > > that's what we have for the big dgpu where vm_bind helps more in terms
> > > of performance, and the igfx conversion to ttm is already going on.
> > > 
> > 
> > ok
> > 
> > > Furthermore the i915 shrinker lru has stopped being an lru, so I think
> > > that should also be moved over to the ttm lru in some fashion to make sure
> > > we once again have a reasonable and consistent memory aging and reclaim
> > > architecture. The current code is just too much of a complete mess.
> > > 
> > > And since this is all fairly integral to how the code arch works I don't
> > > think merging a different version which isn't based on ttm bulk lru
> > > helpers makes sense.
> > > 
> > > Also I do think the page table lru handling needs to be included here,
> > > because that's another complete hand-rolled separate world for not much
> > > good reasons. I guess that can happen in parallel with the initial vm_bind
> > > bring-up, but it needs to be completed by the time we add the features
> > > beyond the initial support needed for vk.
> > > 
> > 
> > Ok
> > 
> > > > +VM_BIND locking hierarchy
> > > > +-------------------------
> > > > +VM_BIND locking order is as below.
> > > > +
> > > > +1) A vm_bind mutex will protect vm_bind lists. This lock is taken in vm_bind/
> > > > +   vm_unbind ioctl calls, in the execbuff path and while releasing the mapping.
> > > > +
> > > > +   In future, when GPU page faults are supported, we can potentially use a
> > > > +   rwsem instead, so that multiple pagefault handlers can take the read side
> > > > +   lock to lookup the mapping and hence can run in parallel.
> > > > +
> > > > +2) The BO's dma-resv lock will protect i915_vma state and needs to be held
> > > > +   while binding a vma and while updating dma-resv fence list of a BO.
> > > > +   The private BOs of a VM will all share a dma-resv object.
> > > > +
> > > > +   This lock is held in vm_bind call for immediate binding, during vm_unbind
> > > > +   call for unbinding and during execbuff path for binding the mapping and
> > > > +   updating the dma-resv fence list of the BO.
> > > > +
> > > > +3) Spinlock/s to protect some of the VM's lists.
> > > > +
> > > > +We will also need support for bulk LRU movement of persistent mappings to
> > > > +avoid additional latencies in the execbuff path.
> > > 
> > > This needs more detail and explanation of how each level is required. Also
> > > the shared dma_resv for VM_PRIVATE objects is kinda important to explain.
> > > 
> > > Like "some of the VM's lists" explains pretty much nothing.
> > > 
> > 
> > Ok, will explain.
> > 
> > > > +
> > > > +GPU page faults
> > > > +----------------
> > > > +Both older execbuff mode and the newer VM_BIND mode of binding will require
> > > > +using dma-fence to ensure residency.
> > > > +In future when GPU page faults are supported, no dma-fence usage is required
> > > > +as residency is purely managed by installing and removing/invalidating ptes.
> > > 
> > > This is a bit confusing. I think one part of this should be moved into the
> > > section with future vm_bind use-cases (we're not going to support page
> > > faults with legacy softpin or even worse, relocations). The locking
> > > discussion should be part of the much longer list of uses cases that
> > > motivate the locking design.
> > > 
> > 
> > Ok, will move.
> > 
> > > > +
> > > > +
> > > > +User/Memory Fence
> > > > +==================
> > > > +The idea is to take a user specified virtual address and install an interrupt
> > > > +handler to wake up the current task when the memory location passes the user
> > > > +supplied filter.
> > > > +
> > > > +A User/Memory fence is an <address, value> pair. To signal the user fence,
> > > > +the specified value is written at the specified virtual address, waking up
> > > > +the waiting process. The user can wait on a user fence with the
> > > > +gem_wait_user_fence ioctl.
> > > > +
> > > > +It also allows the user to emit their own MI_FLUSH/PIPE_CONTROL notify
> > > > +interrupt within their batches after updating the value to have sub-batch
> > > > +precision on the wakeup. Each batch can signal a user fence to indicate
> > > > +completion of the next-level batch. The completion of the very first level batch
> > > > +needs to be signaled by the command streamer. The user must provide the
> > > > +user/memory fence for this via the DRM_I915_GEM_EXECBUFFER_EXT_USER_FENCE
> > > > +extension of execbuff ioctl, so that KMD can setup the command streamer to
> > > > +signal it.
> > > > +
> > > > +User/Memory fence can also be supplied to the kernel driver to signal/wake up
> > > > +the user process after completion of an asynchronous operation.
> > > > +
> > > > +When the VM_BIND ioctl is provided with a user/memory fence via the
> > > > +I915_VM_BIND_EXT_USER_FENCE extension, it will be signaled upon the completion
> > > > +of binding of that mapping. All async binds/unbinds are serialized, hence
> > > > +signaling of the user/memory fence also indicates the completion of all previous
> > > > +binds/unbinds.
> > > > +
> > > > +This feature will be derived from the below original work:
> > > > +https://patchwork.freedesktop.org/patch/349417/
> > > 
> > > This is 1:1 tied to long running compute mode contexts (which in the uapi
> > > doc must reference the endless amounts of bikeshed summary we have in the
> > > docs about indefinite fences).
> > > 
> > 
> > Ok, will check and add reference.
> > 
> > > I'd put this into a new section about compute and userspace memory fences
> > > support, with this and the next chapter ...
> > 
> > ok
> > 
> > > > +
> > > > +
> > > > +VM_BIND use cases
> > > > +==================
> > > 
> > > ... and then make this section here focus entirely on additional vm_bind
> > > use-cases that we'll be adding later on. Which doesn't need to go into any
> > > details, it's just justification for why we want to build the world on top
> > > of vm_bind.
> > > 
> > 
> > ok
> > 
> > > > +
> > > > +Long running Compute contexts
> > > > +------------------------------
> > > > +Usage of dma-fences expects that they complete in a reasonable amount of time.
> > > > +Compute on the other hand can be long running. Hence it is appropriate for
> > > > +compute to use user/memory fences, and dma-fence usage will be limited to
> > > > +in-kernel consumption only. This requires an execbuff uapi extension to pass
> > > > +in user fence. Compute must opt-in for this mechanism with
> > > > +I915_CONTEXT_CREATE_FLAGS_LONG_RUNNING flag during context creation.
> > > > +
> > > > +The dma-fence based user interfaces like gem_wait ioctl, execbuff out fence
> > > > +and implicit dependency setting are not allowed on long running contexts.
> > > > +
> > > > +Where GPU page faults are not available, the kernel driver, upon buffer
> > > > +invalidation, will initiate a suspend (preemption) of the long running context
> > > > +with a dma-fence
> > > > +attached to it. And upon completion of that suspend fence, finish the
> > > > +invalidation, revalidate the BO and then resume the compute context. This is
> > > > +done by having a per-context fence (called suspend fence) proxying as
> > > > +i915_request fence. This suspend fence is enabled when there is a wait on it,
> > > > +which triggers the context preemption.
> > > > +
> > > > +This is much easier to support with VM_BIND compared to the current heavier
> > > > +execbuff path resource attachment.
> > > 
> > > There's a bunch of tricky code around compute mode context support, like
> > > the preempt ctx fence (or suspend fence or whatever you want to call it),
> > > and the resume work. And I think that code should be shared across
> > > drivers.
> > > 
> > > I think the right place to put this is into drm/sched, somewhere attached
> > > to the drm_sched_entity structure. I expect i915 folks to collaborate with
> > > amd and ideally also get amdkfd to adopt the same thing if possible. At
> > > least Christian has mentioned in the past that he's a bit unhappy about
> > > how this works.
> > > 
> > > Also drm/sched has dependency tracking, which will be needed to pipeline
> > > context resume operations. That needs to be used instead of i915-gem
> > > inventing yet another dependency tracking data structure (it already has 3
> > > and that's roughly 3 too many).
> > > 
> > > This means compute mode support and userspace memory fences are blocked on
> > > the drm/sched conversion, but *eh* add it to the list of reasons for why
> > > drm/sched needs to happen.
> > > 
> > > Also since we only have support for compute mode ctx in our internal tree
> > > with the guc scheduler backend anyway, and the first conversion target is
> > > the guc backend, I don't think this actually holds up a lot of the code.
> > > 
> > 
> > Hmm...ok. Currently, the context suspend and resume operations in our
> > internal tree are through an orthogonal GuC interface (not through the
> > scheduler). So, I need to look more into this part.
> > 
> > > > +Low Latency Submission
> > > > +-----------------------
> > > > +Allows the compute UMD to directly submit GPU jobs instead of through the execbuff
> > > > +ioctl. VM_BIND allows map/unmap of BOs required for directly submitted jobs.
> > > 
> > > This is really just a special case of compute mode contexts, I think I'd
> > > include that in there, but explain better what it requires (i.e. vm_bind
> > > not being synchronized against execbuf).
> > > 
> > 
> > ok
> > 
> > > > +
> > > > +Debugger
> > > > +---------
> > > > +With the debug event interface, a user space process (debugger) is able to keep
> > > > +track of and act upon resources created by another process (debuggee) and
> > > > +attached to the GPU via the vm_bind interface.
> > > > +
> > > > +Mesa/Vulkan
> > > > +------------
> > > > +VM_BIND can potentially reduce the CPU overhead in Mesa, thus improving
> > > > +performance. For Vulkan it should be straightforward to use VM_BIND.
> > > > +For Iris implicit buffer tracking must be implemented before we can harness
> > > > +VM_BIND benefits. With increasing GPU hardware performance reducing CPU
> > > > +overhead becomes more important.
> > > 
> > > Just to clarify, I don't think we can land vm_bind into upstream if it
> > > doesn't work 100% for vk. There's a bit much "can" instead of "will in
> > > this section".
> > > 
> > 
> > ok, will explain better.
> > 
> > > > +
> > > > +Page level hints settings
> > > > +--------------------------
> > > > +VM_BIND allows any hints setting per mapping instead of per BO.
> > > > +Possible hints include read-only, placement and atomicity.
> > > > +Sub-BO level placement hint will be even more relevant with
> > > > +upcoming GPU on-demand page fault support.
> > > > +
> > > > +Page level Cache/CLOS settings
> > > > +-------------------------------
> > > > +VM_BIND allows cache/CLOS settings per mapping instead of per BO.
> > > > +
> > > > +Shared Virtual Memory (SVM) support
> > > > +------------------------------------
> > > > +The VM_BIND interface can be used to map system memory directly (without the
> > > > +gem BO abstraction) using the HMM interface.
> > > 
> > > Userptr is absent here (and it's not the same as svm, at least on
> > > discrete), and this is needed for the initial version since otherwise vk
> > > can't use it because we're not at feature parity.
> > > 
> > 
> > userptr gem objects are supported in initial version (and yes it is not
> > same as SVM). I did not add it here as there is no additional uapi
> > change required to support that.
> > 
> > > Irc discussions by Maarten and Dave came up with the idea that maybe
> > > userptr for vm_bind should work _without_ any gem bo as backing storage,
> > > since that guarantees that people don't come up with funny ideas like
> > > trying to share such bo across process or mmap it and other nonsense which
> > > just doesn't work.
> > > 
> > 
> > Hmm...there is no plan to support userptr _without_ a gem bo, at least not in
> > the initial vm_bind support. Is it Ok to put it in the 'futures' section?
> > 
> > > > +
> > > > +
> > > > +Broader i915 cleanups
> > > > +=====================
> > > > +Supporting this whole new vm_bind mode of binding, which comes with its own
> > > > +use cases and locking requirements, requires proper integration
> > > > +with the existing i915 driver. This calls for some broader i915 driver
> > > > +cleanups/simplifications for maintainability of the driver going forward.
> > > > +Here are a few things that have been identified and are being looked into.
> > > > +
> > > > +- Make pagetable allocations evictable and manage them similar to VM_BIND
> > > > +  mapped objects. Page table pages are similar to persistent mappings of a
> > > > +  VM (the differences here are that the page table pages will not
> > > > +  have an i915_vma structure and, after swapping pages back in, the parent page
> > > > +  link needs to be updated).
> > > 
> > > See above, but I think this should be included as part of the initial
> > > vm_bind push.
> > > 
> > 
> > Ok, as you mentioned above, we can do it soon after initial vm_bind support
> > lands, but before we add any new vm_bind features.
> > 
> > > > +- Remove the vma lookup cache (eb->gem_context->handles_vma). The VM_BIND
> > > > +  feature does not use it, and the complexity it brings in is probably more
> > > > +  than the performance advantage we get in the legacy execbuff case.
> > > > +- Remove vma->open_count counting
> > > > +- Remove i915_vma active reference tracking. Instead use underlying BO's
> > > > +  dma-resv fence list to determine if a i915_vma is active or not.
> > > 
> > > So this is a complete mess, and really should not exist. I think it needs
> > > to be removed before we try to make i915_vma even more complex by adding
> > > vm_bind.
> > > 
> > 
> > Hmm...Need to look into this. I am not sure how much of an effort it is going
> > to be to remove i915_vma active reference tracking and instead use dma_resv
> > fences for activeness tracking.
> > 
> > > The other thing I've been pondering here is that vm_bind is really
> > > completely different from legacy vm structures for a lot of reasons:
> > > - no relocation or softpin handling, which means vm_bind has no reason to
> > > ever look at the i915_vma structure in execbuf code. Unfortunately
> > > execbuf has been rewritten to be vma instead of obj centric, so it's a
> > > 100% mismatch
> > > 
> > > - vm_bind never has to manage any vm lru. Legacy execbuf has to maintain
> > > that because the kernel manages the virtual address space fully. Again
> > > ideally that entire vma_move_to_active code and everything related to it
> > > would simply not exist.
> > > 
> > > - similar on the eviction side, the rules are quite different: For vm_bind
> > > we never tear down the vma, instead it's just moved to the list of
> > > evicted vma. Legacy vm have no need for all these additional lists, so
> > > another huge confusion.
> > > 
> > > - if the refcount is done correctly for vm_bind we wouldn't need the
> > > tricky code in the bo close paths. Unfortunately legacy vm with
> > > relocations and softpin require that vma are only a weak reference, so
> > > that cannot be removed.
> > > 
> > > - there's also a ton of special cases for ggtt handling, like the
> > > different views (for display, partial views for mmap), but also the
> > > gen2/3 alignment and padding requirements which vm_bind never needs.
> > > 
> > > I think the right thing here is to massively split the implementation
> > > behind some solid vm/vma abstraction, with a base class for vm and vma
> > > which _only_ has the pieces which both vm_bind and the legacy vm stuff
> > > needs. But it's a bit tricky to get there. I think a workable path would
> > > be:
> > > - Add a new base class to both i915_address_space and i915_vma, which
> > > starts out empty.
> > > 
> > > - As vm_bind code lands, move things that vm_bind code needs into these
> > > base classes
> > > 
> > 
> > Ok
> > 
> > > - The goal should be that these base classes are a stand-alone library
> > > that other drivers could reuse. Like we've done with the buddy
> > > allocator, which first moved from i915-gem to i915-ttm, and which amd
> > > now moved to drm/ttm for reuse by amdgpu. Ideally other drivers
> > > interested in adding something like vm_bind should be involved from the
> > > start (or maybe the entire thing reused in amdgpu, they're looking at
> > > vk sparse binding support too or at least have perf issues I think).
> > > 
> > > - Locking must be the same across all implementations, otherwise it's
> > > really not an abstraction. i915 screwed this up terribly by having
> > > different locking rules for ppgtt and ggtt, which is just nonsense.
> > > 
> > > - The legacy specific code needs to be extracted as much as possible and
> > > shoved into separate files. In execbuf this means we need to get back to
> > > object centric flow, and the slowpaths need to become a lot simpler
> > > again (Maarten has cleaned up some of this, but there's still a silly
> > > amount of hacks in there with funny layering).
> > > 
> > 
> > This also, we can do soon after vm_bind code lands right?
> > 
> > > - I think if stuff like the vma eviction details (list movement and
> > > locking and refcounting of the underlying object)
> > > 
> > > > +
> > > > +These can be worked upon after initial vm_bind support is added.
> > > 
> > > I don't think that works, given how badly i915-gem team screwed up in
> > > other places. And those places had to be fixed by adopting shared code
> > > like ttm. Plus there's already a huge unfulfilled promise pending with the
> > > drm/sched conversion; the i915-gem team is clearly deeply in the red here :-/
> > > 
> > 
> > Hmmm ok. As I mentioned above, I need to look into how to remove i915_vma
> > active reference tracking code from i915 driver. Wonder if there is any
> > middle ground here like not using that in vm_bind mode?
> > 
> > Niranjana
> > 
> > > Cheers, Daniel
> > > 
> > > > +
> > > > +
> > > > +UAPI
> > > > +=====
> > > > +Uapi definition can be found here:
> > > > +.. kernel-doc:: Documentation/gpu/rfc/i915_vm_bind.h
> > > > diff --git a/Documentation/gpu/rfc/index.rst b/Documentation/gpu/rfc/index.rst
> > > > index 91e93a705230..7d10c36b268d 100644
> > > > --- a/Documentation/gpu/rfc/index.rst
> > > > +++ b/Documentation/gpu/rfc/index.rst
> > > > @@ -23,3 +23,7 @@ host such documentation:
> > > > .. toctree::
> > > > 
> > > >     i915_scheduler.rst
> > > > +
> > > > +.. toctree::
> > > > +
> > > > +    i915_vm_bind.rst
> > > > --
> > > > 2.21.0.rc0.32.g243a4c7e27
> > > > 
> > > 
> > > --
> > > Daniel Vetter
> > > Software Engineer, Intel Corporation
> > > http://blog.ffwll.ch

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [Intel-gfx] [RFC v2 1/2] drm/doc/rfc: VM_BIND feature design document
@ 2022-04-28 12:29           ` Daniel Vetter
  0 siblings, 0 replies; 31+ messages in thread
From: Daniel Vetter @ 2022-04-28 12:29 UTC (permalink / raw)
  To: Niranjana Vishwanathapura
  Cc: Dave Airlie, intel-gfx, dri-devel, chris.p.wilson, daniel.vetter,
	thomas.hellstrom, Christian König, Ben Skeggs

On Wed, Apr 27, 2022 at 08:41:35AM -0700, Niranjana Vishwanathapura wrote:
> On Wed, Apr 20, 2022 at 03:45:25PM -0700, Niranjana Vishwanathapura wrote:
> > On Thu, Mar 31, 2022 at 10:28:48AM +0200, Daniel Vetter wrote:
> > > Adding a pile of people who've expressed interest in vm_bind for their
> > > drivers.
> > > 
> > > Also note to the intel folks: This is largely written with me having my
> > > subsystem co-maintainer hat on, i.e. what I think is the right thing to do
> > > here for the subsystem at large. There is substantial rework involved
> > > here, but it's not any different from i915 adopting ttm or i915 adopting
> > > drm/sched, and I do think this stuff needs to happen in one form or
> > > another.
> > > 
> > > On Mon, Mar 07, 2022 at 12:31:45PM -0800, Niranjana Vishwanathapura wrote:
> > > > VM_BIND design document with description of intended use cases.
> > > > 
> > > > Signed-off-by: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>
> > > > ---
> > > > Documentation/gpu/rfc/i915_vm_bind.rst | 210 +++++++++++++++++++++++++
> > > > Documentation/gpu/rfc/index.rst        |   4 +
> > > > 2 files changed, 214 insertions(+)
> > > > create mode 100644 Documentation/gpu/rfc/i915_vm_bind.rst
> > > > 
> > > > diff --git a/Documentation/gpu/rfc/i915_vm_bind.rst b/Documentation/gpu/rfc/i915_vm_bind.rst
> > > > new file mode 100644
> > > > index 000000000000..cdc6bb25b942
> > > > --- /dev/null
> > > > +++ b/Documentation/gpu/rfc/i915_vm_bind.rst
> > > > @@ -0,0 +1,210 @@
> > > > +==========================================
> > > > +I915 VM_BIND feature design and use cases
> > > > +==========================================
> > > > +
> > > > +VM_BIND feature
> > > > +================
> > > > +DRM_I915_GEM_VM_BIND/UNBIND ioctls allow the UMD to bind/unbind GEM buffer
> > > > +objects (BOs) or sections of a BO at specified GPU virtual addresses in
> > > > +a specified address space (VM).
> > > > +
> > > > +These mappings (also referred to as persistent mappings) will be persistent
> > > > +across multiple GPU submissions (execbuff) issued by the UMD, without the user
> > > > +having to provide a list of all required mappings during each submission
> > > > +(as required by the older execbuff mode).
> > > > +
> > > > +The VM_BIND ioctl defers binding the mappings until the next execbuff
> > > > +submission where they will be required, or binds immediately if the
> > > > +I915_GEM_VM_BIND_IMMEDIATE flag is set (useful if a mapping is required
> > > > +for an active context).
> > > 
> > > So this is a screw-up I've done, and for upstream I think we need to fix
> > > it: Implicit sync is bad, and it's also still a bad idea for vm_bind, and
> > > I was wrong suggesting we should do this a few years back when we kicked
> > > this off internally :-(
> > > 
> > > What I think we need is just always VM_BIND_IMMEDIATE mode, and then a few
> > > things on top:
> > > - in and out fences, like with execbuf, to allow userspace to sync with
> > > execbuf as needed
> > > - for compute-mode context this means userspace memory fences
> > > - for legacy context this means a timeline syncobj in drm_syncobj
> > > 
> > > No sync_file or anything else like this at all. This means a bunch of
> > > work, but also it'll have benefits because it means we should be able to
> > > use exactly the same code paths and logic for both compute and for legacy
> > > context, because drm_syncobj supports future fence semantics.
> > > 
> > 
> > Thanks Daniel,
> > Ok, will update
> > 
> 
> I had a long conversation with Daniel on some of the points discussed here.
> Thanks to Daniel for clarifying many points here.
> 
> Here is the summary of the discussion.
> 
> 1) A prep patch is needed to update documentation of some existing uapi and this
>   new VM_BIND uapi can update/refer to that.
>   I will include this prep patch in the next revision of this RFC series.
>   Will also include the uapi header file in the rst file so that it gets rendered.
> 
> 2) Will update documentation here with proper use of dma_resv_usage while adding
>   fences to vm_bind objects. It is going to be DMA_RESV_USAGE_BOOKKEEP by default,
>   if not overridden via the execlist in the execbuff path.
> 
> 3) Add extension to execbuff ioctl to specify batch buffer as GPU virtual address
>   instead of having to pass it as a BO handle in execlist. This will also make the
>   execlist usage solely for implicit sync setting which is further discussed below.
> 
> 4) Need to look into when Jason's dma-buf fence import/export ioctl support will
>   land and whether it will be used both for vk and gl. Need to sync with Jason on this.
>   Probably the better option here would be to not support execlist in execbuff path in
>   vm_bind mode for initial vm_bind support (hoping Jason's dma-buf fence import/export
>   ioctl will be enough). We can add support for execlist in execbuff for vm_bind mode
>   later if required (say for gl).

So I'm again less sure whether the import/export ioctl is the right thing
for gl, but I still think we should try. The reason is that we really
need to set the implicit sync set per execbuf, otherwise there are oversync
issues. So one of the ideas we've discussed where the implicit sync set
would be controlled through vm_bind doesn't work for gl.
-Daniel

> 
> 5) There are a lot of things in the execbuff path that don't apply in VM_BIND mode (like
>   relocations, implicit sync etc). Separate them out by using function pointers wherever
>   the functionality differs between current design and the newer VM_BIND design.
> 
> 6) Separate out i915_vma active reference counting in execbuff path and do not use it in
>   VM_BIND mode. Instead use dma-resv fence checking for VM_BIND mode. This should be easier
>   to get working with the current TTM backend (which initial VM_BIND support will use).
>   And remove i915_vma active reference counting fully while supporting TTM backend for igfx.
> 
> 7) As we support compute mode contexts only with GuC scheduler backend and compute mode requires
>   support for suspend and resume of contexts, it will have a dependency on i915 drm scheduler
>   conversion.
> 
> Will revise this series accordingly.
> 
> Thanks,
> Niranjana
> 
> > > Also on the implementation side we still need to install dma_fence to the
> > > various dma_resv, and for this we need the new dma_resv_usage series from
> > > Christian König first. vm_bind fences can then use the USAGE_BOOKKEEPING
> > > flag to make sure they never result in an oversync issue with execbuf. I
> > > don't think trying to land vm_bind without that prep work in
> > > dma_resv_usage makes sense.
> > > 
> > 
> > Ok, but that is not a dependency for this VM_BIND design RFC patch right?
> > I will add this to the documentation here.
> > 
> > > Also as soon as dma_resv_usage has landed there's a few cleanups we should
> > > do in i915:
> > > - ttm bo moving code should probably simplify a bit (and maybe more of the
> > > code should be pushed as helpers into ttm)
> > > - clflush code should be moved over to using USAGE_KERNEL and the various
> > > hacks and special cases should be ditched. See df94fd05e69e ("drm/i915:
> > > expand on the kernel-doc for cache_dirty") for a bit more context
> > > 
> > > This is still not yet enough, since if a vm_bind races with an eviction we
> > > might stall on the new buffers being readied first before the context can
> > > continue. This needs some care to make sure that vma which aren't fully
> > > bound yet are on a separate list, and vma which are marked for unbinding
> > > are removed from the main working set list as soon as possible.
> > > 
> > > All of these things are relevant for the uapi semantics, which means
> > > - they need to be documented in the uapi kerneldoc, ideally with example
> > > flows
> > > - umd need to ack this
> > > 
> > 
> > Ok
> > 
> > > The other thing here is the async/nonblocking path. I think we still need
> > > that one, but again it should not sync with anything going on in execbuf,
> > > but simply execute the ioctl code in a kernel thread. The idea here is
> > > that this works like a special gpu engine, so that compute and vk can
> > > schedule bindings interleaved with rendering. This should be enough to get
> > > a performant vk sparse binding/textures implementation.
> > > 
> > > But I'm not entirely sure on this one, so this definitely needs acks from
> > > umds.
> > > 
> > > > +VM_BIND feature is advertised to user via I915_PARAM_HAS_VM_BIND.
> > > > +User has to opt-in for VM_BIND mode of binding for an address space (VM)
> > > > +during VM creation time via I915_VM_CREATE_FLAGS_USE_VM_BIND extension.
> > > > +A VM in VM_BIND mode will not support older execbuff mode of binding.
> > > > +
> > > > +UMDs can still send BOs of these persistent mappings in execlist of execbuff
> > > > +for specifying BO dependencies (implicit fencing) and to use BO as a batch,
> > > > +but those BOs should be mapped ahead via vm_bind ioctl.
> > > 
> > > should or must?
> > > 
> > 
> > Must, will fix.
> > 
> > > Also I'm not really sure that's a great interface. The batchbuffer really
> > > only needs to be an address, so maybe all we need is an extension to
> > > supply an u64 batchbuffer address instead of trying to retrofit this into
> > > an unfitting current uapi.
> > > 
> > 
> > Yah, this was considered, but was decided to do it as later optimization.
> > But if we were to remove execlist entries completely (ie., no implicit
> > sync also), then we need to do this from the beginning.
> > 
> > > And for implicit sync there's two things:
> > > - for vk I think the right uapi is the dma-buf fence import/export ioctls
> > > from Jason Ekstrand. I think we should land that first instead of
> > > hacking funny concepts together
> > 
> > I did not understand fully, can you point to it?
> > 
> > > - for gl the dma-buf import/export might not be fast enough, since gl
> > > needs to do a _lot_ of implicit sync. There we might need to use the
> > > execbuffer buffer list, but then we should have extremely clear uapi
> > > rules which disallow _everything_ except setting the explicit sync uapi
> > > 
> > 
> > Ok, so then, we still need to support implicit sync in vm_bind mode. Right?
> > 
> > > Again all this stuff needs to be documented in detail in the kerneldoc
> > > uapi spec.
> > > 
> > 
> > ok
> > 
> > > > +VM_BIND features include,
> > > > +- Multiple Virtual Address (VA) mappings can map to the same physical pages
> > > > +  of an object (aliasing).
> > > > +- VA mapping can map to a partial section of the BO (partial binding).
> > > > +- Support capture of persistent mappings in the dump upon GPU error.
> > > > +- TLB is flushed upon unbind completion. Batching of TLB flushes in some
> > > > +  usecases will be helpful.
> > > > +- Asynchronous vm_bind and vm_unbind support.
> > > > +- VM_BIND uses user/memory fence mechanism for signaling bind completion
> > > > +  and for signaling batch completion in long running contexts (explained
> > > > +  below).
> > > 
> > > This should all be in the kerneldoc.
> > > 
> > 
> > ok
> > 
> > > > +VM_PRIVATE objects
> > > > +------------------
> > > > +By default, BOs can be mapped on multiple VMs and can also be dma-buf
> > > > +exported. Hence these BOs are referred to as Shared BOs.
> > > > +During each execbuff submission, the request fence must be added to the
> > > > +dma-resv fence list of all shared BOs mapped on the VM.
> > > > +
> > > > +VM_BIND feature introduces an optimization where the user can create a BO
> > > > +which is private to a specified VM via the I915_GEM_CREATE_EXT_VM_PRIVATE
> > > > +flag during BO creation. Unlike Shared BOs, these VM private BOs can only
> > > > +be mapped on the VM they are private to and can't be dma-buf exported.
> > > > +All private BOs of a VM share the dma-resv object. Hence during each execbuff
> > > > +submission, they need only one dma-resv fence list update. Thus the fast
> > > > +path (where required mappings are already bound) submission latency is O(1)
> > > > +w.r.t the number of VM private BOs.
> > > 
> > > Two things:
> > > 
> > > - I think the above is required to for initial vm_bind for vk, it kinda
> > > doesn't make much sense without that, and will allow us to match amdgpu
> > > and radeonsi
> > > 
> > > - Christian König just landed ttm bulk lru helpers, and I think we need to
> > > use those. This means vm_bind will only work with the ttm backend, but
> > > that's what we have for the big dgpu where vm_bind helps more in terms
> > > of performance, and the igfx conversion to ttm is already going on.
> > > 
> > 
> > ok
> > 
> > > Furthermore the i915 shrinker lru has stopped being an lru, so I think
> > > that should also be moved over to the ttm lru in some fashion to make sure
> > > we once again have a reasonable and consistent memory aging and reclaim
> > > architecture. The current code is just too much of a complete mess.
> > > 
> > > And since this is all fairly integral to how the code arch works I don't
> > > think merging a different version which isn't based on ttm bulk lru
> > > helpers makes sense.
> > > 
> > > Also I do think the page table lru handling needs to be included here,
> > > because that's another complete hand-rolled separate world for not much
> > > good reasons. I guess that can happen in parallel with the initial vm_bind
> > > bring-up, but it needs to be completed by the time we add the features
> > > beyond the initial support needed for vk.
> > > 
> > 
> > Ok
> > 
> > > > +VM_BIND locking hierarchy
> > > > +-------------------------
> > > > +VM_BIND locking order is as below.
> > > > +
> > > > +1) A vm_bind mutex will protect vm_bind lists. This lock is taken in vm_bind/
> > > > +   vm_unbind ioctl calls, in the execbuff path and while releasing the mapping.
> > > > +
> > > > +   In future, when GPU page faults are supported, we can potentially use a
> > > > +   rwsem instead, so that multiple pagefault handlers can take the read side
> > > > +   lock to lookup the mapping and hence can run in parallel.
> > > > +
> > > > +2) The BO's dma-resv lock will protect i915_vma state and needs to be held
> > > > +   while binding a vma and while updating dma-resv fence list of a BO.
> > > > +   The private BOs of a VM will all share a dma-resv object.
> > > > +
> > > > +   This lock is held in vm_bind call for immediate binding, during vm_unbind
> > > > +   call for unbinding and during execbuff path for binding the mapping and
> > > > +   updating the dma-resv fence list of the BO.
> > > > +
> > > > +3) Spinlock/s to protect some of the VM's lists.
> > > > +
> > > > +We will also need support for bulk LRU movement of persistent mappings to
> > > > +avoid additional latencies in execbuff path.
> > > 
> > > This needs more detail and explanation of how each level is required. Also
> > > the shared dma_resv for VM_PRIVATE objects is kinda important to explain.
> > > 
> > > Like "some of the VM's lists" explains pretty much nothing.
> > > 
> > 
> > Ok, will explain.
> > 
> > > > +
> > > > +GPU page faults
> > > > +----------------
> > > > +Both older execbuff mode and the newer VM_BIND mode of binding will require
> > > > +using dma-fence to ensure residency.
> > > > +In future when GPU page faults are supported, no dma-fence usage is required
> > > > +as residency is purely managed by installing and removing/invalidating ptes.
> > > 
> > > This is a bit confusing. I think one part of this should be moved into the
> > > section with future vm_bind use-cases (we're not going to support page
> > > faults with legacy softpin or even worse, relocations). The locking
> > > discussion should be part of the much longer list of use cases that
> > > motivate the locking design.
> > > 
> > 
> > Ok, will move.
> > 
> > > > +
> > > > +
> > > > +User/Memory Fence
> > > > +==================
> > > > +The idea is to take a user specified virtual address and install an interrupt
> > > > +handler to wake up the current task when the memory location passes the user
> > > > +supplied filter.
> > > > +
> > > > +User/Memory fence is an <address, value> pair. To signal the user fence,
> > > > +the specified value will be written at the specified virtual address,
> > > > +waking up the waiting process. User can wait on a user fence with the
> > > > +gem_wait_user_fence ioctl.
> > > > +
> > > > +It also allows the user to emit their own MI_FLUSH/PIPE_CONTROL notify
> > > > +interrupt within their batches after updating the value to have sub-batch
> > > > +precision on the wakeup. Each batch can signal a user fence to indicate
> > > > +the completion of the next level batch. The completion of the very first
> > > > +level batch
> > > > +needs to be signaled by the command streamer. The user must provide the
> > > > +user/memory fence for this via the DRM_I915_GEM_EXECBUFFER_EXT_USER_FENCE
> > > > +extension of execbuff ioctl, so that KMD can setup the command streamer to
> > > > +signal it.
> > > > +
> > > > +User/Memory fence can also be supplied to the kernel driver to signal/wake up
> > > > +the user process after completion of an asynchronous operation.
> > > > +
> > > > +When the VM_BIND ioctl is provided with a user/memory fence via the
> > > > +I915_VM_BIND_EXT_USER_FENCE extension, it will be signaled upon the completion
> > > > +of binding of that mapping. All async binds/unbinds are serialized, hence
> > > > +signaling of the user/memory fence also indicates the completion of all
> > > > +previous binds/unbinds.
> > > > +
> > > > +This feature will be derived from the below original work:
> > > > +https://patchwork.freedesktop.org/patch/349417/
> > > 
> > > This is 1:1 tied to long running compute mode contexts (which in the uapi
> > > doc must reference the endless amounts of bikeshed summary we have in the
> > > docs about indefinite fences).
> > > 
> > 
> > Ok, will check and add reference.
> > 
> > > I'd put this into a new section about compute and userspace memory fences
> > > support, with this and the next chapter ...
> > 
> > ok
> > 
> > > > +
> > > > +
> > > > +VM_BIND use cases
> > > > +==================
> > > 
> > > ... and then make this section here focus entirely on additional vm_bind
> > > use-cases that we'll be adding later on. Which doesn't need to go into any
> > > details, it's just justification for why we want to build the world on top
> > > of vm_bind.
> > > 
> > 
> > ok
> > 
> > > > +
> > > > +Long running Compute contexts
> > > > +------------------------------
> > > > +Usage of dma-fences expects that they complete in a reasonable amount of time.
> > > > +Compute on the other hand can be long running. Hence it is appropriate for
> > > > +compute to use user/memory fence and dma-fence usage will be limited to
> > > > +in-kernel consumption only. This requires an execbuff uapi extension to pass
> > > > +in user fence. Compute must opt-in for this mechanism with
> > > > +I915_CONTEXT_CREATE_FLAGS_LONG_RUNNING flag during context creation.
> > > > +
> > > > +The dma-fence based user interfaces like gem_wait ioctl, execbuff out fence
> > > > +and implicit dependency setting is not allowed on long running contexts.
> > > > +
> > > > +Where GPU page faults are not available, kernel driver upon buffer invalidation
> > > > +will initiate a suspend (preemption) of long running context with a dma-fence
> > > > +attached to it. And upon completion of that suspend fence, finish the
> > > > +invalidation, revalidate the BO and then resume the compute context. This is
> > > > +done by having a per-context fence (called suspend fence) proxying as
> > > > +i915_request fence. This suspend fence is enabled when there is a wait on it,
> > > > +which triggers the context preemption.
> > > > +
> > > > +This is much easier to support with VM_BIND compared to the current heavier
> > > > +execbuff path resource attachment.
> > > 
> > > There's a bunch of tricky code around compute mode context support, like
> > > the preempt ctx fence (or suspend fence or whatever you want to call it),
> > > and the resume work. And I think that code should be shared across
> > > drivers.
> > > 
> > > I think the right place to put this is into drm/sched, somewhere attached
> > > to the drm_sched_entity structure. I expect i915 folks to collaborate with
> > > amd and ideally also get amdkfd to adopt the same thing if possible. At
> > > least Christian has mentioned in the past that he's a bit unhappy about
> > > how this works.
> > > 
> > > Also drm/sched has dependency tracking, which will be needed to pipeline
> > > context resume operations. That needs to be used instead of i915-gem
> > > inventing yet another dependency tracking data structure (it already has 3
> > > and that's roughly 3 too many).
> > > 
> > > This means compute mode support and userspace memory fences are blocked on
> > > the drm/sched conversion, but *eh* add it to the list of reasons for why
> > > drm/sched needs to happen.
> > > 
> > > Also since we only have support for compute mode ctx in our internal tree
> > > with the guc scheduler backend anyway, and the first conversion target is
> > > the guc backend, I don't think this actually holds up a lot of the code.
> > > 
> > 
> > Hmm...ok. Currently, the context suspend and resume operations in our
> > internal tree are through an orthogonal GuC interface (not through the
> > scheduler). So, I need to look more into this part.
> > 
> > > > +Low Latency Submission
> > > > +-----------------------
> > > > +Allows compute UMD to directly submit GPU jobs instead of through execbuff
> > > > +ioctl. VM_BIND allows map/unmap of BOs required for directly submitted jobs.
> > > 
> > > This is really just a special case of compute mode contexts, I think I'd
> > > include that in there, but explain better what it requires (i.e. vm_bind
> > > not being synchronized against execbuf).
> > > 
> > 
> > ok
> > 
> > > > +
> > > > +Debugger
> > > > +---------
> > > > +With the debug event interface, a user space process (debugger) is able to
> > > > +keep track of and act upon resources created by another process (debuggee)
> > > > +and attached to the GPU via the vm_bind interface.
> > > > +
> > > > +Mesa/Vulkan
> > > > +------------
> > > > +VM_BIND can potentially reduce the CPU-overhead in Mesa thus improving
> > > > +performance. For Vulkan it should be straightforward to use VM_BIND.
> > > > +For Iris, implicit buffer tracking must be implemented before we can harness
> > > > +VM_BIND benefits. With increasing GPU hardware performance, reducing CPU
> > > > +overhead becomes more important.
> > > 
> > > Just to clarify, I don't think we can land vm_bind into upstream if it
> > > doesn't work 100% for vk. There's a bit much "can" instead of "will in
> > > this section".
> > > 
> > 
> > ok, will explain better.
> > 
> > > > +
> > > > +Page level hints settings
> > > > +--------------------------
> > > > +VM_BIND allows setting any hints per mapping instead of per BO.
> > > > +Possible hints include read-only, placement and atomicity.
> > > > +Sub-BO level placement hint will be even more relevant with
> > > > +upcoming GPU on-demand page fault support.
> > > > +
> > > > +Page level Cache/CLOS settings
> > > > +-------------------------------
> > > > +VM_BIND allows cache/CLOS settings per mapping instead of per BO.
> > > > +
> > > > +Shared Virtual Memory (SVM) support
> > > > +------------------------------------
> > > > +VM_BIND interface can be used to map system memory directly (without gem BO
> > > > +abstraction) using the HMM interface.
> > > 
> > > Userptr is absent here (and it's not the same as svm, at least on
> > > discrete), and this is needed for the initial version since otherwise vk
> > > can't use it because we're not at feature parity.
> > > 
> > 
> > userptr gem objects are supported in the initial version (and yes, it is
> > not the same as SVM). I did not add it here as there is no additional uapi
> > change required to support that.
> > 
> > > Irc discussions by Maarten and Dave came up with the idea that maybe
> > > userptr for vm_bind should work _without_ any gem bo as backing storage,
> > > since that guarantees that people don't come up with funny ideas like
> > > trying to share such bo across process or mmap it and other nonsense which
> > > just doesn't work.
> > > 
> > 
> > Hmm...there is no plan to support userptr _without_ a gem bo, at least not
> > in the initial vm_bind support. Is it OK to put it in the 'futures' section?
> > 
> > > > +
> > > > +
> > > > +Broader i915 cleanups
> > > > +=====================
> > > > +Supporting this whole new vm_bind mode of binding, with its own usecases
> > > > +and locking requirements, requires proper integration with the existing
> > > > +i915 driver. This calls for some broader i915 driver
> > > > +cleanups/simplifications for maintainability of the driver going forward.
> > > > +Here are a few things identified that are being looked into.
> > > > +
> > > > +- Make pagetable allocations evictable and manage them similar to VM_BIND
> > > > +  mapped objects. Page table pages are similar to persistent mappings of a
> > > > +  VM (difference here are that the page table pages will not
> > > > +  have an i915_vma structure and after swapping pages back in, parent page
> > > > +  link needs to be updated).
> > > 
> > > See above, but I think this should be included as part of the initial
> > > vm_bind push.
> > > 
> > 
> > Ok, as you mentioned above, we can do it soon after initial vm_bind support
> > lands, but before we add any new vm_bind features.
> > 
> > > > +- Remove vma lookup cache (eb->gem_context->handles_vma). The VM_BIND feature
> > > > +  does not use it, and the complexity it brings in is probably more than the
> > > > +  performance advantage we get in the legacy execbuff case.
> > > > +- Remove vma->open_count counting
> > > > +- Remove i915_vma active reference tracking. Instead use underlying BO's
> > > > +  dma-resv fence list to determine if an i915_vma is active or not.
> > > 
> > > So this is a complete mess, and really should not exist. I think it needs
> > > to be removed before we try to make i915_vma even more complex by adding
> > > vm_bind.
> > > 
> > 
> > Hmm...Need to look into this. I am not sure how much of an effort it is going
> > to be to remove i915_vma active reference tracking and instead use dma_resv
> > fences for activeness tracking.
> > 
> > > The other thing I've been pondering here is that vm_bind is really
> > > completely different from legacy vm structures for a lot of reasons:
> > > - no relocation or softpin handling, which means vm_bind has no reason to
> > > ever look at the i915_vma structure in execbuf code. Unfortunately
> > > execbuf has been rewritten to be vma instead of obj centric, so it's a
> > > 100% mismatch
> > > 
> > > - vm_bind never has to manage any vm lru. Legacy execbuf has to maintain
> > > that because the kernel manages the virtual address space fully. Again
> > > ideally that entire vma_move_to_active code and everything related to it
> > > would simply not exist.
> > > 
> > > - similar on the eviction side, the rules are quite different: For vm_bind
> > > we never tear down the vma, instead it's just moved to the list of
> > > evicted vma. Legacy vm have no need for all these additional lists, so
> > > another huge confusion.
> > > 
> > > - if the refcount is done correctly for vm_bind we wouldn't need the
> > > tricky code in the bo close paths. Unfortunately legacy vm with
> > > relocations and softpin require that vma are only a weak reference, so
> > > that cannot be removed.
> > > 
> > > - there's also a ton of special cases for ggtt handling, like the
> > > different views (for display, partial views for mmap), but also the
> > > gen2/3 alignment and padding requirements which vm_bind never needs.
> > > 
> > > I think the right thing here is to massively split the implementation
> > > behind some solid vm/vma abstraction, with a base class for vm and vma
> > > which _only_ has the pieces which both vm_bind and the legacy vm stuff
> > > needs. But it's a bit tricky to get there. I think a workable path would
> > > be:
> > > - Add a new base class to both i915_address_space and i915_vma, which
> > > starts out empty.
> > > 
> > > - As vm_bind code lands, move things that vm_bind code needs into these
> > > base classes
> > > 
> > 
> > Ok
> > 
> > > - The goal should be that these base classes are a stand-alone library
> > > that other drivers could reuse. Like we've done with the buddy
> > > allocator, which first moved from i915-gem to i915-ttm, and which amd
> > > now moved to drm/ttm for reuse by amdgpu. Ideally other drivers
> > > interested in adding something like vm_bind should be involved from the
> > > start (or maybe the entire thing reused in amdgpu, they're looking at
> > > vk sparse binding support too or at least have perf issues I think).
> > > 
> > > - Locking must be the same across all implementations, otherwise it's
> > > really not an abstraction. i915 screwed this up terribly by having
> > > different locking rules for ppgtt and ggtt, which is just nonsense.
> > > 
> > > - The legacy specific code needs to be extracted as much as possible and
> > > shoved into separate files. In execbuf this means we need to get back to
> > > object centric flow, and the slowpaths need to become a lot simpler
> > > again (Maarten has cleaned up some of this, but there's still a silly
> > > amount of hacks in there with funny layering).
> > > 
> > 
> > This also, we can do soon after vm_bind code lands right?
> > 
> > > - I think if stuff like the vma eviction details (list movement and
> > > locking and refcounting of the underlying object)
> > > 
> > > > +
> > > > +These can be worked upon after initial vm_bind support is added.
> > > 
> > > I don't think that works, given how badly i915-gem team screwed up in
> > > other places. And those places had to be fixed by adopting shared code
> > > like ttm. Plus there's already a huge unfulfilled promise pending with the
> > > drm/sched conversion, i915-gem team is clearly deeply in the red here :-/
> > > 
> > 
> > Hmmm ok. As I mentioned above, I need to look into how to remove the
> > i915_vma active reference tracking code from the i915 driver. I wonder if
> > there is any middle ground here, like not using it in vm_bind mode?
> > 
> > Niranjana
> > 
> > > Cheers, Daniel
> > > 
> > > > +
> > > > +
> > > > +UAPI
> > > > +=====
> > > > +Uapi definition can be found here:
> > > > +.. kernel-doc:: Documentation/gpu/rfc/i915_vm_bind.h
> > > > diff --git a/Documentation/gpu/rfc/index.rst b/Documentation/gpu/rfc/index.rst
> > > > index 91e93a705230..7d10c36b268d 100644
> > > > --- a/Documentation/gpu/rfc/index.rst
> > > > +++ b/Documentation/gpu/rfc/index.rst
> > > > @@ -23,3 +23,7 @@ host such documentation:
> > > > .. toctree::
> > > > 
> > > >     i915_scheduler.rst
> > > > +
> > > > +.. toctree::
> > > > +
> > > > +    i915_vm_bind.rst
> > > > --
> > > > 2.21.0.rc0.32.g243a4c7e27
> > > > 
> > > 
> > > --
> > > Daniel Vetter
> > > Software Engineer, Intel Corporation
> > > http://blog.ffwll.ch

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC v2 1/2] drm/doc/rfc: VM_BIND feature design document
  2022-04-27 15:41         ` [Intel-gfx] " Niranjana Vishwanathapura
@ 2022-05-09 23:11           ` Niranjana Vishwanathapura
  -1 siblings, 0 replies; 31+ messages in thread
From: Niranjana Vishwanathapura @ 2022-05-09 23:11 UTC (permalink / raw)
  To: Daniel Vetter
  Cc: Dave Airlie, intel-gfx, chris.p.wilson, Bloomfield, Jon,
	dri-devel, Jason Ekstrand, daniel.vetter, thomas.hellstrom,
	Christian König, Ben Skeggs

On Wed, Apr 27, 2022 at 08:41:35AM -0700, Niranjana Vishwanathapura wrote:
>On Wed, Apr 20, 2022 at 03:45:25PM -0700, Niranjana Vishwanathapura wrote:
>>On Thu, Mar 31, 2022 at 10:28:48AM +0200, Daniel Vetter wrote:
>>>Adding a pile of people who've expressed interest in vm_bind for their
>>>drivers.
>>>
>>>Also note to the intel folks: This is largely written with me having my
>>>subsystem co-maintainer hat on, i.e. what I think is the right thing to do
>>>here for the subsystem at large. There is substantial rework involved
>>>here, but it's not any different from i915 adopting ttm or i915 adopting
>>>drm/sched, and I do think this stuff needs to happen in one form or
>>>another.
>>>
>>>On Mon, Mar 07, 2022 at 12:31:45PM -0800, Niranjana Vishwanathapura wrote:
>>>>VM_BIND design document with description of intended use cases.
>>>>
>>>>Signed-off-by: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>
>>>>---
>>>>Documentation/gpu/rfc/i915_vm_bind.rst | 210 +++++++++++++++++++++++++
>>>>Documentation/gpu/rfc/index.rst        |   4 +
>>>>2 files changed, 214 insertions(+)
>>>>create mode 100644 Documentation/gpu/rfc/i915_vm_bind.rst
>>>>
>>>>diff --git a/Documentation/gpu/rfc/i915_vm_bind.rst b/Documentation/gpu/rfc/i915_vm_bind.rst
>>>>new file mode 100644
>>>>index 000000000000..cdc6bb25b942
>>>>--- /dev/null
>>>>+++ b/Documentation/gpu/rfc/i915_vm_bind.rst
>>>>@@ -0,0 +1,210 @@
>>>>+==========================================
>>>>+I915 VM_BIND feature design and use cases
>>>>+==========================================
>>>>+
>>>>+VM_BIND feature
>>>>+================
>>>>+DRM_I915_GEM_VM_BIND/UNBIND ioctls allow the UMD to bind/unbind GEM buffer
>>>>+objects (BOs) or sections of a BOs at specified GPU virtual addresses on
>>>>+a specified address space (VM).
>>>>+
>>>>+These mappings (also referred to as persistent mappings) will be persistent
>>>>+across multiple GPU submissions (execbuff) issued by the UMD, without user
>>>>+having to provide a list of all required mappings during each submission
>>>>+(as required by older execbuff mode).
>>>>+
>>>>+The VM_BIND ioctl defers binding the mappings until the next execbuff submission
>>>>+where it will be required, or immediately if I915_GEM_VM_BIND_IMMEDIATE
>>>>+flag is set (useful if mapping is required for an active context).
>>>
>>>So this is a screw-up I've done, and for upstream I think we need to fix
>>>it: Implicit sync is bad, and it's also still a bad idea for vm_bind, and
>>>I was wrong suggesting we should do this a few years back when we kicked
>>>this off internally :-(
>>>
>>>What I think we need is just always VM_BIND_IMMEDIATE mode, and then a few
>>>things on top:
>>>- in and out fences, like with execbuf, to allow userspace to sync with
>>>execbuf as needed
>>>- for compute-mode context this means userspace memory fences
>>>- for legacy context this means a timeline syncobj in drm_syncobj
>>>
>>>No sync_file or anything else like this at all. This means a bunch of
>>>work, but also it'll have benefits because it means we should be able to
>>>use exactly the same code paths and logic for both compute and for legacy
>>>context, because drm_syncobj support future fence semantics.
>>>
>>
>>Thanks Daniel,
>>Ok, will update
>>
>
>I had a long conversation with Daniel on some of the points discussed here.
>Thanks to Daniel for clarifying many points here.
>
>Here is the summary of the discussion.
>
>1) A prep patch is needed to update the documentation of some existing uapi,
>  which this new VM_BIND uapi can then refer to.
>  I will include this prep patch in the next revision of this RFC series.
>  Will also include the uapi header file in the rst file so that it gets rendered.
>
>2) Will update documentation here with proper use of dma_resv_usage while adding
>  fences to vm_bind objects. It is going to be DMA_RESV_USAGE_BOOKKEEP by default,
>  if not overridden via the execlist in the execbuff path.
>
>3) Add an extension to the execbuff ioctl to specify the batch buffer as a GPU
>  virtual address instead of having to pass it as a BO handle in the execlist.
>  This will also make the execlist usage solely for implicit sync setting, which
>  is further discussed below.
>
>4) Need to look into when Jason's dma-buf fence import/export ioctl support will
>  land and whether it will be used both for vk and gl. Need to sync with Jason on this.
>  Probably the better option here would be to not support execlist in execbuff path in
>  vm_bind mode for initial vm_bind support (hoping Jason's dma-buf fence import/export
>  ioctl will be enough). We can add support for execlist in execbuff for vm_bind mode
>  later if required (say for gl).
>
>5) There are a lot of things in the execbuff path that don't apply in VM_BIND mode
>  (like relocations, implicit sync etc). Separate them out by using function pointers
>  wherever the functionality differs between the current design and the newer VM_BIND design.
>
>6) Separate out i915_vma active reference counting in execbuff path and do not use it in
>  VM_BIND mode. Instead use dma-resv fence checking for VM_BIND mode. This should be easier
>  to get it working with the current TTM backend (which initial VM_BIND support will use).
>  And remove i915_vma active reference counting fully while supporting TTM backend for igfx.
>
>7) As we support compute mode contexts only with GuC scheduler backend and compute mode requires
>  support for suspend and resume of contexts, it will have a dependency on i915 drm scheduler
>  conversion.
>
>Will revise this series accordingly.
>

I was prototyping some of these and they look good.
Still need to address a few opens on dma-resv fence usage for VM_BIND, like
how to effectively update the fence list during VM_BIND (for non-VM-private
objects).

I will be addressing these review comments and hope to post an updated
patch series by the end of this week or so.

Thanks,
Niranjana

>Thanks,
>Niranjana
>
>>>Also on the implementation side we still need to install dma_fence to the
>>>various dma_resv, and for this we need the new dma_resv_usage series from
>>>Christian König first. vm_bind fences can then use the USAGE_BOOKKEEPING
>>>flag to make sure they never result in an oversync issue with execbuf. I
>>>don't think trying to land vm_bind without that prep work in
>>>dma_resv_usage makes sense.
>>>
>>
>>Ok, but that is not a dependency for this VM_BIND design RFC patch right?
>>I will add this to the documentation here.
>>
>>>Also as soon as dma_resv_usage has landed there's a few cleanups we should
>>>do in i915:
>>>- ttm bo moving code should probably simplify a bit (and maybe more of the
>>>code should be pushed as helpers into ttm)
>>>- clflush code should be moved over to using USAGE_KERNEL and the various
>>>hacks and special cases should be ditched. See df94fd05e69e ("drm/i915:
>>>expand on the kernel-doc for cache_dirty") for a bit more context
>>>
>>>This is still not yet enough, since if a vm_bind races with an eviction we
>>>might stall on the new buffers being readied first before the context can
>>>continue. This needs some care to make sure that vma which aren't fully
>>>bound yet are on a separate list, and vma which are marked for unbinding
>>>are removed from the main working set list as soon as possible.
>>>
>>>All of these things are relevant for the uapi semantics, which means
>>>- they need to be documented in the uapi kerneldoc, ideally with example
>>>flows
>>>- umd need to ack this
>>>
>>
>>Ok
>>
>>>The other thing here is the async/nonblocking path. I think we still need
>>>that one, but again it should not sync with anything going on in execbuf,
>>>but simply execute the ioctl code in a kernel thread. The idea here is
>>>that this works like a special gpu engine, so that compute and vk can
>>>schedule bindings interleaved with rendering. This should be enough to get
>>>a performant vk sparse binding/textures implementation.
>>>
>>>But I'm not entirely sure on this one, so this definitely needs acks from
>>>umds.
>>>
>>>>+VM_BIND feature is advertised to user via I915_PARAM_HAS_VM_BIND.
>>>>+User has to opt-in for VM_BIND mode of binding for an address space (VM)
>>>>+during VM creation time via I915_VM_CREATE_FLAGS_USE_VM_BIND extension.
>>>>+A VM in VM_BIND mode will not support older execbuff mode of binding.
>>>>+
>>>>+UMDs can still send BOs of these persistent mappings in execlist of execbuff
>>>>+for specifying BO dependencies (implicit fencing) and to use BO as a batch,
>>>>+but those BOs should be mapped ahead via vm_bind ioctl.
>>>
>>>should or must?
>>>
>>
>>Must, will fix.
>>
>>>Also I'm not really sure that's a great interface. The batchbuffer really
>>>only needs to be an address, so maybe all we need is an extension to
>>>supply an u64 batchbuffer address instead of trying to retrofit this into
>>>an unfitting current uapi.
>>>
>>
>>Yeah, this was considered, but it was decided to do it as a later optimization.
>>But if we were to remove execlist entries completely (i.e., no implicit
>>sync also), then we need to do this from the beginning.
>>
>>>And for implicit sync there's two things:
>>>- for vk I think the right uapi is the dma-buf fence import/export ioctls
>>>from Jason Ekstrand. I think we should land that first instead of
>>>hacking funny concepts together
>>
>>I did not understand fully, can you point to it?
>>
>>>- for gl the dma-buf import/export might not be fast enough, since gl
>>>needs to do a _lot_ of implicit sync. There we might need to use the
>>>execbuffer buffer list, but then we should have extremely clear uapi
>>>rules which disallow _everything_ except setting the explicit sync uapi
>>>
>>
>>Ok, so then, we still need to support implicit sync in vm_bind mode. Right?
>>
>>>Again all this stuff needs to be documented in detail in the kerneldoc
>>>uapi spec.
>>>
>>
>>ok
>>
>>>>+VM_BIND features include,
>>>>+- Multiple Virtual Address (VA) mappings can map to the same physical pages
>>>>+  of an object (aliasing).
>>>>+- VA mapping can map to a partial section of the BO (partial binding).
>>>>+- Support capture of persistent mappings in the dump upon GPU error.
>>>>+- TLB is flushed upon unbind completion. Batching of TLB flushes in some
>>>>+  use cases will be helpful.
>>>>+- Asynchronous vm_bind and vm_unbind support.
>>>>+- VM_BIND uses user/memory fence mechanism for signaling bind completion
>>>>+  and for signaling batch completion in long running contexts (explained
>>>>+  below).
>>>
>>>This should all be in the kerneldoc.
>>>
>>
>>ok
>>
>>>>+VM_PRIVATE objects
>>>>+------------------
>>>>+By default, BOs can be mapped on multiple VMs and can also be dma-buf
>>>>+exported. Hence these BOs are referred to as Shared BOs.
>>>>+During each execbuff submission, the request fence must be added to the
>>>>+dma-resv fence list of all shared BOs mapped on the VM.
>>>>+
>>>>+VM_BIND feature introduces an optimization where user can create BO which
>>>>+is private to a specified VM via I915_GEM_CREATE_EXT_VM_PRIVATE flag during
>>>>+BO creation. Unlike Shared BOs, these VM private BOs can only be mapped on
>>>>+the VM they are private to and can't be dma-buf exported.
>>>>+All private BOs of a VM share the dma-resv object. Hence during each execbuff
>>>>+submission, they need only one dma-resv fence list updated. Thus the fast
>>>>+path (where required mappings are already bound) submission latency is O(1)
>>>>+w.r.t the number of VM private BOs.
>>>
>>>Two things:
>>>
>>>- I think the above is required for initial vm_bind for vk, it kinda
>>>doesn't make much sense without that, and will allow us to match amdgpu
>>>and radeonsi
>>>
>>>- Christian König just landed ttm bulk lru helpers, and I think we need to
>>>use those. This means vm_bind will only work with the ttm backend, but
>>>that's what we have for the big dgpu where vm_bind helps more in terms
>>>of performance, and the igfx conversion to ttm is already going on.
>>>
>>
>>ok
>>
>>>Furthermore the i915 shrinker lru has stopped being an lru, so I think
>>>that should also be moved over to the ttm lru in some fashion to make sure
>>>we once again have a reasonable and consistent memory aging and reclaim
>>>architecture. The current code is just too much of a complete mess.
>>>
>>>And since this is all fairly integral to how the code arch works I don't
>>>think merging a different version which isn't based on ttm bulk lru
>>>helpers makes sense.
>>>
>>>Also I do think the page table lru handling needs to be included here,
>>>because that's another complete hand-rolled separate world for not much
>>>good reasons. I guess that can happen in parallel with the initial vm_bind
>>>bring-up, but it needs to be completed by the time we add the features
>>>beyond the initial support needed for vk.
>>>
>>
>>Ok
>>
>>>>+VM_BIND locking hierarchy
>>>>+-------------------------
>>>>+VM_BIND locking order is as below.
>>>>+
>>>>+1) A vm_bind mutex will protect vm_bind lists. This lock is taken in vm_bind/
>>>>+   vm_unbind ioctl calls, in the execbuff path and while releasing the mapping.
>>>>+
>>>>+   In future, when GPU page faults are supported, we can potentially use a
>>>>+   rwsem instead, so that multiple pagefault handlers can take the read side
>>>>+   lock to lookup the mapping and hence can run in parallel.
>>>>+
>>>>+2) The BO's dma-resv lock will protect i915_vma state and needs to be held
>>>>+   while binding a vma and while updating dma-resv fence list of a BO.
>>>>+   The private BOs of a VM will all share a dma-resv object.
>>>>+
>>>>+   This lock is held in vm_bind call for immediate binding, during vm_unbind
>>>>+   call for unbinding and during execbuff path for binding the mapping and
>>>>+   updating the dma-resv fence list of the BO.
>>>>+
>>>>+3) Spinlock/s to protect some of the VM's lists.
>>>>+
>>>>+We will also need support for bulk LRU movement of persistent mappings to
>>>>+avoid additional latencies in the execbuff path.
>>>
>>>This needs more detail and explanation of how each level is required. Also
>>>the shared dma_resv for VM_PRIVATE objects is kinda important to explain.
>>>
>>>Like "some of the VM's lists" explains pretty much nothing.
>>>
>>
>>Ok, will explain.
>>
>>>>+
>>>>+GPU page faults
>>>>+----------------
>>>>+Both older execbuff mode and the newer VM_BIND mode of binding will require
>>>>+using dma-fence to ensure residency.
>>>>+In future when GPU page faults are supported, no dma-fence usage is required
>>>>+as residency is purely managed by installing and removing/invalidating ptes.
>>>
>>>This is a bit confusing. I think one part of this should be moved into the
>>>section with future vm_bind use-cases (we're not going to support page
>>>faults with legacy softpin or even worse, relocations). The locking
>>>discussion should be part of the much longer list of uses cases that
>>>motivate the locking design.
>>>
>>
>>Ok, will move.
>>
>>>>+
>>>>+
>>>>+User/Memory Fence
>>>>+==================
>>>>+The idea is to take a user specified virtual address and install an interrupt
>>>>+handler to wake up the current task when the memory location passes the user
>>>>+supplied filter.
>>>>+
>>>>+A User/Memory fence is an <address, value> pair. To signal the user fence,
>>>>+the specified value will be written at the specified virtual address and the
>>>>+waiting process will be woken up. The user can wait on a user fence with the
>>>>+gem_wait_user_fence ioctl.
>>>>+
>>>>+It also allows the user to emit their own MI_FLUSH/PIPE_CONTROL notify
>>>>+interrupt within their batches after updating the value to have sub-batch
>>>>+precision on the wakeup. Each batch can signal a user fence to indicate
>>>>+the completion of the next level batch. The completion of the very first level
>>>>+batch needs to be signaled by the command streamer. The user must provide the
>>>>+user/memory fence for this via the DRM_I915_GEM_EXECBUFFER_EXT_USER_FENCE
>>>>+extension of execbuff ioctl, so that KMD can setup the command streamer to
>>>>+signal it.
>>>>+
>>>>+User/Memory fence can also be supplied to the kernel driver to signal/wake up
>>>>+the user process after completion of an asynchronous operation.
>>>>+
>>>>+When the VM_BIND ioctl is provided with a user/memory fence via the
>>>>+I915_VM_BIND_EXT_USER_FENCE extension, it will be signaled upon the completion
>>>>+of binding of that mapping. All async binds/unbinds are serialized, hence
>>>>+signaling of the user/memory fence also indicates the completion of all
>>>>+previous binds/unbinds.
>>>>+
>>>>+This feature will be derived from the below original work:
>>>>+https://patchwork.freedesktop.org/patch/349417/
>>>
>>>This is 1:1 tied to long running compute mode contexts (which in the uapi
>>>doc must reference the endless amounts of bikeshed summary we have in the
>>>docs about indefinite fences).
>>>
>>
>>Ok, will check and add reference.
>>
>>>I'd put this into a new section about compute and userspace memory fences
>>>support, with this and the next chapter ...
>>
>>ok
>>
>>>>+
>>>>+
>>>>+VM_BIND use cases
>>>>+==================
>>>
>>>... and then make this section here focus entirely on additional vm_bind
>>>use-cases that we'll be adding later on. Which doesn't need to go into any
>>>details, it's just justification for why we want to build the world on top
>>>of vm_bind.
>>>
>>
>>ok
>>
>>>>+
>>>>+Long running Compute contexts
>>>>+------------------------------
>>>>+Usage of dma-fence expects that it completes in a reasonable amount of time.
>>>>+Compute on the other hand can be long running. Hence it is appropriate for
>>>>+compute to use user/memory fence and dma-fence usage will be limited to
>>>>+in-kernel consumption only. This requires an execbuff uapi extension to pass
>>>>+in user fence. Compute must opt-in for this mechanism with
>>>>+I915_CONTEXT_CREATE_FLAGS_LONG_RUNNING flag during context creation.
>>>>+
>>>>+The dma-fence based user interfaces like the gem_wait ioctl, execbuff out
>>>>+fence and implicit dependency setting are not allowed on long running contexts.
>>>>+
>>>>+Where GPU page faults are not available, the kernel driver, upon buffer
>>>>+invalidation, will initiate a suspend (preemption) of the long running context
>>>>+with a dma-fence attached to it. Upon completion of that suspend fence, it
>>>>+finishes the invalidation, revalidates the BO and then resumes the compute
>>>>+context. This is
>>>>+done by having a per-context fence (called suspend fence) proxying as
>>>>+i915_request fence. This suspend fence is enabled when there is a wait on it,
>>>>+which triggers the context preemption.
>>>>+
>>>>+This is much easier to support with VM_BIND compared to the current heavier
>>>>+execbuff path resource attachment.
>>>
>>>There's a bunch of tricky code around compute mode context support, like
>>>the preempt ctx fence (or suspend fence or whatever you want to call it),
>>>and the resume work. And I think that code should be shared across
>>>drivers.
>>>
>>>I think the right place to put this is into drm/sched, somewhere attached
>>>to the drm_sched_entity structure. I expect i915 folks to collaborate with
>>>amd and ideally also get amdkfd to adopt the same thing if possible. At
>>>least Christian has mentioned in the past that he's a bit unhappy about
>>>how this works.
>>>
>>>Also drm/sched has dependency tracking, which will be needed to pipeline
>>>context resume operations. That needs to be used instead of i915-gem
>>>inventing yet another dependency tracking data structure (it already has 3
>>>and that's roughly 3 too many).
>>>
>>>This means compute mode support and userspace memory fences are blocked on
>>>the drm/sched conversion, but *eh* add it to the list of reasons for why
>>>drm/sched needs to happen.
>>>
>>>Also since we only have support for compute mode ctx in our internal tree
>>>with the guc scheduler backend anyway, and the first conversion target is
>>>the guc backend, I don't think this actually holds up a lot of the code.
>>>
>>
>>Hmm...ok. Currently, the context suspend and resume operations in our
>>internal tree are through an orthogonal guc interface (not through the scheduler).
>>So, I need to look more into this part.
>>
>>>>+Low Latency Submission
>>>>+-----------------------
>>>>+Allows compute UMD to directly submit GPU jobs instead of through execbuff
>>>>+ioctl. VM_BIND allows map/unmap of BOs required for directly submitted jobs.
>>>
>>>This is really just a special case of compute mode contexts, I think I'd
>>>include that in there, but explain better what it requires (i.e. vm_bind
>>>not being synchronized against execbuf).
>>>
>>
>>ok
>>
>>>>+
>>>>+Debugger
>>>>+---------
>>>>+With the debug event interface, a user space process (debugger) is able to
>>>>+keep track of and act upon resources created by another process (debuggee)
>>>>+and attached to the GPU via the vm_bind interface.
>>>>+
>>>>+Mesa/Vulkan
>>>>+------------
>>>>+VM_BIND can potentially reduce the CPU overhead in Mesa, thus improving
>>>>+performance. For Vulkan it should be straightforward to use VM_BIND.
>>>>+For Iris implicit buffer tracking must be implemented before we can harness
>>>>+VM_BIND benefits. With increasing GPU hardware performance reducing CPU
>>>>+overhead becomes more important.
>>>
>>>Just to clarify, I don't think we can land vm_bind into upstream if it
>>>doesn't work 100% for vk. There's a bit much "can" instead of "will in
>>>this section".
>>>
>>
>>ok, will explain better.
>>
>>>>+
>>>>+Page level hints settings
>>>>+--------------------------
>>>>+VM_BIND allows any hints setting per mapping instead of per BO.
>>>>+Possible hints include read-only, placement and atomicity.
>>>>+Sub-BO level placement hint will be even more relevant with
>>>>+upcoming GPU on-demand page fault support.
>>>>+
>>>>+Page level Cache/CLOS settings
>>>>+-------------------------------
>>>>+VM_BIND allows cache/CLOS settings per mapping instead of per BO.
>>>>+
>>>>+Shared Virtual Memory (SVM) support
>>>>+------------------------------------
>>>>+VM_BIND interface can be used to map system memory directly (without gem BO
>>>>+abstraction) using the HMM interface.
>>>
>>>Userptr is absent here (and it's not the same as svm, at least on
>>>discrete), and this is needed for the initial version since otherwise vk
>>>can't use it because we're not at feature parity.
>>>
>>
>>userptr gem objects are supported in the initial version (and yes, it is not
>>the same as SVM). I did not add it here as there is no additional uapi
>>change required to support that.
>>
>>>Irc discussions by Maarten and Dave came up with the idea that maybe
>>>userptr for vm_bind should work _without_ any gem bo as backing storage,
>>>since that guarantees that people don't come up with funny ideas like
>>>trying to share such bo across process or mmap it and other nonsense which
>>>just doesn't work.
>>>
>>
>>Hmm...there is no plan to support userptr _without_ a gem bo, at least not in
>>the initial vm_bind support. Is it OK to put it in the 'futures' section?
>>
>>>>+
>>>>+
>>>>+Broader i915 cleanups
>>>>+=====================
>>>>+Supporting this whole new vm_bind mode of binding, which comes with its own
>>>>+use cases and locking requirements, requires proper integration with the
>>>>+existing i915 driver. This calls for some broader i915 driver
>>>>+cleanups/simplifications for maintainability of the driver going forward.
>>>>+Here are a few things identified that are being looked into.
>>>>+
>>>>+- Make pagetable allocations evictable and manage them similar to VM_BIND
>>>>+  mapped objects. Page table pages are similar to persistent mappings of a
>>>>+  VM (the differences here are that the page table pages will not have an
>>>>+  i915_vma structure and that, after swapping pages back in, the parent page
>>>>+  link needs to be updated).
>>>
>>>See above, but I think this should be included as part of the initial
>>>vm_bind push.
>>>
>>
>>Ok, as you mentioned above, we can do it soon after initial vm_bind support
>>lands, but before we add any new vm_bind features.
>>
>>>>+- Remove the vma lookup cache (eb->gem_context->handles_vma). The VM_BIND
>>>>+  feature does not use it, and the complexity it brings in is probably more
>>>>+  than the performance advantage we get in the legacy execbuff case.
>>>>+- Remove vma->open_count counting
>>>>+- Remove i915_vma active reference tracking. Instead use underlying BO's
>>>>+  dma-resv fence list to determine if a i915_vma is active or not.
>>>
>>>So this is a complete mess, and really should not exist. I think it needs
>>>to be removed before we try to make i915_vma even more complex by adding
>>>vm_bind.
>>>
>>
>>Hmm...Need to look into this. I am not sure how much of an effort it is going
>>to be to remove i915_vma active reference tracking and instead use dma_resv
>>fences for activeness tracking.
>>
>>>The other thing I've been pondering here is that vm_bind is really
>>>completely different from legacy vm structures for a lot of reasons:
>>>- no relocation or softpin handling, which means vm_bind has no reason to
>>>ever look at the i915_vma structure in execbuf code. Unfortunately
>>>execbuf has been rewritten to be vma instead of obj centric, so it's a
>>>100% mismatch
>>>
>>>- vm_bind never has to manage any vm lru. Legacy execbuf has to maintain
>>>that because the kernel manages the virtual address space fully. Again
>>>ideally that entire vma_move_to_active code and everything related to it
>>>would simply not exist.
>>>
>>>- similar on the eviction side, the rules are quite different: For vm_bind
>>>we never tear down the vma, instead it's just moved to the list of
>>>evicted vma. Legacy vm have no need for all these additional lists, so
>>>another huge confusion.
>>>
>>>- if the refcount is done correctly for vm_bind we wouldn't need the
>>>tricky code in the bo close paths. Unfortunately legacy vm with
>>>relocations and softpin require that vma are only a weak reference, so
>>>that cannot be removed.
>>>
>>>- there's also a ton of special cases for ggtt handling, like the
>>>different views (for display, partial views for mmap), but also the
>>>gen2/3 alignment and padding requirements which vm_bind never needs.
>>>
>>>I think the right thing here is to massively split the implementation
>>>behind some solid vm/vma abstraction, with a base class for vm and vma
>>>which _only_ has the pieces which both vm_bind and the legacy vm stuff
>>>needs. But it's a bit tricky to get there. I think a workable path would
>>>be:
>>>- Add a new base class to both i915_address_space and i915_vma, which
>>>starts out empty.
>>>
>>>- As vm_bind code lands, move things that vm_bind code needs into these
>>>base classes
>>>
>>
>>Ok
>>
>>>- The goal should be that these base classes are a stand-alone library
>>>that other drivers could reuse. Like we've done with the buddy
>>>allocator, which first moved from i915-gem to i915-ttm, and which amd
>>>now moved to drm/ttm for reuse by amdgpu. Ideally other drivers
>>>interested in adding something like vm_bind should be involved from the
>>>start (or maybe the entire thing reused in amdgpu, they're looking at
>>>vk sparse binding support too or at least have perf issues I think).
>>>
>>>- Locking must be the same across all implementations, otherwise it's
>>>really not an abstraction. i915 screwed this up terribly by having
>>>different locking rules for ppgtt and ggtt, which is just nonsense.
>>>
>>>- The legacy specific code needs to be extracted as much as possible and
>>>shoved into separate files. In execbuf this means we need to get back to
>>>object centric flow, and the slowpaths need to become a lot simpler
>>>again (Maarten has cleaned up some of this, but there's still a silly
>>>amount of hacks in there with funny layering).
>>>
>>
>>This we can also do soon after the vm_bind code lands, right?
>>
>>>- I think if stuff like the vma eviction details (list movement and
>>>locking and refcounting of the underlying object)
>>>
>>>>+
>>>>+These can be worked upon after initial vm_bind support is added.
>>>
>>>I don't think that works, given how badly i915-gem team screwed up in
>>>other places. And those places had to be fixed by adopting shared code
>>>like ttm. Plus there's already a huge unfulfilled promise pending with the
>>>drm/sched conversion, i915-gem team is clearly deeply in the red here :-/
>>>
>>
>>Hmmm ok. As I mentioned above, I need to look into how to remove i915_vma
>>active reference tracking code from the i915 driver. I wonder if there is any
>>middle ground here, like not using it in vm_bind mode?
>>
>>Niranjana
>>
>>>Cheers, Daniel
>>>
>>>>+
>>>>+
>>>>+UAPI
>>>>+=====
>>>>+Uapi definition can be found here:
>>>>+.. kernel-doc:: Documentation/gpu/rfc/i915_vm_bind.h
>>>>diff --git a/Documentation/gpu/rfc/index.rst b/Documentation/gpu/rfc/index.rst
>>>>index 91e93a705230..7d10c36b268d 100644
>>>>--- a/Documentation/gpu/rfc/index.rst
>>>>+++ b/Documentation/gpu/rfc/index.rst
>>>>@@ -23,3 +23,7 @@ host such documentation:
>>>>.. toctree::
>>>>
>>>>    i915_scheduler.rst
>>>>+
>>>>+.. toctree::
>>>>+
>>>>+    i915_vm_bind.rst
>>>>--
>>>>2.21.0.rc0.32.g243a4c7e27
>>>>
>>>
>>>--
>>>Daniel Vetter
>>>Software Engineer, Intel Corporation
>>>http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [Intel-gfx] [RFC v2 1/2] drm/doc/rfc: VM_BIND feature design document
@ 2022-05-09 23:11           ` Niranjana Vishwanathapura
  0 siblings, 0 replies; 31+ messages in thread
From: Niranjana Vishwanathapura @ 2022-05-09 23:11 UTC (permalink / raw)
  To: Daniel Vetter
  Cc: Dave Airlie, intel-gfx, chris.p.wilson, dri-devel, daniel.vetter,
	thomas.hellstrom, Christian König, Ben Skeggs

On Wed, Apr 27, 2022 at 08:41:35AM -0700, Niranjana Vishwanathapura wrote:
>On Wed, Apr 20, 2022 at 03:45:25PM -0700, Niranjana Vishwanathapura wrote:
>>On Thu, Mar 31, 2022 at 10:28:48AM +0200, Daniel Vetter wrote:
>>>Adding a pile of people who've expressed interest in vm_bind for their
>>>drivers.
>>>
>>>Also note to the intel folks: This is largely written with me having my
>>>subsystem co-maintainer hat on, i.e. what I think is the right thing to do
>>>here for the subsystem at large. There is substantial rework involved
>>>here, but it's not any different from i915 adopting ttm or i915 adopting
>>>drm/sched, and I do think this stuff needs to happen in one form or
>>>another.
>>>
>>>On Mon, Mar 07, 2022 at 12:31:45PM -0800, Niranjana Vishwanathapura wrote:
>>>>VM_BIND design document with description of intended use cases.
>>>>
>>>>Signed-off-by: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>
>>>>---
>>>>Documentation/gpu/rfc/i915_vm_bind.rst | 210 +++++++++++++++++++++++++
>>>>Documentation/gpu/rfc/index.rst        |   4 +
>>>>2 files changed, 214 insertions(+)
>>>>create mode 100644 Documentation/gpu/rfc/i915_vm_bind.rst
>>>>
>>>>diff --git a/Documentation/gpu/rfc/i915_vm_bind.rst b/Documentation/gpu/rfc/i915_vm_bind.rst
>>>>new file mode 100644
>>>>index 000000000000..cdc6bb25b942
>>>>--- /dev/null
>>>>+++ b/Documentation/gpu/rfc/i915_vm_bind.rst
>>>>@@ -0,0 +1,210 @@
>>>>+==========================================
>>>>+I915 VM_BIND feature design and use cases
>>>>+==========================================
>>>>+
>>>>+VM_BIND feature
>>>>+================
>>>>+DRM_I915_GEM_VM_BIND/UNBIND ioctls allow UMDs to bind/unbind GEM buffer
>>>>+objects (BOs) or sections of a BOs at specified GPU virtual addresses on
>>>>+a specified address space (VM).
>>>>+
>>>>+These mappings (also referred to as persistent mappings) will be persistent
>>>>+across multiple GPU submissions (execbuff) issued by the UMD, without the user
>>>>+having to provide a list of all required mappings during each submission
>>>>+(as required by older execbuff mode).
>>>>+
>>>>+VM_BIND ioctl defers binding the mappings until the next execbuff submission
>>>>+where it will be required, or immediately if I915_GEM_VM_BIND_IMMEDIATE
>>>>+flag is set (useful if mapping is required for an active context).
>>>
>>>So this is a screw-up I've done, and for upstream I think we need to fix
>>>it: Implicit sync is bad, and it's also still a bad idea for vm_bind, and
>>>I was wrong suggesting we should do this a few years back when we kicked
>>>this off internally :-(
>>>
>>>What I think we need is just always VM_BIND_IMMEDIATE mode, and then a few
>>>things on top:
>>>- in and out fences, like with execbuf, to allow userspace to sync with
>>>execbuf as needed
>>>- for compute-mode context this means userspace memory fences
>>>- for legacy context this means a timeline syncobj in drm_syncobj
>>>
>>>No sync_file or anything else like this at all. This means a bunch of
>>>work, but also it'll have benefits because it means we should be able to
>>>use exactly the same code paths and logic for both compute and for legacy
>>>context, because drm_syncobj support future fence semantics.
>>>
>>
>>Thanks Daniel,
>>Ok, will update
>>
>
>I had a long conversation with Daniel on some of the points discussed here.
>Thanks to Daniel for clarifying many points here.
>
>Here is the summary of the discussion.
>
>1) A prep patch is needed to update documentation of some existing uapi and this
>  new VM_BIND uapi can update/refer to that.
>  I will include this prep patch in the next revision of this RFC series.
>  Will also include the uapi header file in the rst file so that it gets rendered.
>
>2) Will update documentation here with proper use of dma_resv_usage while adding
>  fences to vm_bind objects. It is going to be DMA_RESV_USAGE_BOOKKEEP by default
>  if not overridden via the execlist in the execbuff path.
>
>3) Add extension to execbuff ioctl to specify batch buffer as GPU virtual address
>  instead of having to pass it as a BO handle in execlist. This will also make the
>  execlist usage solely for implicit sync setting which is further discussed below.
>
>4) Need to look into when Jason's dma-buf fence import/export ioctl support will
>  land and whether it will be used both for vl and gl. Need to sync with Jason on this.
>  Probably the better option here would be to not support execlist in execbuff path in
>  vm_bind mode for initial vm_bind support (hoping Jason's dma-buf fence import/export
>  ioctl will be enough). We can add support for execlist in execbuff for vm_bind mode
>  later if required (say for gl).
>
>5) There are a lot of things in the execbuff path that don't apply in VM_BIND mode (like
>  relocations, implicit sync etc). Separate them out by using function pointers wherever
>  the functionality differs between current design and the newer VM_BIND design.
>
>6) Separate out i915_vma active reference counting in execbuff path and do not use it in
>  VM_BIND mode. Instead use dma-resv fence checking for VM_BIND mode. This should be easier
>  to get it working with the current TTM backend (which initial VM_BIND support will use).
>  And remove i915_vma active reference counting fully while supporting TTM backend for igfx.
>
>7) As we support compute mode contexts only with GuC scheduler backend and compute mode requires
>  support for suspend and resume of contexts, it will have a dependency on i915 drm scheduler
>  conversion.
>
>Will revise this series accordingly.
>

I was prototyping some of these and they look good.
Still need to address a few opens on dma-resv fence usage for VM_BIND, like
how to effectively update the fence list during VM_BIND (for non VM private
objects).

I will be addressing these review comments and hoping to post updated
patch series by the end of this week or so.

Thanks,
Niranjana

>Thanks,
>Niranjana
>
>>>Also on the implementation side we still need to install dma_fence to the
>>>various dma_resv, and for this we need the new dma_resv_usage series from
>>>Christian König first. vm_bind fences can then use the USAGE_BOOKKEEPING
>>>flag to make sure they never result in an oversync issue with execbuf. I
>>>don't think trying to land vm_bind without that prep work in
>>>dma_resv_usage makes sense.
>>>
>>
>>Ok, but that is not a dependency for this VM_BIND design RFC patch right?
>>I will add this to the documentation here.
>>
>>>Also as soon as dma_resv_usage has landed there's a few cleanups we should
>>>do in i915:
>>>- ttm bo moving code should probably simplify a bit (and maybe more of the
>>>code should be pushed as helpers into ttm)
>>>- clflush code should be moved over to using USAGE_KERNEL and the various
>>>hacks and special cases should be ditched. See df94fd05e69e ("drm/i915:
>>>expand on the kernel-doc for cache_dirty") for a bit more context
>>>
>>>This is still not yet enough, since if a vm_bind races with an eviction we
>>>might stall on the new buffers being readied first before the context can
>>>continue. This needs some care to make sure that vma which aren't fully
>>>bound yet are on a separate list, and vma which are marked for unbinding
>>>are removed from the main working set list as soon as possible.
>>>
>>>All of these things are relevant for the uapi semantics, which means
>>>- they need to be documented in the uapi kerneldoc, ideally with example
>>>flows
>>>- umd need to ack this
>>>
>>
>>Ok
>>
>>>The other thing here is the async/nonblocking path. I think we still need
>>>that one, but again it should not sync with anything going on in execbuf,
>>>but simply execute the ioctl code in a kernel thread. The idea here is
>>>that this works like a special gpu engine, so that compute and vk can
>>>schedule bindings interleaved with rendering. This should be enough to get
>>>a performant vk sparse binding/textures implementation.
>>>
>>>But I'm not entirely sure on this one, so this definitely needs acks from
>>>umds.
>>>
>>>>+VM_BIND feature is advertised to user via I915_PARAM_HAS_VM_BIND.
>>>>+User has to opt-in for VM_BIND mode of binding for an address space (VM)
>>>>+during VM creation time via I915_VM_CREATE_FLAGS_USE_VM_BIND extension.
>>>>+A VM in VM_BIND mode will not support the older execbuff mode of binding.
>>>>+
>>>>+UMDs can still send BOs of these persistent mappings in execlist of execbuff
>>>>+for specifying BO dependencies (implicit fencing) and to use BO as a batch,
>>>>+but those BOs should be mapped ahead via vm_bind ioctl.
>>>
>>>should or must?
>>>
>>
>>Must, will fix.
>>
>>>Also I'm not really sure that's a great interface. The batchbuffer really
>>>only needs to be an address, so maybe all we need is an extension to
>>>supply an u64 batchbuffer address instead of trying to retrofit this into
>>>an unfitting current uapi.
>>>
>>
>>Yah, this was considered, but was decided to do it as later optimization.
>>But if we were to remove execlist entries completely (ie., no implicit
>>sync also), then we need to do this from the beginning.
>>
>>>And for implicit sync there's two things:
>>>- for vk I think the right uapi is the dma-buf fence import/export ioctls
>>>from Jason Ekstrand. I think we should land that first instead of
>>>hacking funny concepts together
>>
>>I did not understand fully, can you point to it?
>>
>>>- for gl the dma-buf import/export might not be fast enough, since gl
>>>needs to do a _lot_ of implicit sync. There we might need to use the
>>>execbuffer buffer list, but then we should have extremely clear uapi
>>>rules which disallow _everything_ except setting the explicit sync uapi
>>>
>>
>>Ok, so then, we still need to support implicit sync in vm_bind mode. Right?
>>
>>>Again all this stuff needs to be documented in detail in the kerneldoc
>>>uapi spec.
>>>
>>
>>ok
>>
>>>>+VM_BIND features include,
>>>>+- Multiple Virtual Address (VA) mappings can map to the same physical pages
>>>>+  of an object (aliasing).
>>>>+- VA mapping can map to a partial section of the BO (partial binding).
>>>>+- Support capture of persistent mappings in the dump upon GPU error.
>>>>+- TLB is flushed upon unbind completion. Batching of TLB flushes in some
>>>>+  use cases will be helpful.
>>>>+- Asynchronous vm_bind and vm_unbind support.
>>>>+- VM_BIND uses user/memory fence mechanism for signaling bind completion
>>>>+  and for signaling batch completion in long running contexts (explained
>>>>+  below).
>>>
>>>This should all be in the kerneldoc.
>>>
>>
>>ok
>>
>>>>+VM_PRIVATE objects
>>>>+------------------
>>>>+By default, BOs can be mapped on multiple VMs and can also be dma-buf
>>>>+exported. Hence these BOs are referred to as Shared BOs.
>>>>+During each execbuff submission, the request fence must be added to the
>>>>+dma-resv fence list of all shared BOs mapped on the VM.
>>>>+
>>>>+The VM_BIND feature introduces an optimization where a user can create a
>>>>+BO which is private to a specified VM via the I915_GEM_CREATE_EXT_VM_PRIVATE
>>>>+flag during BO creation. Unlike Shared BOs, these VM private BOs can only
>>>>+be mapped on the VM they are private to and can't be dma-buf exported.
>>>>+All private BOs of a VM share the dma-resv object. Hence, during each
>>>>+execbuff submission, only one dma-resv fence list needs to be updated. Thus
>>>>+the fast path (where required mappings are already bound) submission latency
>>>>+is O(1) w.r.t. the number of VM private BOs.
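As a toy illustration of the fast-path claim above (purely illustrative C, not driver code): when private BOs share one dma-resv object, a submission updates a single fence list no matter how many BOs are mapped, while per-BO resv objects cost one update each.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model: every BO points at a dma-resv object. VM-private BOs all
 * share the VM's resv; shared BOs each carry their own. */
struct resv { int updated; };
struct bo   { struct resv *resv; };

/* Submission fast path: add the request fence once per distinct resv
 * object, mimicking the "one fence list update" optimization. */
static int submit(struct bo *bos, size_t n)
{
	int updates = 0;
	for (size_t i = 0; i < n; i++) {
		if (!bos[i].resv->updated) {
			bos[i].resv->updated = 1;  /* fence list touched */
			updates++;
		}
	}
	return updates;
}
```

With four VM-private BOs sharing one resv this returns 1; with four shared BOs it returns 4.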
>>>
>>>Two things:
>>>
>>>- I think the above is required to for initial vm_bind for vk, it kinda
>>>doesn't make much sense without that, and will allow us to match amdgpu
>>>and radeonsi
>>>
>>>- Christian König just landed ttm bulk lru helpers, and I think we need to
>>>use those. This means vm_bind will only work with the ttm backend, but
>>>that's what we have for the big dgpu where vm_bind helps more in terms
>>>of performance, and the igfx conversion to ttm is already going on.
>>>
>>
>>ok
>>
>>>Furthermore the i915 shrinker lru has stopped being an lru, so I think
>>>that should also be moved over to the ttm lru in some fashion to make sure
>>>we once again have a reasonable and consistent memory aging and reclaim
>>>architecture. The current code is just too much of a complete mess.
>>>
>>>And since this is all fairly integral to how the code arch works I don't
>>>think merging a different version which isn't based on ttm bulk lru
>>>helpers makes sense.
>>>
>>>Also I do think the page table lru handling needs to be included here,
>>>because that's another complete hand-rolled separate world for not much
>>>good reasons. I guess that can happen in parallel with the initial vm_bind
>>>bring-up, but it needs to be completed by the time we add the features
>>>beyond the initial support needed for vk.
>>>
>>
>>Ok
>>
>>>>+VM_BIND locking hierarchy
>>>>+-------------------------
>>>>+VM_BIND locking order is as below.
>>>>+
>>>>+1) A vm_bind mutex will protect vm_bind lists. This lock is taken in vm_bind/
>>>>+   vm_unbind ioctl calls, in the execbuff path and while releasing the mapping.
>>>>+
>>>>+   In the future, when GPU page faults are supported, we can potentially
>>>>+   use a rwsem instead, so that multiple pagefault handlers can take the
>>>>+   read side lock to look up the mapping and hence can run in parallel.
>>>>+
>>>>+2) The BO's dma-resv lock will protect i915_vma state and needs to be held
>>>>+   while binding a vma and while updating dma-resv fence list of a BO.
>>>>+   The private BOs of a VM will all share a dma-resv object.
>>>>+
>>>>+   This lock is held in vm_bind call for immediate binding, during vm_unbind
>>>>+   call for unbinding and during execbuff path for binding the mapping and
>>>>+   updating the dma-resv fence list of the BO.
>>>>+
>>>>+3) Spinlock/s to protect some of the VM's lists.
>>>>+
>>>>+We will also need support for bulk LRU movement of persistent mappings to
>>>>+avoid additional latencies in the execbuff path.
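The three levels above can be illustrated with a toy lock-order checker (real code would rely on lockdep; the names and structure here are illustrative only):

```c
#include <assert.h>

/* Toy lock-order checker for the three levels described above:
 *   level 1: vm_bind mutex
 *   level 2: BO dma-resv lock
 *   level 3: VM list spinlock(s)
 * Real kernel code relies on lockdep; this only illustrates the rule
 * that locks must be taken in increasing level order. */
static int held_level;  /* highest level currently held, 0 = none */

static int lock_order_ok(int level)
{
	return level > held_level;
}

static void acquire(int level)
{
	assert(lock_order_ok(level));
	held_level = level;
}

static void release(int prev_level)
{
	held_level = prev_level;  /* restore, assuming LIFO release */
}
```

For example, taking the vm_bind mutex (level 1) while holding a dma-resv lock (level 2) fails the check, which is exactly the inversion the hierarchy forbids.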
>>>
>>>This needs more detail and explanation of how each level is required. Also
>>>the shared dma_resv for VM_PRIVATE objects is kinda important to explain.
>>>
>>>Like "some of the VM's lists" explains pretty much nothing.
>>>
>>
>>Ok, will explain.
>>
>>>>+
>>>>+GPU page faults
>>>>+----------------
>>>>+Both older execbuff mode and the newer VM_BIND mode of binding will require
>>>>+using dma-fence to ensure residency.
>>>>+In the future, when GPU page faults are supported, no dma-fence usage
>>>>+will be required, as residency is purely managed by installing and
>>>>+removing/invalidating ptes.
>>>
>>>This is a bit confusing. I think one part of this should be moved into the
>>>section with future vm_bind use-cases (we're not going to support page
>>>faults with legacy softpin or even worse, relocations). The locking
>>>discussion should be part of the much longer list of uses cases that
>>>motivate the locking design.
>>>
>>
>>Ok, will move.
>>
>>>>+
>>>>+
>>>>+User/Memory Fence
>>>>+==================
>>>>+The idea is to take a user specified virtual address and install an interrupt
>>>>+handler to wake up the current task when the memory location passes the user
>>>>+supplied filter.
>>>>+
>>>>+A user/memory fence is an <address, value> pair. To signal the user fence,
>>>>+the specified value will be written at the specified virtual address and
>>>>+the waiting process will be woken up. A user can wait on a user fence with
>>>>+the gem_wait_user_fence ioctl.
>>>>+
>>>>+It also allows the user to emit their own MI_FLUSH/PIPE_CONTROL notify
>>>>+interrupt within their batches after updating the value, to have sub-batch
>>>>+precision on the wakeup. Each batch can signal a user fence to indicate
>>>>+the completion of the next level batch. The completion of the very first
>>>>+level batch needs to be signaled by the command streamer. The user must
>>>>+provide the
>>>>+user/memory fence for this via the DRM_I915_GEM_EXECBUFFER_EXT_USER_FENCE
>>>>+extension of execbuff ioctl, so that KMD can setup the command streamer to
>>>>+signal it.
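To make the <address, value> semantics above concrete, here is a toy user-space sketch of signaling and checking such a fence; the struct layout and helper names are illustrative, not the actual i915 uapi:

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative only: a user/memory fence is an <address, value> pair.
 * The real uapi struct name and layout may differ. */
struct user_fence {
	uint64_t *addr;   /* user supplied virtual address */
	uint64_t  value;  /* value whose presence signals the fence */
};

/* Signaling: the producer (GPU or KMD) writes the value to the address,
 * then wakes any waiter (the wakeup itself is elided here). */
static void user_fence_signal(struct user_fence *f)
{
	__atomic_store_n(f->addr, f->value, __ATOMIC_RELEASE);
}

/* Waiting side: compare the memory contents against the expected value
 * (the "filter" mentioned above). */
static int user_fence_is_signaled(const struct user_fence *f)
{
	return __atomic_load_n(f->addr, __ATOMIC_ACQUIRE) == f->value;
}
```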
>>>>+
>>>>+User/Memory fence can also be supplied to the kernel driver to signal/wake up
>>>>+the user process after completion of an asynchronous operation.
>>>>+
>>>>+When the VM_BIND ioctl is provided with a user/memory fence via the
>>>>+I915_VM_BIND_EXT_USER_FENCE extension, it will be signaled upon the
>>>>+completion of binding of that mapping. All async binds/unbinds are
>>>>+serialized, hence signaling of a user/memory fence also indicates the
>>>>+completion of all previous binds/unbinds.
>>>>+
>>>>+This feature will be derived from the below original work:
>>>>+https://patchwork.freedesktop.org/patch/349417/
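A toy model of the serialization guarantee above (illustrative only, not driver code): because async binds complete strictly in submission order, the user fence attached to bind k being signaled also implies completion of all earlier binds.

```c
#include <assert.h>

/* Toy model: async vm_bind/vm_unbind operations complete strictly in
 * submission order, so observing the user fence of bind k implies
 * binds 0..k-1 have completed as well. Illustrative only. */
static int binds_completed;  /* count of binds finished so far */

static void bind_complete(void)
{
	binds_completed++;  /* in-order completion; signals that bind's fence */
}

static int bind_fence_signaled(int bind_idx)
{
	return bind_idx < binds_completed;
}
```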
>>>
>>>This is 1:1 tied to long running compute mode contexts (which in the uapi
>>>doc must reference the endless amounts of bikeshed summary we have in the
>>>docs about indefinite fences).
>>>
>>
>>Ok, will check and add reference.
>>
>>>I'd put this into a new section about compute and userspace memory fences
>>>support, with this and the next chapter ...
>>
>>ok
>>
>>>>+
>>>>+
>>>>+VM_BIND use cases
>>>>+==================
>>>
>>>... and then make this section here focus entirely on additional vm_bind
>>>use-cases that we'll be adding later on. Which doesn't need to go into any
>>>details, it's just justification for why we want to build the world on top
>>>of vm_bind.
>>>
>>
>>ok
>>
>>>>+
>>>>+Long running Compute contexts
>>>>+------------------------------
>>>>+Usage of dma-fence expects that fences complete in a reasonable amount of time.
>>>>+Compute on the other hand can be long running. Hence it is appropriate for
>>>>+compute to use user/memory fence and dma-fence usage will be limited to
>>>>+in-kernel consumption only. This requires an execbuff uapi extension to
>>>>+pass in a user fence. Compute must opt in to this mechanism with the
>>>>+I915_CONTEXT_CREATE_FLAGS_LONG_RUNNING flag during context creation.
>>>>+
>>>>+The dma-fence based user interfaces like gem_wait ioctl, execbuff out fence
>>>>+and implicit dependency setting are not allowed on long running contexts.
>>>>+
>>>>+Where GPU page faults are not available, the kernel driver, upon buffer
>>>>+invalidation, will initiate a suspend (preemption) of the long running
>>>>+context with a dma-fence attached to it. Upon completion of that suspend
>>>>+fence, it will finish the invalidation, revalidate the BO and then resume
>>>>+the compute context. This is done by having a per-context fence (called a
>>>>+suspend fence) proxying as the i915_request fence. This suspend fence is
>>>>+enabled when there is a wait on it, which triggers the context preemption.
>>>>+
>>>>+This is much easier to support with VM_BIND compared to the current heavier
>>>>+execbuff path resource attachment.
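The dma-fence restrictions above could be sketched as a toy validation check (the flag and op names are illustrative, not the actual uapi):

```c
#include <assert.h>

/* Toy validation mirroring the restriction above: dma-fence based uapi
 * (gem_wait, execbuff out-fence, implicit sync) is rejected on contexts
 * created with the long-running flag, leaving only user/memory fences.
 * Flag and op names are illustrative, not the actual uapi. */
#define CTX_FLAG_LONG_RUNNING 0x1u

enum uapi_op {
	OP_GEM_WAIT,
	OP_EXECBUFF_OUT_FENCE,
	OP_IMPLICIT_SYNC,
	OP_USER_FENCE_WAIT,
};

static int op_allowed(unsigned int ctx_flags, enum uapi_op op)
{
	if (!(ctx_flags & CTX_FLAG_LONG_RUNNING))
		return 1;  /* normal contexts: everything allowed */
	return op == OP_USER_FENCE_WAIT;
}
```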
>>>
>>>There's a bunch of tricky code around compute mode context support, like
>>>the preempt ctx fence (or suspend fence or whatever you want to call it),
>>>and the resume work. And I think that code should be shared across
>>>drivers.
>>>
>>>I think the right place to put this is into drm/sched, somewhere attached
>>>to the drm_sched_entity structure. I expect i915 folks to collaborate with
>>>amd and ideally also get amdkfd to adopt the same thing if possible. At
>>>least Christian has mentioned in the past that he's a bit unhappy about
>>>how this works.
>>>
>>>Also drm/sched has dependency tracking, which will be needed to pipeline
>>>context resume operations. That needs to be used instead of i915-gem
>>>inventing yet another dependency tracking data structure (it already has 3
>>>and that's roughly 3 too many).
>>>
>>>This means compute mode support and userspace memory fences are blocked on
>>>the drm/sched conversion, but *eh* add it to the list of reasons for why
>>>drm/sched needs to happen.
>>>
>>>Also since we only have support for compute mode ctx in our internal tree
>>>with the guc scheduler backend anyway, and the first conversion target is
>>>the guc backend, I don't think this actually holds up a lot of the code.
>>>
>>
>>Hmm...ok. Currently, the context suspend and resume operations in our
>>internal tree are done through an orthogonal GuC interface (not through the
>>scheduler). So, I need to look more into this part.
>>
>>>>+Low Latency Submission
>>>>+-----------------------
>>>>+Allows compute UMD to directly submit GPU jobs instead of through execbuff
>>>>+ioctl. VM_BIND allows map/unmap of BOs required for directly submitted jobs.
>>>
>>>This is really just a special case of compute mode contexts, I think I'd
>>>include that in there, but explain better what it requires (i.e. vm_bind
>>>not being synchronized against execbuf).
>>>
>>
>>ok
>>
>>>>+
>>>>+Debugger
>>>>+---------
>>>>+With the debug event interface, a user space process (debugger) is able to
>>>>+keep track of and act upon resources created by another process (debuggee)
>>>>+and attached to the GPU via the vm_bind interface.
>>>>+
>>>>+Mesa/Vulkan
>>>>+------------
>>>>+VM_BIND can potentially reduce CPU overhead in Mesa, thus improving
>>>>+performance. For Vulkan it should be straightforward to use VM_BIND.
>>>>+For Iris, implicit buffer tracking must be implemented before we can harness
>>>>+VM_BIND benefits. With increasing GPU hardware performance, reducing CPU
>>>>+overhead becomes more important.
>>>
>>>Just to clarify, I don't think we can land vm_bind into upstream if it
>>>doesn't work 100% for vk. There's a bit much "can" instead of "will in
>>>this section".
>>>
>>
>>ok, will explain better.
>>
>>>>+
>>>>+Page level hints settings
>>>>+--------------------------
>>>>+VM_BIND allows any hints setting per mapping instead of per BO.
>>>>+Possible hints include read-only, placement and atomicity.
>>>>+Sub-BO level placement hint will be even more relevant with
>>>>+upcoming GPU on-demand page fault support.
>>>>+
>>>>+Page level Cache/CLOS settings
>>>>+-------------------------------
>>>>+VM_BIND allows cache/CLOS settings per mapping instead of per BO.
>>>>+
>>>>+Shared Virtual Memory (SVM) support
>>>>+------------------------------------
>>>>+VM_BIND interface can be used to map system memory directly (without gem BO
>>>>+abstraction) using the HMM interface.
>>>
>>>Userptr is absent here (and it's not the same as svm, at least on
>>>discrete), and this is needed for the initial version since otherwise vk
>>>can't use it because we're not at feature parity.
>>>
>>
>>userptr gem objects are supported in the initial version (and yes, it is
>>not the same as SVM). I did not add it here as there is no additional uapi
>>change required to support it.
>>
>>>Irc discussions by Maarten and Dave came up with the idea that maybe
>>>userptr for vm_bind should work _without_ any gem bo as backing storage,
>>>since that guarantees that people don't come up with funny ideas like
>>>trying to share such bo across process or mmap it and other nonsense which
>>>just doesn't work.
>>>
>>
>>Hmm...there is no plan to support userptr _without_ a gem bo, at least not
>>in the initial vm_bind support. Is it OK to put it in the 'futures' section?
>>
>>>>+
>>>>+
>>>>+Broader i915 cleanups
>>>>+=====================
>>>>+Supporting this whole new vm_bind mode of binding, which comes with its own
>>>>+use cases and locking requirements, requires proper integration with the
>>>>+existing i915 driver. This calls for some broader i915 driver
>>>>+cleanups/simplifications for maintainability of the driver going forward.
>>>>+Here are a few things identified that are being looked into.
>>>>+
>>>>+- Make pagetable allocations evictable and manage them similar to VM_BIND
>>>>+  mapped objects. Page table pages are similar to persistent mappings of a
>>>>+  VM (the differences here are that the page table pages will not have an
>>>>+  i915_vma structure, and after swapping pages back in, the parent page
>>>>+  link needs to be updated).
>>>
>>>See above, but I think this should be included as part of the initial
>>>vm_bind push.
>>>
>>
>>Ok, as you mentioned above, we can do it soon after initial vm_bind support
>>lands, but before we add any new vm_bind features.
>>
>>>>+- Remove the vma lookup cache (eb->gem_context->handles_vma). The VM_BIND
>>>>+  feature does not use it, and the complexity it brings in is probably more
>>>>+  than the performance advantage we get in the legacy execbuff case.
>>>>+- Remove vma->open_count counting
>>>>+- Remove i915_vma active reference tracking. Instead use the underlying BO's
>>>>+  dma-resv fence list to determine if an i915_vma is active or not.
>>>
>>>So this is a complete mess, and really should not exist. I think it needs
>>>to be removed before we try to make i915_vma even more complex by adding
>>>vm_bind.
>>>
>>
>>Hmm...Need to look into this. I am not sure how much of an effort it is going
>>to be to remove i915_vma active reference tracking and instead use dma_resv
>>fences for activeness tracking.
>>
>>>The other thing I've been pondering here is that vm_bind is really
>>>completely different from legacy vm structures for a lot of reasons:
>>>- no relocation or softpin handling, which means vm_bind has no reason to
>>>ever look at the i915_vma structure in execbuf code. Unfortunately
>>>execbuf has been rewritten to be vma instead of obj centric, so it's a
>>>100% mismatch
>>>
>>>- vm_bind never has to manage any vm lru. Legacy execbuf has to maintain
>>>that because the kernel manages the virtual address space fully. Again
>>>ideally that entire vma_move_to_active code and everything related to it
>>>would simply not exist.
>>>
>>>- similar on the eviction side, the rules are quite different: For vm_bind
>>>we never tear down the vma, instead it's just moved to the list of
>>>evicted vma. Legacy vm have no need for all these additional lists, so
>>>another huge confusion.
>>>
>>>- if the refcount is done correctly for vm_bind we wouldn't need the
>>>tricky code in the bo close paths. Unfortunately legacy vm with
>>>relocations and softpin require that vma are only a weak reference, so
>>>that cannot be removed.
>>>
>>>- there's also a ton of special cases for ggtt handling, like the
>>>different views (for display, partial views for mmap), but also the
>>>gen2/3 alignment and padding requirements which vm_bind never needs.
>>>
>>>I think the right thing here is to massively split the implementation
>>>behind some solid vm/vma abstraction, with a base clase for vm and vma
>>>which _only_ has the pieces which both vm_bind and the legacy vm stuff
>>>needs. But it's a bit tricky to get there. I think a workable path would
>>>be:
>>>- Add a new base class to both i915_address_space and i915_vma, which
>>>starts out empty.
>>>
>>>- As vm_bind code lands, move things that vm_bind code needs into these
>>>base classes
>>>
>>
>>Ok
>>
>>>- The goal should be that these base classes are a stand-alone library
>>>that other drivers could reuse. Like we've done with the buddy
>>>allocator, which first moved from i915-gem to i915-ttm, and which amd
>>>now moved to drm/ttm for reuse by amdgpu. Ideally other drivers
>>>interested in adding something like vm_bind should be involved from the
>>>start (or maybe the entire thing reused in amdgpu, they're looking at
>>>vk sparse binding support too or at least have perf issues I think).
>>>
>>>- Locking must be the same across all implementations, otherwise it's
>>>really not an abstraction. i915 screwed this up terribly by having
>>>different locking rules for ppgtt and ggtt, which is just nonsense.
>>>
>>>- The legacy specific code needs to be extracted as much as possible and
>>>shoved into separate files. In execbuf this means we need to get back to
>>>object centric flow, and the slowpaths need to become a lot simpler
>>>again (Maarten has cleaned up some of this, but there's still a silly
>>>amount of hacks in there with funny layering).
>>>
>>
>>This also, we can do soon after vm_bind code lands right?
>>
>>>- I think if stuff like the vma eviction details (list movement and
>>>locking and refcounting of the underlying object)
>>>
>>>>+
>>>>+These can be worked upon after initial vm_bind support is added.
>>>
>>>I don't think that works, given how badly i915-gem team screwed up in
>>>other places. And those places had to be fixed by adopting shared code
>>>like ttm. Plus there's already a huge unfulfilled promise pending with the
>>>drm/sched conversion, i915-gem team is clearly deeply in the red here :-/
>>>
>>
>>Hmmm ok. As I mentioned above, I need to look into how to remove i915_vma
>>active reference tracking code from i915 driver. Wonder if there is any
>>middle ground here like not using that in vm_bind mode?
>>
>>Niranjana
>>
>>>Cheers, Daniel
>>>
>>>>+
>>>>+
>>>>+UAPI
>>>>+=====
>>>>+The uapi definition can be found here:
>>>>+
>>>>+.. kernel-doc:: Documentation/gpu/rfc/i915_vm_bind.h
>>>>diff --git a/Documentation/gpu/rfc/index.rst b/Documentation/gpu/rfc/index.rst
>>>>index 91e93a705230..7d10c36b268d 100644
>>>>--- a/Documentation/gpu/rfc/index.rst
>>>>+++ b/Documentation/gpu/rfc/index.rst
>>>>@@ -23,3 +23,7 @@ host such documentation:
>>>>.. toctree::
>>>>
>>>>    i915_scheduler.rst
>>>>+
>>>>+.. toctree::
>>>>+
>>>>+    i915_vm_bind.rst
>>>>--
>>>>2.21.0.rc0.32.g243a4c7e27
>>>>
>>>
>>>--
>>>Daniel Vetter
>>>Software Engineer, Intel Corporation
>>>http://blog.ffwll.ch


end of thread, other threads:[~2022-05-09 23:11 UTC | newest]

Thread overview: 31+ messages
2022-03-07 20:31 [RFC v2 0/2] drm/doc/rfc: i915 VM_BIND feature design + uapi Niranjana Vishwanathapura
2022-03-07 20:31 ` [Intel-gfx] " Niranjana Vishwanathapura
2022-03-07 20:31 ` [RFC v2 1/2] drm/doc/rfc: VM_BIND feature design document Niranjana Vishwanathapura
2022-03-07 20:31   ` [Intel-gfx] " Niranjana Vishwanathapura
2022-03-09 15:58   ` Alex Deucher
2022-03-09 15:58     ` [Intel-gfx] " Alex Deucher
2022-04-21  2:08     ` Niranjana Vishwanathapura
2022-04-21  2:08       ` [Intel-gfx] " Niranjana Vishwanathapura
2022-03-31  8:28   ` Daniel Vetter
2022-03-31  8:28     ` Daniel Vetter
2022-03-31 11:37     ` Daniel Vetter
2022-03-31 11:37       ` [Intel-gfx] " Daniel Vetter
2022-04-20 22:50       ` Niranjana Vishwanathapura
2022-04-20 22:50         ` [Intel-gfx] " Niranjana Vishwanathapura
2022-04-27 13:53         ` Daniel Vetter
2022-04-27 13:53           ` [Intel-gfx] " Daniel Vetter
2022-04-20 22:45     ` Niranjana Vishwanathapura
2022-04-20 22:45       ` [Intel-gfx] " Niranjana Vishwanathapura
2022-04-27 15:41       ` Niranjana Vishwanathapura
2022-04-27 15:41         ` [Intel-gfx] " Niranjana Vishwanathapura
2022-04-28 12:29         ` Daniel Vetter
2022-04-28 12:29           ` [Intel-gfx] " Daniel Vetter
2022-05-09 23:11         ` Niranjana Vishwanathapura
2022-05-09 23:11           ` [Intel-gfx] " Niranjana Vishwanathapura
2022-03-07 20:31 ` [RFC v2 2/2] drm/doc/rfc: VM_BIND uapi definition Niranjana Vishwanathapura
2022-03-07 20:31   ` [Intel-gfx] " Niranjana Vishwanathapura
2022-03-30 12:51   ` Daniel Vetter
2022-04-20 20:18     ` Niranjana Vishwanathapura
2022-03-07 20:38 ` [Intel-gfx] ✗ Fi.CI.CHECKPATCH: warning for drm/doc/rfc: i915 VM_BIND feature design + uapi (rev2) Patchwork
2022-03-07 20:43 ` [Intel-gfx] ✗ Fi.CI.DOCS: " Patchwork
2022-03-08 12:13 ` [Intel-gfx] ✓ Fi.CI.BAT: success " Patchwork
