Re: [Intel-gfx] [RFC v3 1/3] drm/doc/rfc: VM_BIND feature design document

From: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
To: Jason Ekstrand <jason@jlekstrand.net>,
	Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>
Cc: "Intel GFX" <intel-gfx@lists.freedesktop.org>,
	"Chris Wilson" <chris.p.wilson@intel.com>,
	"Thomas Hellstrom" <thomas.hellstrom@intel.com>,
	"Maling list - DRI developers" <dri-devel@lists.freedesktop.org>,
	"Daniel Vetter" <daniel.vetter@intel.com>,
	"Christian König" <christian.koenig@amd.com>
Subject: Re: [Intel-gfx] [RFC v3 1/3] drm/doc/rfc: VM_BIND feature design document
Date: Fri, 3 Jun 2022 10:20:25 +0300	[thread overview]
Message-ID: <bd615d4e-3911-a9ce-5d9f-fb85f7866d32@intel.com> (raw)
In-Reply-To: <CAOFGe94AXn_vqON++LpiCTqOspCrVZawcYmjL3W6A7tA5vjTpQ@mail.gmail.com>

[-- Attachment #1: Type: text/plain, Size: 9238 bytes --]

On 02/06/2022 23:35, Jason Ekstrand wrote:
> On Thu, Jun 2, 2022 at 3:11 PM Niranjana Vishwanathapura 
> <niranjana.vishwanathapura@intel.com> wrote:
>
>     On Wed, Jun 01, 2022 at 01:28:36PM -0700, Matthew Brost wrote:
>     >On Wed, Jun 01, 2022 at 05:25:49PM +0300, Lionel Landwerlin wrote:
>     >> On 17/05/2022 21:32, Niranjana Vishwanathapura wrote:
>     >> > +VM_BIND/UNBIND ioctl will immediately start
>     binding/unbinding the mapping in an
>     >> > +async worker. The binding and unbinding will work like a
>     special GPU engine.
>     >> > +The binding and unbinding operations are serialized and will
>     wait on specified
>     >> > +input fences before the operation and will signal the output
>     fences upon the
>     >> > +completion of the operation. Due to serialization,
>     completion of an operation
>     >> > +will also indicate that all previous operations are also
>     complete.
>     >>
>     >> I guess we should avoid saying "will immediately start
>     binding/unbinding" if
>     >> there are fences involved.
>     >>
>     >> And the fact that it's happening in an async worker seem to
>     imply it's not
>     >> immediate.
>     >>
>
>     Ok, will fix.
>     This was added because in earlier design binding was deferred
>     until next execbuff.
>     But now it is non-deferred (immediate in that sense). But yah,
>     this is confusing
>     and will fix it.
>
>     >>
>     >> I have a question on the behavior of the bind operation when no
>     input fence
>     >> is provided. Let say I do :
>     >>
>     >> VM_BIND (out_fence=fence1)
>     >>
>     >> VM_BIND (out_fence=fence2)
>     >>
>     >> VM_BIND (out_fence=fence3)
>     >>
>     >>
>     >> In what order are the fences going to be signaled?
>     >>
>     >> In the order of VM_BIND ioctls? Or out of order?
>     >>
>     >> Because you wrote "serialized I assume it's : in order
>     >>
>
>     Yes, in the order of VM_BIND/UNBIND ioctls. Note that bind and
>     unbind will use
>     the same queue and hence are ordered.
>
>     >>
>     >> One thing I didn't realize is that because we only get one
>     "VM_BIND" engine,
>     >> there is a disconnect from the Vulkan specification.
>     >>
>     >> In Vulkan VM_BIND operations are serialized but per engine.
>     >>
>     >> So you could have something like this :
>     >>
>     >> VM_BIND (engine=rcs0, in_fence=fence1, out_fence=fence2)
>     >>
>     >> VM_BIND (engine=ccs0, in_fence=fence3, out_fence=fence4)
>     >>
>     >>
>     >> fence1 is not signaled
>     >>
>     >> fence3 is signaled
>     >>
>     >> So the second VM_BIND will proceed before the first VM_BIND.
>     >>
>     >>
>     >> I guess we can deal with that scenario in userspace by doing
>     the wait
>     >> ourselves in one thread per engines.
>     >>
>     >> But then it makes the VM_BIND input fences useless.
>     >>
>     >>
>     >> Daniel : what do you think? Should be rework this or just deal
>     with wait
>     >> fences in userspace?
>     >>
>     >
>     >My opinion is rework this but make the ordering via an engine
>     param optional.
>     >
>     >e.g. A VM can be configured so all binds are ordered within the VM
>     >
>     >e.g. A VM can be configured so all binds accept an engine
>     argument (in
>     >the case of the i915 likely this is a gem context handle) and binds
>     >ordered with respect to that engine.
>     >
>     >This gives UMDs options as the later likely consumes more KMD
>     resources
>     >so if a different UMD can live with binds being ordered within the VM
>     >they can use a mode consuming less resources.
>     >
>
>     I think we need to be careful here if we are looking for some out of
>     (submission) order completion of vm_bind/unbind.
>     In-order completion means, in a batch of binds and unbinds to be
>     completed in-order, user only needs to specify in-fence for the
>     first bind/unbind call and the our-fence for the last bind/unbind
>     call. Also, the VA released by an unbind call can be re-used by
>     any subsequent bind call in that in-order batch.
>
>     These things will break if binding/unbinding were to be allowed to
>     go out of order (of submission) and user need to be extra careful
>     not to run into pre-mature triggereing of out-fence and bind failing
>     as VA is still in use etc.
>
>     Also, VM_BIND binds the provided mapping on the specified address
>     space
>     (VM). So, the uapi is not engine/context specific.
>
>     We can however add a 'queue' to the uapi which can be one from the
>     pre-defined queues,
>     I915_VM_BIND_QUEUE_0
>     I915_VM_BIND_QUEUE_1
>     ...
>     I915_VM_BIND_QUEUE_(N-1)
>
>     KMD will spawn an async work queue for each queue which will only
>     bind the mappings on that queue in the order of submission.
>     User can assign the queue to per engine or anything like that.
>
>     But again here, user need to be careful and not deadlock these
>     queues with circular dependency of fences.
>
>     I prefer adding this later an as extension based on whether it
>     is really helping with the implementation.
>
>
> I can tell you right now that having everything on a single in-order 
> queue will not get us the perf we want. What vulkan really wants is 
> one of two things:
>
>  1. No implicit ordering of VM_BIND ops.  They just happen in whatever 
> their dependencies are resolved and we ensure ordering ourselves by 
> having a syncobj in the VkQueue.
>
>  2. The ability to create multiple VM_BIND queues.  We need at least 2 
> but I don't see why there needs to be a limit besides the limits the 
> i915 API already has on the number of engines.  Vulkan could expose 
> multiple sparse binding queues to the client if it's not arbitrarily 
> limited.
>
> Why?  Because Vulkan has two basic kind of bind operations and we 
> don't want any dependencies between them:
>
>  1. Immediate.  These happen right after BO creation or maybe as part 
> of vkBindImageMemory() or VkBindBufferMemory().  These don't happen on 
> a queue and we don't want them serialized with anything.  To 
> synchronize with submit, we'll have a syncobj in the VkDevice which is 
> signaled by all immediate bind operations and make submits wait on it.
>
>  2. Queued (sparse): These happen on a VkQueue which may be the same 
> as a render/compute queue or may be its own queue.  It's up to us what 
> we want to advertise.  From the Vulkan API PoV, this is like any other 
> queue.  Operations on it wait on and signal semaphores.  If we have a 
> VM_BIND engine, we'd provide syncobjs to wait and signal just like we 
> do in execbuf().
>
> The important thing is that we don't want one type of operation to 
> block on the other.  If immediate binds are blocking on sparse binds, 
> it's going to cause over-synchronization issues.
>
> In terms of the internal implementation, I know that there's going to 
> be a lock on the VM and that we can't actually do these things in 
> parallel.  That's fine.  Once the dma_fences have signaled and we're 
> unblocked to do the bind operation, I don't care if there's a bit of 
> synchronization due to locking.  That's expected.  What we can't 
> afford to have is an immediate bind operation suddenly blocking on a 
> sparse operation which is blocked on a compute job that's going to run 
> for another 5ms.
>
> For reference, Windows solves this by allowing arbitrarily many paging 
> queues (what they call a VM_BIND engine/queue).  That design works 
> pretty well and solves the problems in question.  Again, we could just 
> make everything out-of-order and require using syncobjs to order 
> things as userspace wants. That'd be fine too.
>
> One more note while I'm here: danvet said something on IRC about 
> VM_BIND queues waiting for syncobjs to materialize.  We don't really 
> want/need this.  We already have all the machinery in userspace to 
> handle wait-before-signal and waiting for syncobj fences to 
> materialize and that machinery is on by default.  It would actually 
> take MORE work in Mesa to turn it off and take advantage of the kernel 
> being able to wait for syncobjs to materialize.  Also, getting that 
> right is ridiculously hard and I really don't want to get it wrong in 
> kernel space. When we do memory fences, wait-before-signal will be a 
> thing.  We don't need to try and make it a thing for syncobj.
>
> --Jason

Thanks Jason,

I missed the bit in the Vulkan spec that we're allowed to have a sparse 
queue that does not implement either graphics or compute operations :

    "While some implementations may include VK_QUEUE_SPARSE_BINDING_BIT
    support in queue families that also include

      graphics and compute support, other implementations may only
    expose a VK_QUEUE_SPARSE_BINDING_BIT-only queue

      family."

So it can all be all a vm_bind engine that just does bind/unbind 
operations.

But yes we need another engine for the immediate/non-sparse operations.

-Lionel

>     Daniel, any thoughts?
>
>     Niranjana
>
>     >Matt
>     >
>     >>
>     >> Sorry I noticed this late.
>     >>
>     >>
>     >> -Lionel
>     >>
>     >>
>

[-- Attachment #2: Type: text/html, Size: 14209 bytes --]