On Fri, Jun 3, 2022 at 6:52 PM Niranjana Vishwanathapura
<niranjana.vishwanathapura@intel.com> wrote:
> On Fri, Jun 03, 2022 at 10:20:25AM +0300, Lionel Landwerlin wrote:
> > On 02/06/2022 23:35, Jason Ekstrand wrote:
> >
> > On Thu, Jun 2, 2022 at 3:11 PM Niranjana Vishwanathapura wrote:
> >
> > On Wed, Jun 01, 2022 at 01:28:36PM -0700, Matthew Brost wrote:
> > >On Wed, Jun 01, 2022 at 05:25:49PM +0300, Lionel Landwerlin wrote:
> > >> On 17/05/2022 21:32, Niranjana Vishwanathapura wrote:
> > >> > +VM_BIND/UNBIND ioctl will immediately start binding/unbinding the mapping in an
> > >> > +async worker. The binding and unbinding will work like a special GPU engine.
> > >> > +The binding and unbinding operations are serialized and will wait on specified
> > >> > +input fences before the operation and will signal the output fences upon the
> > >> > +completion of the operation. Due to serialization, completion of an operation
> > >> > +will also indicate that all previous operations are also complete.
> > >>
> > >> I guess we should avoid saying "will immediately start binding/unbinding" if
> > >> there are fences involved.
> > >>
> > >> And the fact that it's happening in an async worker seems to imply it's not
> > >> immediate.
> > >>
> >
> > Ok, will fix.
> > This was added because in the earlier design binding was deferred until
> > the next execbuff.
> > But now it is non-deferred (immediate in that sense). But yah, this is
> > confusing and will fix it.
> >
> > >>
> > >> I have a question on the behavior of the bind operation when no input fence
> > >> is provided. Let's say I do:
> > >>
> > >> VM_BIND (out_fence=fence1)
> > >>
> > >> VM_BIND (out_fence=fence2)
> > >>
> > >> VM_BIND (out_fence=fence3)
> > >>
> > >> In what order are the fences going to be signaled?
> > >>
> > >> In the order of VM_BIND ioctls? Or out of order?
> > >>
> > >> Because you wrote "serialized" I assume it's: in order.
> > >>
> >
> > Yes, in the order of VM_BIND/UNBIND ioctls. Note that bind and unbind
> > will use the same queue and hence are ordered.
> >
> > >>
> > >> One thing I didn't realize is that because we only get one "VM_BIND" engine,
> > >> there is a disconnect from the Vulkan specification.
> > >>
> > >> In Vulkan, VM_BIND operations are serialized but per engine.
> > >>
> > >> So you could have something like this:
> > >>
> > >> VM_BIND (engine=rcs0, in_fence=fence1, out_fence=fence2)
> > >>
> > >> VM_BIND (engine=ccs0, in_fence=fence3, out_fence=fence4)
> > >>
> > >> fence1 is not signaled
> > >>
> > >> fence3 is signaled
> > >>
> > >> So the second VM_BIND will proceed before the first VM_BIND.
> > >>
> > >> I guess we can deal with that scenario in userspace by doing the wait
> > >> ourselves in one thread per engine.
> > >>
> > >> But then it makes the VM_BIND input fences useless.
> > >>
> > >> Daniel: what do you think? Should we rework this or just deal with wait
> > >> fences in userspace?
> > >>
> > >
> > >My opinion is rework this but make the ordering via an engine param optional.
> > >
> > >e.g. A VM can be configured so all binds are ordered within the VM.
> > >
> > >e.g. A VM can be configured so all binds accept an engine argument (in
> > >the case of the i915 likely this is a gem context handle) and binds
> > >ordered with respect to that engine.
> > >
> > >This gives UMDs options as the latter likely consumes more KMD resources,
> > >so if a different UMD can live with binds being ordered within the VM
> > >they can use a mode consuming less resources.
> > >
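To make the fence semantics discussed above concrete, here is a rough
sketch of what a bind request carrying an input and an output fence could
look like. This is purely illustrative; the struct and field names below
are made up and are not the actual uapi, which is still being designed:

#include <linux/types.h>

/* Illustrative only: a hypothetical bind request with an input and an
 * output fence.  None of these names are the real i915 interface. */
struct hypothetical_vm_bind {
	__u32 vm_id;      /* address space (VM) to bind into */
	__u32 handle;     /* GEM object handle */
	__u64 start;      /* GPU virtual address of the mapping */
	__u64 offset;     /* offset into the object */
	__u64 length;     /* length of the mapping */
	__u64 flags;
	__u32 in_fence;   /* optional syncobj to wait on before binding */
	__u32 out_fence;  /* syncobj signaled once the bind has completed */
};

With the serialized behaviour described in the quoted doc, fence1, fence2
and fence3 in Lionel's example would signal in submission order.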
> >
> > I think we need to be careful here if we are looking for some out of
> > (submission) order completion of vm_bind/unbind.
> > In-order completion means, in a batch of binds and unbinds to be
> > completed in-order, user only needs to specify in-fence for the
> > first bind/unbind call and the out-fence for the last bind/unbind
> > call. Also, the VA released by an unbind call can be re-used by
> > any subsequent bind call in that in-order batch.
> >
> > These things will break if binding/unbinding were to be allowed to
> > go out of order (of submission) and the user needs to be extra careful
> > not to run into premature triggering of out-fence and bind failing
> > as VA is still in use etc.
> >
> > Also, VM_BIND binds the provided mapping on the specified address space
> > (VM). So, the uapi is not engine/context specific.
> >
> > We can however add a 'queue' to the uapi which can be one from the
> > pre-defined queues,
> > I915_VM_BIND_QUEUE_0
> > I915_VM_BIND_QUEUE_1
> > ...
> > I915_VM_BIND_QUEUE_(N-1)
> >
> > KMD will spawn an async work queue for each queue which will only
> > bind the mappings on that queue in the order of submission.
> > User can assign the queue to per engine or anything like that.
> >
> > But again here, the user needs to be careful and not deadlock these
> > queues with circular dependency of fences.
> >
> > I prefer adding this later as an extension based on whether it
> > is really helping with the implementation.
> >
> > I can tell you right now that having everything on a single in-order
> > queue will not get us the perf we want. What Vulkan really wants is one
> > of two things:
> > 1. No implicit ordering of VM_BIND ops. They just happen in whatever
> > order their dependencies are resolved and we ensure ordering ourselves by
> > having a syncobj in the VkQueue.
> > 2. The ability to create multiple VM_BIND queues. We need at least 2
> > but I don't see why there needs to be a limit besides the limits the
> > i915 API already has on the number of engines. Vulkan could expose
> > multiple sparse binding queues to the client if it's not arbitrarily
> > limited.
>
> Thanks Jason, Lionel.
>
> Jason, what are you referring to when you say "limits the i915 API already
> has on the number of engines"? I am not sure if there is such an uapi
> today.

There's a limit of something like 64 total engines today based on the
number of bits we can cram into the exec flags in execbuffer2. I think
someone had an extended version that allowed more but I ripped it out
because no one was using it. Of course, execbuffer3 might not have that
problem at all.

> I am trying to see how many queues we need and don't want it to be
> arbitrarily large and unduly blow up memory usage and complexity in the
> i915 driver.

I expect a Vulkan driver to use at most 2 in the vast majority of cases.
I could imagine a client wanting to create more than 1 sparse queue in
which case, it'll be N+1 but that's unlikely. As far as complexity goes,
once you allow two, I don't think the complexity is going up by allowing
N. As for memory usage, creating more queues means more memory. That's a
trade-off that userspace can make. Again, the expected number here is 1
or 2 in the vast majority of cases so I don't think you need to worry.
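As a sketch of how a userspace driver could use such pre-defined queues if
more than one were exposed, splitting the two kinds of Vulkan binds
described further down. The queue names and helper below are hypothetical
and not part of any real interface:

#include <stdbool.h>

/* Illustrative only: binds on the same queue complete in submission
 * order; binds on different queues are independent of each other. */
enum {
	BIND_QUEUE_IMMEDIATE = 0, /* vkBindBufferMemory()/vkBindImageMemory()-style binds */
	BIND_QUEUE_SPARSE    = 1  /* vkQueueBindSparse()-style binds */
};

static inline unsigned pick_bind_queue(bool is_sparse_queue_op)
{
	return is_sparse_queue_op ? BIND_QUEUE_SPARSE : BIND_QUEUE_IMMEDIATE;
}

Splitting immediate and sparse binds across two queues like this is what
keeps the immediate path from waiting behind a sparse bind, which is the
over-synchronization concern discussed below.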
> > Why? Because Vulkan has two basic kinds of bind operations and we don't
> > want any dependencies between them:
> > 1. Immediate. These happen right after BO creation or maybe as part of
> > vkBindImageMemory() or vkBindBufferMemory(). These don't happen on a
> > queue and we don't want them serialized with anything. To synchronize
> > with submit, we'll have a syncobj in the VkDevice which is signaled by
> > all immediate bind operations and make submits wait on it.
> > 2. Queued (sparse): These happen on a VkQueue which may be the same as
> > a render/compute queue or may be its own queue. It's up to us what we
> > want to advertise. From the Vulkan API PoV, this is like any other
> > queue. Operations on it wait on and signal semaphores. If we have a
> > VM_BIND engine, we'd provide syncobjs to wait and signal just like we do
> > in execbuf().
> > The important thing is that we don't want one type of operation to block
> > on the other. If immediate binds are blocking on sparse binds, it's
> > going to cause over-synchronization issues.
> > In terms of the internal implementation, I know that there's going to be
> > a lock on the VM and that we can't actually do these things in
> > parallel. That's fine. Once the dma_fences have signaled and we're
>
> That's correct. It is like a single VM_BIND engine with multiple queues
> feeding to it.

Right. As long as the queues themselves are independent and can block on
dma_fences without holding up other queues, I think we're fine.

> > unblocked to do the bind operation, I don't care if there's a bit of
> > synchronization due to locking. That's expected. What we can't afford
> > to have is an immediate bind operation suddenly blocking on a sparse
> > operation which is blocked on a compute job that's going to run for
> > another 5ms.
>
> As the VM_BIND queue is per VM, VM_BIND on one VM doesn't block the VM_BIND
> on other VMs. I am not sure about usecases here, but just wanted to
> clarify.

Yes, that's what I would expect.

--Jason

> Niranjana
>
> > For reference, Windows solves this by allowing arbitrarily many paging
> > queues (what they call a VM_BIND engine/queue). That design works
> > pretty well and solves the problems in question. Again, we could just
> > make everything out-of-order and require using syncobjs to order things
> > as userspace wants. That'd be fine too.
> > One more note while I'm here: danvet said something on IRC about VM_BIND
> > queues waiting for syncobjs to materialize. We don't really want/need
> > this. We already have all the machinery in userspace to handle
> > wait-before-signal and waiting for syncobj fences to materialize and
> > that machinery is on by default. It would actually take MORE work in
> > Mesa to turn it off and take advantage of the kernel being able to wait
> > for syncobjs to materialize. Also, getting that right is ridiculously
> > hard and I really don't want to get it wrong in kernel space. When we
> > do memory fences, wait-before-signal will be a thing. We don't need to
> > try and make it a thing for syncobj.
> >
> > --Jason
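For reference, the "queued (sparse)" case described above is just an
ordinary queue submission from the application's point of view. A minimal
sketch using the stock Vulkan API, assuming the queue, semaphores, buffer
and memory already exist and with error handling omitted:

#include <vulkan/vulkan.h>

static void queue_sparse_bind(VkQueue sparse_queue,
                              VkSemaphore wait_sem, VkSemaphore signal_sem,
                              VkBuffer buffer, VkDeviceMemory memory,
                              VkDeviceSize size)
{
	const VkSparseMemoryBind bind = {
		.resourceOffset = 0,
		.size = size,
		.memory = memory,
		.memoryOffset = 0,
	};
	const VkSparseBufferMemoryBindInfo buffer_bind = {
		.buffer = buffer,
		.bindCount = 1,
		.pBinds = &bind,
	};
	const VkBindSparseInfo info = {
		.sType = VK_STRUCTURE_TYPE_BIND_SPARSE_INFO,
		.waitSemaphoreCount = 1,
		.pWaitSemaphores = &wait_sem,
		.bufferBindCount = 1,
		.pBufferBinds = &buffer_bind,
		.signalSemaphoreCount = 1,
		.pSignalSemaphores = &signal_sem,
	};
	/* The bind waits on wait_sem and signals signal_sem, exactly like
	 * any other queue submission. */
	vkQueueBindSparse(sparse_queue, 1, &info, VK_NULL_HANDLE);
}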
> >
> > Thanks Jason,
> >
> > I missed the bit in the Vulkan spec that we're allowed to have a sparse
> > queue that does not implement either graphics or compute operations:
> >
> > "While some implementations may include VK_QUEUE_SPARSE_BINDING_BIT
> > support in queue families that also include graphics and compute
> > support, other implementations may only expose a
> > VK_QUEUE_SPARSE_BINDING_BIT-only queue family."
> >
> > So it can all be a vm_bind engine that just does bind/unbind
> > operations.
> >
> > But yes we need another engine for the immediate/non-sparse operations.
> >
> > -Lionel
> >
> > Daniel, any thoughts?
> >
> > Niranjana
> >
> > >Matt
> > >
> > >>
> > >> Sorry I noticed this late.
> > >>
> > >> -Lionel
> > >>
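For completeness, a small sketch of how an application would look for the
kind of queue family the spec text above describes, using only standard
Vulkan calls and assuming a valid VkPhysicalDevice:

#include <stdint.h>
#include <vulkan/vulkan.h>

/* Returns the index of a queue family with sparse binding support,
 * preferring a family that is not also graphics/compute capable;
 * returns UINT32_MAX if none exists. */
static uint32_t find_sparse_queue_family(VkPhysicalDevice pdev)
{
	uint32_t count = 0;
	vkGetPhysicalDeviceQueueFamilyProperties(pdev, &count, NULL);

	VkQueueFamilyProperties props[count];
	vkGetPhysicalDeviceQueueFamilyProperties(pdev, &count, props);

	uint32_t fallback = UINT32_MAX;
	for (uint32_t i = 0; i < count; i++) {
		if (!(props[i].queueFlags & VK_QUEUE_SPARSE_BINDING_BIT))
			continue;
		if (!(props[i].queueFlags &
		      (VK_QUEUE_GRAPHICS_BIT | VK_QUEUE_COMPUTE_BIT)))
			return i; /* sparse-binding-only family */
		fallback = i;
	}
	return fallback;
}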