From mboxrd@z Thu Jan 1 00:00:00 1970
From: Jason Ekstrand
Date: Wed, 8 Jun 2022 16:55:38 -0500
Subject: Re: [Intel-gfx] [RFC v3 1/3] drm/doc/rfc: VM_BIND feature design document
To: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>
Cc: Tvrtko Ursulin, Intel GFX, Chris Wilson, Thomas Hellstrom, Maling list - DRI developers, Daniel Vetter, Christian König
In-Reply-To: <20220608214431.GD4461@nvishwa1-DESK>
List-Id: Direct Rendering Infrastructure - Development
On Wed, Jun 8, 2022 at 4:44 PM Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com> wrote:

> On Wed, Jun 08, 2022 at 08:33:25AM +0100, Tvrtko Ursulin wrote:
> >
> >On 07/06/2022 22:32, Niranjana Vishwanathapura wrote:
> >>On Tue, Jun 07, 2022 at 11:18:11AM -0700, Niranjana Vishwanathapura wrote:
> >>>On Tue, Jun 07, 2022 at 12:12:03PM -0500, Jason Ekstrand wrote:
> >>>> On Fri, Jun 3, 2022 at 6:52 PM Niranjana Vishwanathapura wrote:
> >>>> On Fri, Jun 03, 2022 at 10:20:25AM +0300, Lionel Landwerlin wrote:
> >>>> > On 02/06/2022 23:35, Jason Ekstrand wrote:
> >>>> > On Thu, Jun 2, 2022 at 3:11 PM Niranjana Vishwanathapura wrote:
> >>>> > On Wed, Jun 01, 2022 at 01:28:36PM -0700, Matthew Brost wrote:
> >>>> > >On Wed, Jun 01, 2022 at 05:25:49PM +0300, Lionel Landwerlin wrote:
> >>>> > >> On 17/05/2022 21:32, Niranjana Vishwanathapura wrote:
> >>>> > >> > +VM_BIND/UNBIND ioctl will immediately start binding/unbinding the mapping in an
> >>>> > >> > +async worker. The binding and unbinding will work like a special GPU engine.
> >>>> > >> > +The binding and unbinding operations are serialized and will wait on specified
> >>>> > >> > +input fences before the operation and will signal the output fences upon the
> >>>> > >> > +completion of the operation. Due to serialization, completion of an operation
> >>>> > >> > +will also indicate that all previous operations are also complete.
> >>>> > >>
> >>>> > >> I guess we should avoid saying "will immediately start binding/unbinding" if there are fences involved.
> >>>> > >>
> >>>> > >> And the fact that it's happening in an async worker seems to imply it's not immediate.
> >>>> >
> >>>> > Ok, will fix. This was added because in the earlier design binding was deferred until the next execbuff. But now it is non-deferred (immediate in that sense). But yah, this is confusing and I will fix it.
> >>>> >
> >>>> > >> I have a question on the behavior of the bind operation when no input fence is provided. Let's say I do:
> >>>> > >>
> >>>> > >> VM_BIND (out_fence=fence1)
> >>>> > >> VM_BIND (out_fence=fence2)
> >>>> > >> VM_BIND (out_fence=fence3)
> >>>> > >>
> >>>> > >> In what order are the fences going to be signaled? In the order of VM_BIND ioctls? Or out of order?
> >>>> > >>
> >>>> > >> Because you wrote "serialized" I assume it's: in order.
> >>>> >
> >>>> > Yes, in the order of VM_BIND/UNBIND ioctls. Note that bind and unbind will use the same queue and hence are ordered.
> >>>> >
> >>>> > >> One thing I didn't realize is that because we only get one "VM_BIND" engine, there is a disconnect from the Vulkan specification.
> >>>> > >>
> >>>> > >> In Vulkan VM_BIND operations are serialized but per engine.
> >>>> > >>
> >>>> > >> So you could have something like this:
> >>>> > >>
> >>>> > >> VM_BIND (engine=rcs0, in_fence=fence1, out_fence=fence2)
> >>>> > >> VM_BIND (engine=ccs0, in_fence=fence3, out_fence=fence4)
> >>>> > >>
> >>>> > >> fence1 is not signaled
> >>>> > >> fence3 is signaled
> >>>> > >>
> >>>> > >> So the second VM_BIND will proceed before the first VM_BIND.
> >>>> > >>
> >>>> > >> I guess we can deal with that scenario in userspace by doing the wait ourselves in one thread per engine.
> >>>> > >>
> >>>> > >> But then it makes the VM_BIND input fences useless.
> >>>> > >>
> >>>> > >> Daniel: what do you think? Should we rework this or just deal with wait fences in userspace?
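The "per engine" serialization Lionel refers to is Vulkan's sparse-binding model: each VkQueue orders its own vkQueueBindSparse() submissions, but different queues are not ordered against each other. A minimal sketch of his rcs0/ccs0 scenario on the Vulkan side could look like the following (illustrative only; the queues, semaphores, and buffer-bind structures are assumed to have been created elsewhere):

    #include <vulkan/vulkan.h>

    /* Two sparse-binding queues, each serialized only against itself. */
    void bind_on_two_queues(VkQueue sparse_queue_a, VkQueue sparse_queue_b,
                            VkSemaphore fence1, VkSemaphore fence2,
                            VkSemaphore fence3, VkSemaphore fence4,
                            const VkSparseBufferMemoryBindInfo *bind_a,
                            const VkSparseBufferMemoryBindInfo *bind_b)
    {
        VkBindSparseInfo info_a = {
            .sType = VK_STRUCTURE_TYPE_BIND_SPARSE_INFO,
            .waitSemaphoreCount = 1, .pWaitSemaphores = &fence1,  /* not signaled yet */
            .bufferBindCount = 1, .pBufferBinds = bind_a,
            .signalSemaphoreCount = 1, .pSignalSemaphores = &fence2,
        };
        VkBindSparseInfo info_b = {
            .sType = VK_STRUCTURE_TYPE_BIND_SPARSE_INFO,
            .waitSemaphoreCount = 1, .pWaitSemaphores = &fence3,  /* already signaled */
            .bufferBindCount = 1, .pBufferBinds = bind_b,
            .signalSemaphoreCount = 1, .pSignalSemaphores = &fence4,
        };

        vkQueueBindSparse(sparse_queue_a, 1, &info_a, VK_NULL_HANDLE);
        vkQueueBindSparse(sparse_queue_b, 1, &info_b, VK_NULL_HANDLE);
        /* Only binds submitted to the same queue are ordered, so the bind on
         * sparse_queue_b may complete before the one on sparse_queue_a. */
    }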
> >>>> > >
> >>>> > >My opinion is rework this but make the ordering via an engine param optional.
> >>>> > >
> >>>> > >e.g. A VM can be configured so all binds are ordered within the VM.
> >>>> > >
> >>>> > >e.g. A VM can be configured so all binds accept an engine argument (in the case of the i915 likely this is a gem context handle) and binds are ordered with respect to that engine.
> >>>> > >
> >>>> > >This gives UMDs options, as the latter likely consumes more KMD resources, so if a different UMD can live with binds being ordered within the VM, it can use a mode consuming less resources.
> >>>> >
> >>>> > I think we need to be careful here if we are looking for some out of (submission) order completion of vm_bind/unbind.
> >>>> > In-order completion means that, in a batch of binds and unbinds to be completed in-order, the user only needs to specify an in-fence for the first bind/unbind call and the out-fence for the last bind/unbind call. Also, the VA released by an unbind call can be re-used by any subsequent bind call in that in-order batch.
> >>>> > These things will break if binding/unbinding were to be allowed to go out of order (of submission), and the user would need to be extra careful not to run into premature triggering of out-fences, binds failing because the VA is still in use, etc.
> >>>> >
> >>>> > Also, VM_BIND binds the provided mapping on the specified address space (VM). So, the uapi is not engine/context specific.
> >>>> >
> >>>> > We can however add a 'queue' to the uapi, which can be one of the pre-defined queues,
> >>>> > I915_VM_BIND_QUEUE_0
> >>>> > I915_VM_BIND_QUEUE_1
> >>>> > ...
> >>>> > I915_VM_BIND_QUEUE_(N-1)
> >>>> > KMD will spawn an async work queue for each queue, which will only bind the mappings on that queue in the order of submission. The user can assign a queue per engine or anything like that.
> >>>> >
> >>>> > But again here, the user needs to be careful not to deadlock these queues with circular dependencies of fences.
> >>>> >
> >>>> > I prefer adding this later as an extension, based on whether it is really helping with the implementation.
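To make the in-order model above concrete, this is roughly how a userspace batch would use fences against a single serialized bind queue. The structure, wrapper, and field names below are purely illustrative placeholders, not the RFC's actual uapi; the only point is that the first operation carries the wait and the last one carries the signal:

    #include <stdint.h>
    #include <stdbool.h>

    /* Placeholder description of one bind operation; the real uapi differs. */
    struct vm_bind_op {
        uint32_t handle;                  /* GEM BO to map */
        uint64_t start, offset, length;   /* VA, BO offset, size */
    };

    /* Hypothetical wrapper around the VM_BIND ioctl: waits on in_fence_fd if
     * it is >= 0, and returns an out-fence fd if one was requested (else -1). */
    int vm_bind_one(int drm_fd, uint32_t vm_id, const struct vm_bind_op *op,
                    int in_fence_fd, bool want_out_fence);

    /* Submit a batch of binds that must complete in submission order. Only
     * the first op needs the input fence and only the last op needs an output
     * fence: because the queue completes in order, the final out-fence also
     * covers every earlier op, and a VA unbound earlier in the batch can be
     * re-bound later in the same batch. */
    static int vm_bind_batch_in_order(int drm_fd, uint32_t vm_id,
                                      struct vm_bind_op *ops, unsigned int count,
                                      int batch_in_fence_fd)
    {
        int batch_out_fence_fd = -1;

        for (unsigned int i = 0; i < count; i++) {
            bool first = (i == 0), last = (i == count - 1);
            int fd = vm_bind_one(drm_fd, vm_id, &ops[i],
                                 first ? batch_in_fence_fd : -1, last);
            if (last)
                batch_out_fence_fd = fd;
        }

        return batch_out_fence_fd; /* signals when the whole batch has completed */
    }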
> >>>> >
> >>>> > I can tell you right now that having everything on a single in-order queue will not get us the perf we want. What Vulkan really wants is one of two things:
> >>>> > 1. No implicit ordering of VM_BIND ops. They just happen in whatever order their dependencies are resolved and we ensure ordering ourselves by having a syncobj in the VkQueue.
> >>>> > 2. The ability to create multiple VM_BIND queues. We need at least 2, but I don't see why there needs to be a limit besides the limits the i915 API already has on the number of engines. Vulkan could expose multiple sparse binding queues to the client if it's not arbitrarily limited.
> >>>>
> >>>> Thanks Jason, Lionel.
> >>>>
> >>>> Jason, what are you referring to when you say "limits the i915 API already has on the number of engines"? I am not sure if there is such an uapi today.
> >>>>
> >>>> There's a limit of something like 64 total engines today based on the number of bits we can cram into the exec flags in execbuffer2. I think someone had an extended version that allowed more but I ripped it out because no one was using it. Of course, execbuffer3 might not have that problem at all.
> >>>
> >>>Thanks Jason.
> >>>Ok, I am not sure which exec flag that is, but yah, execbuffer3 probably will not have this limitation. So, we need to define a VM_BIND_MAX_QUEUE and somehow export it to the user (I am thinking of embedding it in I915_PARAM_HAS_VM_BIND: bits[0]->HAS_VM_BIND, bits[1-3]->'n', meaning 2^n queues).
> >>
> >>Ah, I think you are talking about I915_EXEC_RING_MASK (0x3f) which execbuf3

Yup! That's exactly the limit I was talking about.

> >>will also have. So, we can simply define in the vm_bind/unbind structures,
> >>
> >>#define I915_VM_BIND_MAX_QUEUE   64
> >>        __u32 queue;
> >>
> >>I think that will keep things simple.
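As a sketch of what the reporting scheme proposed above would look like from userspace, the decode is a getparam plus a couple of bit operations. The parameter name and bit layout are the proposal from this thread, not merged uapi, and the placeholder define only exists so the snippet is self-contained:

    #include <sys/ioctl.h>
    #include <i915_drm.h>

    /* Proposed in this thread; not in i915_drm.h yet, value is a placeholder. */
    #ifndef I915_PARAM_HAS_VM_BIND
    #define I915_PARAM_HAS_VM_BIND 0
    #endif

    /* Returns the number of VM_BIND queues, or 0 if VM_BIND is unsupported,
     * following the proposed encoding: bit 0 = HAS_VM_BIND, bits 1-3 = 'n'
     * with 2^n queues. */
    static unsigned int i915_vm_bind_queue_count(int drm_fd)
    {
        int value = 0;
        drm_i915_getparam_t gp = {
            .param = I915_PARAM_HAS_VM_BIND,
            .value = &value,
        };

        if (ioctl(drm_fd, DRM_IOCTL_I915_GETPARAM, &gp) || !(value & 0x1))
            return 0;

        return 1u << ((value >> 1) & 0x7);
    }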
> >
> >Hmmm? What does the execbuf2 limit have to do with how many engines the hardware can have? I suggest not to do that.
> >
> >The change which added this:
> >
> >       if (set.num_engines > I915_EXEC_RING_MASK + 1)
> >               return -EINVAL;
> >
> >to context creation needs to be undone, so that users can create engine maps with all hardware engines and execbuf3 can access them all.
>
> Earlier plan was to carry I915_EXEC_RING_MASK (0x3f) over to execbuff3 also. Hence, I was using the same limit for VM_BIND queues (64, or 65 if we make it N+1).
> But, as discussed in another thread of this RFC series, we are planning to drop this I915_EXEC_RING_MASK in execbuff3. So, there won't be any uapi that limits the number of engines (and hence the number of vm_bind queues that need to be supported).
>
> If we leave the number of vm_bind queues arbitrarily large (__u32 queue_idx), then we need a hashmap for queue lookup (a wq, work_item and a linked list) from the user-specified queue index. The other option is to just put some hard limit (say 64 or 65) and use an array of queues in the VM (each created upon first use). I prefer this.

I don't get why a VM_BIND queue is any different from any other queue or userspace-visible kernel object. But I'll leave those details up to danvet or whoever else might be reviewing the implementation.

--Jason

> Niranjana
>
> >Regards,
> >
> >Tvrtko
> >
> >>Niranjana
> >>
> >>>> I am trying to see how many queues we need and don't want it to be arbitrarily large and unduly blow up memory usage and complexity in the i915 driver.
> >>>>
> >>>> I expect a Vulkan driver to use at most 2 in the vast majority of cases. I could imagine a client wanting to create more than 1 sparse queue, in which case it'll be N+1, but that's unlikely. As far as complexity goes, once you allow two, I don't think the complexity is going up by allowing N. As for memory usage, creating more queues means more memory. That's a trade-off that userspace can make. Again, the expected number here is 1 or 2 in the vast majority of cases, so I don't think you need to worry.
> >>>
> >>>Ok, will start with n=3 meaning 8 queues. That would require us to create 8 workqueues. We can change 'n' later if required.
> >>>
> >>>Niranjana
> >>>
> >>>> > Why? Because Vulkan has two basic kinds of bind operations and we don't want any dependencies between them:
> >>>> > 1. Immediate. These happen right after BO creation or maybe as part of vkBindImageMemory() or vkBindBufferMemory(). These don't happen on a queue and we don't want them serialized with anything. To synchronize with submit, we'll have a syncobj in the VkDevice which is signaled by all immediate bind operations, and make submits wait on it.
> >>>> > 2. Queued (sparse): These happen on a VkQueue which may be the same as a render/compute queue or may be its own queue. It's up to us what we want to advertise. From the Vulkan API PoV, this is like any other queue. Operations on it wait on and signal semaphores. If we have a VM_BIND engine, we'd provide syncobjs to wait and signal just like we do in execbuf().
> >>>> > The important thing is that we don't want one type of operation to block on the other. If immediate binds are blocking on sparse binds, it's going to cause over-synchronization issues.
> >>>> > In terms of the internal implementation, I know that there's going to be a lock on the VM and that we can't actually do these things in parallel. That's fine. Once the dma_fences have signaled and we're
> >>>>
> >>>> That's correct. It is like a single VM_BIND engine with multiple queues feeding into it.
> >>>>
> >>>> Right. As long as the queues themselves are independent and can block on dma_fences without holding up other queues, I think we're fine.
> >>>>
> >>>> > unblocked to do the bind operation, I don't care if there's a bit of synchronization due to locking. That's expected. What we can't afford to have is an immediate bind operation suddenly blocking on a sparse operation which is blocked on a compute job that's going to run for another 5ms.
> >>>>
> >>>> As the VM_BIND queue is per VM, VM_BIND on one VM doesn't block the VM_BIND on other VMs. I am not sure about use cases here, but just wanted to clarify.
> >>>>
> >>>> Yes, that's what I would expect.
> >>>> --Jason
> >>>>
> >>>> Niranjana
> >>>>
> >>>> > For reference, Windows solves this by allowing arbitrarily many paging queues (what they call a VM_BIND engine/queue). That design works pretty well and solves the problems in question. Again, we could just make everything out-of-order and require using syncobjs to order things as userspace wants. That'd be fine too.
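The immediate-bind half of that scheme is easy to picture as a device-wide timeline: each immediate bind bumps a point on a single syncobj and each submit waits on the latest point, while sparse binds stay on their own queue(s) and are ordered only by semaphores. The sketch below is illustrative only; vm_bind_async() is a hypothetical helper, and only the syncobj call is real libdrm API:

    #include <stdint.h>
    #include <xf86drm.h>

    /* Hypothetical helper: queue an immediate bind with no in-fence and ask
     * the kernel to signal 'timeline' at 'point' once the mapping is in place. */
    int vm_bind_async(int drm_fd, uint32_t vm_id, uint32_t handle,
                      uint64_t addr, uint64_t size,
                      uint32_t timeline, uint64_t point);

    struct device_bind_state {
        int      drm_fd;
        uint32_t bind_timeline;   /* one syncobj per VkDevice */
        uint64_t last_bind_point; /* highest point queued so far */
    };

    static int device_bind_state_init(struct device_bind_state *s, int drm_fd)
    {
        s->drm_fd = drm_fd;
        s->last_bind_point = 0;
        return drmSyncobjCreate(drm_fd, 0, &s->bind_timeline);
    }

    /* Immediate (non-sparse) bind: never waits, only signals the next point. */
    static uint64_t immediate_bind(struct device_bind_state *s, uint32_t vm_id,
                                   uint32_t handle, uint64_t addr, uint64_t size)
    {
        uint64_t point = ++s->last_bind_point;

        vm_bind_async(s->drm_fd, vm_id, handle, addr, size,
                      s->bind_timeline, point);
        return point;
    }

    /* Every execbuf/submit then adds (bind_timeline, last_bind_point) to its
     * list of wait fences, so rendering never runs ahead of its mappings,
     * without ever making an immediate bind wait on a sparse one. */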
> >>>> > One more note while I'm here: danvet said something on IRC about VM_BIND queues waiting for syncobjs to materialize. We don't really want/need this. We already have all the machinery in userspace to handle wait-before-signal and waiting for syncobj fences to materialize, and that machinery is on by default. It would actually take MORE work in Mesa to turn it off and take advantage of the kernel being able to wait for syncobjs to materialize. Also, getting that right is ridiculously hard and I really don't want to get it wrong in kernel space. When we do memory fences, wait-before-signal will be a thing. We don't need to try and make it a thing for syncobj.
> >>>> > --Jason
> >>>> >
> >>>> > Thanks Jason,
> >>>> >
> >>>> > I missed the bit in the Vulkan spec that we're allowed to have a sparse queue that does not implement either graphics or compute operations:
> >>>> >
> >>>> > "While some implementations may include VK_QUEUE_SPARSE_BINDING_BIT support in queue families that also include graphics and compute support, other implementations may only expose a VK_QUEUE_SPARSE_BINDING_BIT-only queue family."
> >>>> >
> >>>> > So it can all be a vm_bind engine that just does bind/unbind operations.
> >>>> >
> >>>> > But yes, we need another engine for the immediate/non-sparse operations.
> >>>> >
> >>>> > -Lionel
> >>>> >
> >>>> > Daniel, any thoughts?
> >>>> >
> >>>> > Niranjana
> >>>> >
> >>>> > >Matt
> >>>> > >
> >>>> > >> Sorry I noticed this late.
> >>>> > >>
> >>>> > >> -Lionel
On Wed, Jun 8, 2022 at 4:44 PM Niranjana Vishwanathapura <niranjana.vishwanathapura@i= ntel.com> wrote:
On Wed, Jun 08, 2022 at 08:33:25AM +0100, Tvrtko Ursulin wrote:
>
>
>On 07/06/2022 22:32, Niranjana Vishwanathapura wrote:
>>On Tue, Jun 07, 2022 at 11:18:11AM -0700, Niranjana Vishwanathapura= wrote:
>>>On Tue, Jun 07, 2022 at 12:12:03PM -0500, Jason Ekstrand wrote:=
>>>>=C2=A0On Fri, Jun 3, 2022 at 6:52 PM Niranjana Vishwanathap= ura
>>>>=C2=A0<niranjana.vishwanathapura@intel.com> wrote:<= br> >>>>
>>>>=C2=A0=C2=A0 On Fri, Jun 03, 2022 at 10:20:25AM +0300, Lion= el Landwerlin wrote:
>>>>=C2=A0=C2=A0 >=C2=A0=C2=A0 On 02/06/2022 23:35, Jason Ek= strand wrote:
>>>>=C2=A0=C2=A0 >
>>>>=C2=A0=C2=A0 >=C2=A0=C2=A0=C2=A0=C2=A0 On Thu, Jun 2, 20= 22 at 3:11 PM Niranjana Vishwanathapura
>>>>=C2=A0=C2=A0 >=C2=A0=C2=A0=C2=A0=C2=A0 <niranjana.vishw= anathapura@intel.com> wrote:
>>>>=C2=A0=C2=A0 >
>>>>=C2=A0=C2=A0 >=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 On We= d, Jun 01, 2022 at 01:28:36PM -0700, Matthew
>>>>Brost wrote:
>>>>=C2=A0=C2=A0 >=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 >O= n Wed, Jun 01, 2022 at 05:25:49PM +0300, Lionel Landwerlin
>>>>=C2=A0=C2=A0 wrote:
>>>>=C2=A0=C2=A0 >=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 >&= gt; On 17/05/2022 21:32, Niranjana Vishwanathapura wrote:
>>>>=C2=A0=C2=A0 >=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 >&= gt; > +VM_BIND/UNBIND ioctl will immediately start
>>>>=C2=A0=C2=A0 binding/unbinding
>>>>=C2=A0=C2=A0 >=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 the m= apping in an
>>>>=C2=A0=C2=A0 >=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 >&= gt; > +async worker. The binding and unbinding will
>>>>work like a
>>>>=C2=A0=C2=A0 special
>>>>=C2=A0=C2=A0 >=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 GPU e= ngine.
>>>>=C2=A0=C2=A0 >=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 >&= gt; > +The binding and unbinding operations are serialized and
>>>>=C2=A0=C2=A0 will
>>>>=C2=A0=C2=A0 >=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 wait = on specified
>>>>=C2=A0=C2=A0 >=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 >&= gt; > +input fences before the operation and will signal the
>>>>=C2=A0=C2=A0 output
>>>>=C2=A0=C2=A0 >=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 fence= s upon the
>>>>=C2=A0=C2=A0 >=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 >&= gt; > +completion of the operation. Due to serialization,
>>>>=C2=A0=C2=A0 completion of
>>>>=C2=A0=C2=A0 >=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 an op= eration
>>>>=C2=A0=C2=A0 >=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 >&= gt; > +will also indicate that all previous operations
>>>>are also
>>>>=C2=A0=C2=A0 >=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 compl= ete.
>>>>=C2=A0=C2=A0 >=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 >&= gt;
>>>>=C2=A0=C2=A0 >=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 >&= gt; I guess we should avoid saying "will immediately start
>>>>=C2=A0=C2=A0 >=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 bindi= ng/unbinding" if
>>>>=C2=A0=C2=A0 >=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 >&= gt; there are fences involved.
>>>>=C2=A0=C2=A0 >=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 >&= gt;
>>>>=C2=A0=C2=A0 >=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 >&= gt; And the fact that it's happening in an async
>>>>worker seem to
>>>>=C2=A0=C2=A0 imply
>>>>=C2=A0=C2=A0 >=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 it= 9;s not
>>>>=C2=A0=C2=A0 >=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 >&= gt; immediate.
>>>>=C2=A0=C2=A0 >=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 >&= gt;
>>>>=C2=A0=C2=A0 >
>>>>=C2=A0=C2=A0 >=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 Ok, w= ill fix.
>>>>=C2=A0=C2=A0 >=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 This = was added because in earlier design binding was deferred
>>>>=C2=A0=C2=A0 until
>>>>=C2=A0=C2=A0 >=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 next = execbuff.
>>>>=C2=A0=C2=A0 >=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 But n= ow it is non-deferred (immediate in that sense).
>>>>But yah,
>>>>=C2=A0=C2=A0 this is
>>>>=C2=A0=C2=A0 >=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 confu= sing
>>>>=C2=A0=C2=A0 >=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 and w= ill fix it.
>>>>=C2=A0=C2=A0 >
>>>>=C2=A0=C2=A0 >=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 >&= gt;
>>>>=C2=A0=C2=A0 >=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 >&= gt; I have a question on the behavior of the bind
>>>>operation when
>>>>=C2=A0=C2=A0 no
>>>>=C2=A0=C2=A0 >=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 input= fence
>>>>=C2=A0=C2=A0 >=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 >&= gt; is provided. Let say I do :
>>>>=C2=A0=C2=A0 >=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 >&= gt;
>>>>=C2=A0=C2=A0 >=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 >&= gt; VM_BIND (out_fence=3Dfence1)
>>>>=C2=A0=C2=A0 >=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 >&= gt;
>>>>=C2=A0=C2=A0 >=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 >&= gt; VM_BIND (out_fence=3Dfence2)
>>>>=C2=A0=C2=A0 >=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 >&= gt;
>>>>=C2=A0=C2=A0 >=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 >&= gt; VM_BIND (out_fence=3Dfence3)
>>>>=C2=A0=C2=A0 >=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 >&= gt;
>>>>=C2=A0=C2=A0 >=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 >&= gt;
>>>>=C2=A0=C2=A0 >=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 >&= gt; In what order are the fences going to be signaled?
>>>>=C2=A0=C2=A0 >=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 >&= gt;
>>>>=C2=A0=C2=A0 >=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 >&= gt; In the order of VM_BIND ioctls? Or out of order?
>>>>=C2=A0=C2=A0 >=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 >&= gt;
>>>>=C2=A0=C2=A0 >=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 >&= gt; Because you wrote "serialized I assume it's : in order
>>>>=C2=A0=C2=A0 >=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 >&= gt;
>>>>=C2=A0=C2=A0 >
>>>>=C2=A0=C2=A0 >=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 Yes, = in the order of VM_BIND/UNBIND ioctls. Note that
>>>>bind and
>>>>=C2=A0=C2=A0 unbind
>>>>=C2=A0=C2=A0 >=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 will = use
>>>>=C2=A0=C2=A0 >=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 the s= ame queue and hence are ordered.
>>>>=C2=A0=C2=A0 >
>>>>=C2=A0=C2=A0 >=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 >&= gt;
>>>>=C2=A0=C2=A0 >=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 >&= gt; One thing I didn't realize is that because we only get one
>>>>=C2=A0=C2=A0 >=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 "= ;VM_BIND" engine,
>>>>=C2=A0=C2=A0 >=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 >&= gt; there is a disconnect from the Vulkan specification.
>>>>=C2=A0=C2=A0 >=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 >&= gt;
>>>>=C2=A0=C2=A0 >=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 >&= gt; In Vulkan VM_BIND operations are serialized but
>>>>per engine.
>>>>=C2=A0=C2=A0 >=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 >&= gt;
>>>>=C2=A0=C2=A0 >=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 >&= gt; So you could have something like this :
>>>>=C2=A0=C2=A0 >=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 >&= gt;
>>>>=C2=A0=C2=A0 >=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 >&= gt; VM_BIND (engine=3Drcs0, in_fence=3Dfence1, out_fence=3Dfence2)
>>>>=C2=A0=C2=A0 >=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 >&= gt;
>>>>=C2=A0=C2=A0 >=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 >&= gt; VM_BIND (engine=3Dccs0, in_fence=3Dfence3, out_fence=3Dfence4)
>>>>=C2=A0=C2=A0 >=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 >&= gt;
>>>>=C2=A0=C2=A0 >=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 >&= gt;
>>>>=C2=A0=C2=A0 >=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 >&= gt; fence1 is not signaled
>>>>=C2=A0=C2=A0 >=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 >&= gt;
>>>>=C2=A0=C2=A0 >=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 >&= gt; fence3 is signaled
>>>>=C2=A0=C2=A0 >=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 >&= gt;
>>>>=C2=A0=C2=A0 >=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 >&= gt; So the second VM_BIND will proceed before the
>>>>first VM_BIND.
>>>>=C2=A0=C2=A0 >=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 >&= gt;
>>>>=C2=A0=C2=A0 >=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 >&= gt;
>>>>=C2=A0=C2=A0 >=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 >&= gt; I guess we can deal with that scenario in
>>>>userspace by doing
>>>>=C2=A0=C2=A0 the
>>>>=C2=A0=C2=A0 >=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 wait<= br> >>>>=C2=A0=C2=A0 >=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 >&= gt; ourselves in one thread per engines.
>>>>=C2=A0=C2=A0 >=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 >&= gt;
>>>>=C2=A0=C2=A0 >=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 >&= gt; But then it makes the VM_BIND input fences useless.
>>>>=C2=A0=C2=A0 >=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 >&= gt;
>>>>=C2=A0=C2=A0 >=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 >&= gt;
>>>>=C2=A0=C2=A0 >=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 >&= gt; Daniel : what do you think? Should be rework this or just
>>>>=C2=A0=C2=A0 deal with
>>>>=C2=A0=C2=A0 >=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 wait<= br> >>>>=C2=A0=C2=A0 >=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 >&= gt; fences in userspace?
>>>>=C2=A0=C2=A0 >=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 >&= gt;
>>>>=C2=A0=C2=A0 >=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 ><= br> >>>>=C2=A0=C2=A0 >=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 >M= y opinion is rework this but make the ordering via
>>>>an engine
>>>>=C2=A0=C2=A0 param
>>>>=C2=A0=C2=A0 >=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 optio= nal.
>>>>=C2=A0=C2=A0 >=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 ><= br> >>>>=C2=A0=C2=A0 >=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 >e= .g. A VM can be configured so all binds are ordered
>>>>within the
>>>>=C2=A0=C2=A0 VM
>>>>=C2=A0=C2=A0 >=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 ><= br> >>>>=C2=A0=C2=A0 >=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 >e= .g. A VM can be configured so all binds accept an engine
>>>>=C2=A0=C2=A0 argument
>>>>=C2=A0=C2=A0 >=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 (in >>>>=C2=A0=C2=A0 >=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 >t= he case of the i915 likely this is a gem context
>>>>handle) and
>>>>=C2=A0=C2=A0 binds
>>>>=C2=A0=C2=A0 >=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 >o= rdered with respect to that engine.
>>>>=C2=A0=C2=A0 >=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 ><= br> >>>>=C2=A0=C2=A0 >=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 >T= his gives UMDs options as the later likely consumes
>>>>more KMD
>>>>=C2=A0=C2=A0 >=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 resou= rces
>>>>=C2=A0=C2=A0 >=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 >s= o if a different UMD can live with binds being
>>>>ordered within
>>>>=C2=A0=C2=A0 the VM
>>>>=C2=A0=C2=A0 >=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 >t= hey can use a mode consuming less resources.
>>>>=C2=A0=C2=A0 >=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 ><= br> >>>>=C2=A0=C2=A0 >
>>>>=C2=A0=C2=A0 >=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 I thi= nk we need to be careful here if we are looking for some
>>>>=C2=A0=C2=A0 out of
>>>>=C2=A0=C2=A0 >=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 (subm= ission) order completion of vm_bind/unbind.
>>>>=C2=A0=C2=A0 >=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 In-or= der completion means, in a batch of binds and
>>>>unbinds to be
>>>>=C2=A0=C2=A0 >=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 compl= eted in-order, user only needs to specify
>>>>in-fence for the
>>>>=C2=A0=C2=A0 >=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 first= bind/unbind call and the our-fence for the last
>>>>=C2=A0=C2=A0 bind/unbind
>>>>=C2=A0=C2=A0 >=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 call.= Also, the VA released by an unbind call can be
>>>>re-used by
>>>>=C2=A0=C2=A0 >=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 any s= ubsequent bind call in that in-order batch.
>>>>=C2=A0=C2=A0 >
>>>>=C2=A0=C2=A0 >=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 These= things will break if binding/unbinding were to
>>>>be allowed
>>>>=C2=A0=C2=A0 to
>>>>=C2=A0=C2=A0 >=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 go ou= t of order (of submission) and user need to be extra
>>>>=C2=A0=C2=A0 careful
>>>>=C2=A0=C2=A0 >=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 not t= o run into pre-mature triggereing of out-fence and bind
>>>>=C2=A0=C2=A0 failing
>>>>=C2=A0=C2=A0 >=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 as VA= is still in use etc.
>>>>=C2=A0=C2=A0 >
>>>>=C2=A0=C2=A0 >=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 Also,= VM_BIND binds the provided mapping on the specified
>>>>=C2=A0=C2=A0 address
>>>>=C2=A0=C2=A0 >=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 space=
>>>>=C2=A0=C2=A0 >=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 (VM).= So, the uapi is not engine/context specific.
>>>>=C2=A0=C2=A0 >
>>>>=C2=A0=C2=A0 >=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 We ca= n however add a 'queue' to the uapi which can be
>>>>one from
>>>>=C2=A0=C2=A0 the
>>>>=C2=A0=C2=A0 >=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 pre-d= efined queues,
>>>>=C2=A0=C2=A0 >=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 I915_= VM_BIND_QUEUE_0
>>>>=C2=A0=C2=A0 >=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 I915_= VM_BIND_QUEUE_1
>>>>=C2=A0=C2=A0 >=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 ... >>>>=C2=A0=C2=A0 >=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 I915_= VM_BIND_QUEUE_(N-1)
>>>>=C2=A0=C2=A0 >
>>>>=C2=A0=C2=A0 >=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 KMD w= ill spawn an async work queue for each queue which will
>>>>=C2=A0=C2=A0 only
>>>>=C2=A0=C2=A0 >=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 bind = the mappings on that queue in the order of submission.
>>>>=C2=A0=C2=A0 >=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 User = can assign the queue to per engine or anything
>>>>like that.
>>>>=C2=A0=C2=A0 >
>>>>=C2=A0=C2=A0 >=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 But a= gain here, user need to be careful and not
>>>>deadlock these
>>>>=C2=A0=C2=A0 >=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 queue= s with circular dependency of fences.
>>>>=C2=A0=C2=A0 >
>>>>=C2=A0=C2=A0 >=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 I pre= fer adding this later an as extension based on
>>>>whether it
>>>>=C2=A0=C2=A0 >=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 is re= ally helping with the implementation.
>>>>=C2=A0=C2=A0 >
>>>>=C2=A0=C2=A0 >=C2=A0=C2=A0=C2=A0=C2=A0 I can tell you ri= ght now that having everything on a single
>>>>=C2=A0=C2=A0 in-order
>>>>=C2=A0=C2=A0 >=C2=A0=C2=A0=C2=A0=C2=A0 queue will not ge= t us the perf we want.=C2=A0 What vulkan
>>>>really wants
>>>>=C2=A0=C2=A0 is one
>>>>=C2=A0=C2=A0 >=C2=A0=C2=A0=C2=A0=C2=A0 of two things: >>>>=C2=A0=C2=A0 >=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 1. No impli= cit ordering of VM_BIND ops.=C2=A0 They just happen in
>>>>=C2=A0=C2=A0 whatever
>>>>=C2=A0=C2=A0 >=C2=A0=C2=A0=C2=A0=C2=A0 their dependencie= s are resolved and we ensure ordering
>>>>ourselves
>>>>=C2=A0=C2=A0 by
>>>>=C2=A0=C2=A0 >=C2=A0=C2=A0=C2=A0=C2=A0 having a syncobj = in the VkQueue.
>>>>=C2=A0=C2=A0 >=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 2. The abil= ity to create multiple VM_BIND queues.=C2=A0 We need at
>>>>=C2=A0=C2=A0 least 2
>>>>=C2=A0=C2=A0 >=C2=A0=C2=A0=C2=A0=C2=A0 but I don't s= ee why there needs to be a limit besides
>>>>the limits
>>>>=C2=A0=C2=A0 the
>>>>=C2=A0=C2=A0 >=C2=A0=C2=A0=C2=A0=C2=A0 i915 API already = has on the number of engines.=C2=A0 Vulkan could
>>>>=C2=A0=C2=A0 expose
>>>>=C2=A0=C2=A0 >=C2=A0=C2=A0=C2=A0=C2=A0 multiple sparse b= inding queues to the client if it's not
>>>>=C2=A0=C2=A0 arbitrarily
>>>>=C2=A0=C2=A0 >=C2=A0=C2=A0=C2=A0=C2=A0 limited.
>>>>
>>>>=C2=A0=C2=A0 Thanks Jason, Lionel.
>>>>
>>>>=C2=A0=C2=A0 Jason, what are you referring to when you say = "limits the i915 API
>>>>=C2=A0=C2=A0 already
>>>>=C2=A0=C2=A0 has on the number of engines"? I am not s= ure if there is such an uapi
>>>>=C2=A0=C2=A0 today.
>>>>
>>>>=C2=A0There's a limit of something like 64 total engine= s today based on the
>>>>=C2=A0number of bits we can cram into the exec flags in exe= cbuffer2.=C2=A0 I think
>>>>=C2=A0someone had an extended version that allowed more but= I ripped it out
>>>>=C2=A0because no one was using it.=C2=A0 Of course, execbuf= fer3 might not
>>>>have that
>>>>=C2=A0problem at all.
>>>>
>>>
>>>Thanks Jason.
>>>Ok, I am not sure which exec flag is that, but yah, execbuffer3= probably
>>>will not have this limiation. So, we need to define a VM_BIND_M= AX_QUEUE
>>>and somehow export it to user (I am thinking of embedding it in=
>>>I915_PARAM_HAS_VM_BIND. bits[0]->HAS_VM_BIND, bits[1-3]->= 'n' meaning 2^n
>>>queues.
>>
>>Ah, I think you are waking about I915_EXEC_RING_MASK (0x3f) which e= xecbuf3

Yup!=C2=A0 That's exactly t= he limit I was talking about.
=C2=A0
>>will also have. So, we can simply define in vm_bind/unbind structur= es,
>>
>>#define I915_VM_BIND_MAX_QUEUE=C2=A0=C2=A0 64
>> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 __u32 queue;
>>
>>I think that will keep things simple.
>
>Hmmm? What does execbuf2 limit has to do with how many engines
>hardware can have? I suggest not to do that.
>
>Change with added this:
>
>=C2=A0 =C2=A0 =C2=A0 =C2=A0if (set.num_engines > I915_EXEC_RING_MASK= + 1)
>=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0return -EINVAL;<= br> >
>To context creation needs to be undone and so let users create engine <= br> >maps with all hardware engines, and let execbuf3 access them all.
>

Earlier plan was to carry I915_EXEC_RING_MAP (0x3f) to execbuff3 also.
Hence, I was using the same limit for VM_BIND queues (64, or 65 if we
make it N+1).
But, as discussed in other thread of this RFC series, we are planning
to drop this I915_EXEC_RING_MAP in execbuff3. So, there won't be
any uapi that limits the number of engines (and hence the vm_bind queues need to be supported).

If we leave the number of vm_bind queues to be arbitrarily large
(__u32 queue_idx) then, we need to have a hashmap for queue (a wq,
work_item and a linked list) lookup from the user specified queue index. Other option is to just put some hard limit (say 64 or 65) and use
an array of queues in VM (each created upon first use). I prefer this.
<= /blockquote>

I don't get why a VM_BIND queue is any = different from any other queue or userspace-visible kernel object.=C2=A0 Bu= t I'll leave those details up to danvet or whoever else might be review= ing the implementation.

--Jason

=C2=A0

Niranjana

>Regards,
>
>Tvrtko
>
>>
>>Niranjana
>>
>>>
>>>>=C2=A0=C2=A0 I am trying to see how many queues we need and= don't want it to be
>>>>=C2=A0=C2=A0 arbitrarily
>>>>=C2=A0=C2=A0 large and unduely blow up memory usage and com= plexity in i915 driver.
>>>>
>>>>=C2=A0I expect a Vulkan driver to use at most 2 in the vast= majority
>>>>of cases. I
>>>>=C2=A0could imagine a client wanting to create more than 1 = sparse
>>>>queue in which
>>>>=C2=A0case, it'll be N+1 but that's unlikely.=C2=A0= As far as complexity
>>>>goes, once
>>>>=C2=A0you allow two, I don't think the complexity is go= ing up by
>>>>allowing N.=C2=A0 As
>>>>=C2=A0for memory usage, creating more queues means more mem= ory.=C2=A0 That's a
>>>>=C2=A0trade-off that userspace can make.=C2=A0 Again, the e= xpected number
>>>>here is 1
>>>>=C2=A0or 2 in the vast majority of cases so I don't thi= nk you need to worry.
>>>
>>>Ok, will start with n=3D3 meaning 8 queues.
>>>That would require us create 8 workqueues.
>>>We can change 'n' later if required.
>>>
>>>Niranjana
>>>
>>>>
>>>>=C2=A0=C2=A0 >=C2=A0=C2=A0=C2=A0=C2=A0 Why?=C2=A0 Becaus= e Vulkan has two basic kind of bind
>>>>operations and we
>>>>=C2=A0=C2=A0 don't
>>>>=C2=A0=C2=A0 >=C2=A0=C2=A0=C2=A0=C2=A0 want any dependen= cies between them:
>>>>=C2=A0=C2=A0 >=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 1. Immediat= e.=C2=A0 These happen right after BO creation or
>>>>maybe as
>>>>=C2=A0=C2=A0 part of
>>>>=C2=A0=C2=A0 >=C2=A0=C2=A0=C2=A0=C2=A0 vkBindImageMemory= () or VkBindBufferMemory().=C2=A0 These
>>>>don't happen
>>>>=C2=A0=C2=A0 on a
>>>>=C2=A0=C2=A0 >=C2=A0=C2=A0=C2=A0=C2=A0 queue and we don&= #39;t want them serialized with anything.=C2=A0 To
>>>>=C2=A0=C2=A0 synchronize
>>>>=C2=A0=C2=A0 >=C2=A0=C2=A0=C2=A0=C2=A0 with submit, we&#= 39;ll have a syncobj in the VkDevice which is
>>>>=C2=A0=C2=A0 signaled by
>>>>=C2=A0=C2=A0 >=C2=A0=C2=A0=C2=A0=C2=A0 all immediate bin= d operations and make submits wait on it.
>>>>=C2=A0=C2=A0 >=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 2. Queued (= sparse): These happen on a VkQueue which may be the
>>>>=C2=A0=C2=A0 same as
>>>>=C2=A0=C2=A0 >=C2=A0=C2=A0=C2=A0=C2=A0 a render/compute = queue or may be its own queue.=C2=A0 It's up to us
>>>>=C2=A0=C2=A0 what we
>>>>=C2=A0=C2=A0 >=C2=A0=C2=A0=C2=A0=C2=A0 want to advertise= .=C2=A0 From the Vulkan API PoV, this is like any
>>>>=C2=A0=C2=A0 other
>>>>=C2=A0=C2=A0 >=C2=A0=C2=A0=C2=A0=C2=A0 queue.=C2=A0 Oper= ations on it wait on and signal semaphores.=C2=A0 If we
>>>>=C2=A0=C2=A0 have a
>>>>=C2=A0=C2=A0 >=C2=A0=C2=A0=C2=A0=C2=A0 VM_BIND engine, w= e'd provide syncobjs to wait and
>>>>signal just like
>>>>=C2=A0=C2=A0 we do
>>>>=C2=A0=C2=A0 >=C2=A0=C2=A0=C2=A0=C2=A0 in execbuf().
>>>>=C2=A0=C2=A0 >=C2=A0=C2=A0=C2=A0=C2=A0 The important thi= ng is that we don't want one type of
>>>>operation to
>>>>=C2=A0=C2=A0 block
>>>>=C2=A0=C2=A0 >=C2=A0=C2=A0=C2=A0=C2=A0 on the other.=C2= =A0 If immediate binds are blocking on sparse binds,
>>>>=C2=A0=C2=A0 it's
>>>>=C2=A0=C2=A0 >=C2=A0=C2=A0=C2=A0=C2=A0 going to cause ov= er-synchronization issues.
>>>>=C2=A0=C2=A0 >=C2=A0=C2=A0=C2=A0=C2=A0 In terms of the i= nternal implementation, I know that
>>>>there's going
>>>>=C2=A0=C2=A0 to be
>>>>=C2=A0=C2=A0 >=C2=A0=C2=A0=C2=A0=C2=A0 a lock on the VM = and that we can't actually do these things in
>>>>=C2=A0=C2=A0 >=C2=A0=C2=A0=C2=A0=C2=A0 parallel.=C2=A0 T= hat's fine.=C2=A0 Once the dma_fences have signaled and
>>>>=C2=A0=C2=A0 we're
>>>>
>>>>=C2=A0=C2=A0 Thats correct. It is like a single VM_BIND eng= ine with
>>>>multiple queues
>>>>=C2=A0=C2=A0 feeding to it.
>>>>
>>>>=C2=A0Right.=C2=A0 As long as the queues themselves are ind= ependent and
>>>>can block on
>>>>=C2=A0dma_fences without holding up other queues, I think w= e're fine.
>>>>
>>>>=C2=A0=C2=A0 >=C2=A0=C2=A0=C2=A0=C2=A0 unblocked to do t= he bind operation, I don't care if
>>>>there's a bit
>>>>=C2=A0=C2=A0 of
>>>>=C2=A0=C2=A0 >=C2=A0=C2=A0=C2=A0=C2=A0 synchronization d= ue to locking.=C2=A0 That's expected.=C2=A0 What
>>>>we can't
>>>>=C2=A0=C2=A0 afford
>>>>=C2=A0=C2=A0 >=C2=A0=C2=A0=C2=A0=C2=A0 to have is an imm= ediate bind operation suddenly blocking on a
>>>>=C2=A0=C2=A0 sparse
>>>>=C2=A0=C2=A0 >=C2=A0=C2=A0=C2=A0=C2=A0 operation which i= s blocked on a compute job that's going to run
>>>>=C2=A0=C2=A0 for
>>>>=C2=A0=C2=A0 >=C2=A0=C2=A0=C2=A0=C2=A0 another 5ms.
>>>>
>>>>=C2=A0=C2=A0 As the VM_BIND queue is per VM, VM_BIND on one= VM doesn't block the
>>>>=C2=A0=C2=A0 VM_BIND
>>>>=C2=A0=C2=A0 on other VMs. I am not sure about usecases her= e, but just wanted to
>>>>=C2=A0=C2=A0 clarify.
>>>>
>>>>=C2=A0Yes, that's what I would expect.
>>>>=C2=A0--Jason
>>>>
>>>>=C2=A0=C2=A0 Niranjana
>>>>
>>>>=C2=A0=C2=A0 >=C2=A0=C2=A0=C2=A0=C2=A0 For reference, Wi= ndows solves this by allowing arbitrarily many
>>>>=C2=A0=C2=A0 paging
>>>>=C2=A0=C2=A0 >=C2=A0=C2=A0=C2=A0=C2=A0 queues (what they= call a VM_BIND engine/queue).=C2=A0 That
>>>>design works
>>>>=C2=A0=C2=A0 >=C2=A0=C2=A0=C2=A0=C2=A0 pretty well and s= olves the problems in question.=C2=A0
>>>>Again, we could
>>>>=C2=A0=C2=A0 just
>>>>=C2=A0=C2=A0 >=C2=A0=C2=A0=C2=A0=C2=A0 make everything o= ut-of-order and require using syncobjs
>>>>to order
>>>>=C2=A0=C2=A0 things
>>>>=C2=A0=C2=A0 >=C2=A0=C2=A0=C2=A0=C2=A0 as userspace want= s. That'd be fine too.
>>>>=C2=A0=C2=A0 >=C2=A0=C2=A0=C2=A0=C2=A0 One more note whi= le I'm here: danvet said something on
>>>>IRC about
>>>>=C2=A0=C2=A0 VM_BIND
>>>>=C2=A0=C2=A0 >=C2=A0=C2=A0=C2=A0=C2=A0 queues waiting fo= r syncobjs to materialize.=C2=A0 We don't really
>>>>=C2=A0=C2=A0 want/need
>>>>=C2=A0=C2=A0 >=C2=A0=C2=A0=C2=A0=C2=A0 this.=C2=A0 We al= ready have all the machinery in userspace to handle
>>>>=C2=A0=C2=A0 >=C2=A0=C2=A0=C2=A0=C2=A0 wait-before-signa= l and waiting for syncobj fences to
>>>>materialize
>>>>=C2=A0=C2=A0 and
>>>>=C2=A0=C2=A0 >=C2=A0=C2=A0=C2=A0=C2=A0 that machinery is= on by default.=C2=A0 It would actually
>>>>take MORE work
>>>>=C2=A0=C2=A0 in
>>>>=C2=A0=C2=A0 >=C2=A0=C2=A0=C2=A0=C2=A0 Mesa to turn it o= ff and take advantage of the kernel
>>>>being able to
>>>>=C2=A0=C2=A0 wait
>>>>=C2=A0=C2=A0 >=C2=A0=C2=A0=C2=A0=C2=A0 for syncobjs to m= aterialize.=C2=A0 Also, getting that right is
>>>>=C2=A0=C2=A0 ridiculously
>>>>=C2=A0=C2=A0 >=C2=A0=C2=A0=C2=A0=C2=A0 hard and I really= don't want to get it wrong in kernel
>>>>space.=C2=A0=C2=A0=C2=A0=C2=A0 When we
>>>>=C2=A0=C2=A0 >=C2=A0=C2=A0=C2=A0=C2=A0 do memory fences,= wait-before-signal will be a thing.=C2=A0 We don't
>>>>=C2=A0=C2=A0 need to
>>>>=C2=A0=C2=A0 >=C2=A0=C2=A0=C2=A0=C2=A0 try and make it a= thing for syncobj.
>>>>=C2=A0=C2=A0 >=C2=A0=C2=A0=C2=A0=C2=A0 --Jason
>>>>=C2=A0=C2=A0 >
>>>>=C2=A0=C2=A0 >=C2=A0=C2=A0 Thanks Jason,
>>>>=C2=A0=C2=A0 >
>>>>=C2=A0=C2=A0 >=C2=A0=C2=A0 I missed the bit in the Vulka= n spec that we're allowed to have a
>>>>=C2=A0=C2=A0 sparse
>>>>=C2=A0=C2=A0 >=C2=A0=C2=A0 queue that does not implement= either graphics or compute
>>>>operations
>>>>=C2=A0=C2=A0 :
>>>>=C2=A0=C2=A0 >
>>>>=C2=A0=C2=A0 >=C2=A0=C2=A0=C2=A0=C2=A0 "While some = implementations may include
>>>>=C2=A0=C2=A0 VK_QUEUE_SPARSE_BINDING_BIT
>>>>=C2=A0=C2=A0 >=C2=A0=C2=A0=C2=A0=C2=A0 support in queue = families that also include
>>>>=C2=A0=C2=A0 >
>>>>=C2=A0=C2=A0 >=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 graphics an= d compute support, other implementations may only
>>>>=C2=A0=C2=A0 expose a
>>>>=C2=A0=C2=A0 >=C2=A0=C2=A0=C2=A0=C2=A0 VK_QUEUE_SPARSE_B= INDING_BIT-only queue
>>>>=C2=A0=C2=A0 >
>>>>=C2=A0=C2=A0 >=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 family.&quo= t;
>>>>=C2=A0=C2=A0 >
>>>>=C2=A0=C2=A0 >=C2=A0=C2=A0 So it can all be all a vm_bin= d engine that just does bind/unbind
>>>>=C2=A0=C2=A0 >=C2=A0=C2=A0 operations.
>>>>=C2=A0=C2=A0 >
>>>>=C2=A0=C2=A0 >=C2=A0=C2=A0 But yes we need another engin= e for the immediate/non-sparse
>>>>=C2=A0=C2=A0 operations.
>>>>=C2=A0=C2=A0 >
>>>>=C2=A0=C2=A0 >=C2=A0=C2=A0 -Lionel
>>>>=C2=A0=C2=A0 >
>>>>=C2=A0=C2=A0 >=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0= =C2=A0 >
>>>>=C2=A0=C2=A0 >=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 Danie= l, any thoughts?
>>>>=C2=A0=C2=A0 >
>>>>=C2=A0=C2=A0 >=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 Niran= jana
>>>>=C2=A0=C2=A0 >
>>>>=C2=A0=C2=A0 >=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 >M= att
>>>>=C2=A0=C2=A0 >=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 ><= br> >>>>=C2=A0=C2=A0 >=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 >&= gt;
>>>>=C2=A0=C2=A0 >=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 >&= gt; Sorry I noticed this late.
>>>>=C2=A0=C2=A0 >=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 >&= gt;
>>>>=C2=A0=C2=A0 >=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 >&= gt;
>>>>=C2=A0=C2=A0 >=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 >&= gt; -Lionel
>>>>=C2=A0=C2=A0 >=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 >&= gt;
>>>>=C2=A0=C2=A0 >=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 >&= gt;
--000000000000a2837d05e0f6c4c8-- From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from gabe.freedesktop.org (gabe.freedesktop.org [131.252.210.177]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.lore.kernel.org (Postfix) with ESMTPS id 1C33CC43334 for ; Wed, 8 Jun 2022 21:55:52 +0000 (UTC) Received: from gabe.freedesktop.org (localhost [127.0.0.1]) by gabe.freedesktop.org (Postfix) with ESMTP id 714EA10FD51; Wed, 8 Jun 2022 21:55:51 +0000 (UTC) Received: from mail-yb1-xb33.google.com (mail-yb1-xb33.google.com [IPv6:2607:f8b0:4864:20::b33]) by gabe.freedesktop.org (Postfix) with ESMTPS id 3E2B110FE8A for ; Wed, 8 Jun 2022 21:55:50 +0000 (UTC) Received: by mail-yb1-xb33.google.com with SMTP id l204so38739587ybf.10 for ; Wed, 08 Jun 2022 14:55:50 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=jlekstrand-net.20210112.gappssmtp.com; s=20210112; h=mime-version:references:in-reply-to:from:date:message-id:subject:to :cc; bh=cJVBOdmsN6f3z8ci8C2aFiGa9xLfEjM1kOXTi5SCP8M=; b=h5vGYooJ/IBbiuKXHtrhlj87nQdK51DG8ynxjGQc977YC+mqXPVA5ilt6n76vS+g5F Gm4QU0RbalcuJ/vtlqcQv+0umXet+vqQcgJP2ZSB065rqRukCbZeXDwyEnWyk9fVsB/2 NLySFhoO1c805Fdf6Uoq7SNs0POZDRZh4/9LqcESHdmicgGhAoQmjWRGDFtNvBB8hJeN YTG6EP2S0q94qSjGBxjgdsunl3VRMyCNcN10yPusjfqIFvgjqFbjAzWnbVGhKG8gOCbO CJ/oidvbb0XffGLatE3Sbk2cG1EWIYeeEoh0G+sRoNctSIJ5r6BMw3McuH+ONuq84oFB ukGA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=x-gm-message-state:mime-version:references:in-reply-to:from:date :message-id:subject:to:cc; bh=cJVBOdmsN6f3z8ci8C2aFiGa9xLfEjM1kOXTi5SCP8M=; b=m0xcizJhARFUHE9ZHghiy2IuHV1O6QM20IZXRsbinqukhnRsyElGbL5gWorvStZ4iD LSoUyM6pSUmLNaTrPF03guV/o144Btn/uSfUV2KS2ZTcqOQ6VB/2D3EzkaUVaN7T5qNV WflMcKrOoa0mDuzIL/tkrWUSh7ufY0QKWzwNippf0e/S4EbGMjwoPiDRQzrPwxZIG2zL +Q7dDg+T/91D948I3Nxb5Qwv3IRluJu/Oxrls1Z3KdX5IwgGIyj5VqsfL6XwBsKigGQi RXGhu9rLWXgiQKPbGuaV576tOLEiigekQK97WOLH8V6moZh8ruDMmmkC3fWqEU97RYFy weRQ== X-Gm-Message-State: AOAM533XZOiiA1EfbYH1eIApv0ZNcVGlCKzRvvtQrJFDxEd7nQB9MN+U UvZcCsxCRtKaIP7z3gFRMjpShS76erEtZmxbR8wRqA== X-Google-Smtp-Source: ABdhPJxWtKUlp0jKuHj1Wf9N+Jci9TpYeHa2mgj0BmOlwyNkmoDjC1cSOVClSeDpvBaeqoFvgQTte0oBgr8TDC37M74= X-Received: by 2002:a25:910e:0:b0:65c:caa7:52b6 with SMTP id v14-20020a25910e000000b0065ccaa752b6mr35178793ybl.577.1654725349109; Wed, 08 Jun 2022 14:55:49 -0700 (PDT) MIME-Version: 1.0 References: <43746609-4f60-f347-5934-6680516297dd@intel.com> <20220601202836.GA15346@jons-linux-dev-box> <20220602201112.GQ4461@nvishwa1-DESK> <20220603235148.GU4461@nvishwa1-DESK> <20220607181810.GV4461@nvishwa1-DESK> <20220607213209.GY4461@nvishwa1-DESK> <4be022cc-518e-49e1-96bd-b9720a313401@linux.intel.com> <20220608214431.GD4461@nvishwa1-DESK> In-Reply-To: <20220608214431.GD4461@nvishwa1-DESK> From: Jason Ekstrand Date: Wed, 8 Jun 2022 16:55:38 -0500 Message-ID: To: Niranjana Vishwanathapura Content-Type: multipart/alternative; boundary="000000000000a2837d05e0f6c4c8" Subject: Re: [Intel-gfx] [RFC v3 1/3] drm/doc/rfc: VM_BIND feature design document X-BeenThere: intel-gfx@lists.freedesktop.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Intel graphics driver community testing & development List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Cc: Intel GFX , Chris Wilson , Thomas Hellstrom , Maling list - DRI developers , Daniel Vetter , =?UTF-8?Q?Christian_K=C3=B6nig?= 
Errors-To: intel-gfx-bounces@lists.freedesktop.org Sender: "Intel-gfx" --000000000000a2837d05e0f6c4c8 Content-Type: text/plain; charset="UTF-8" On Wed, Jun 8, 2022 at 4:44 PM Niranjana Vishwanathapura < niranjana.vishwanathapura@intel.com> wrote: > On Wed, Jun 08, 2022 at 08:33:25AM +0100, Tvrtko Ursulin wrote: > > > > > >On 07/06/2022 22:32, Niranjana Vishwanathapura wrote: > >>On Tue, Jun 07, 2022 at 11:18:11AM -0700, Niranjana Vishwanathapura > wrote: > >>>On Tue, Jun 07, 2022 at 12:12:03PM -0500, Jason Ekstrand wrote: > >>>> On Fri, Jun 3, 2022 at 6:52 PM Niranjana Vishwanathapura > >>>> wrote: > >>>> > >>>> On Fri, Jun 03, 2022 at 10:20:25AM +0300, Lionel Landwerlin wrote: > >>>> > On 02/06/2022 23:35, Jason Ekstrand wrote: > >>>> > > >>>> > On Thu, Jun 2, 2022 at 3:11 PM Niranjana Vishwanathapura > >>>> > wrote: > >>>> > > >>>> > On Wed, Jun 01, 2022 at 01:28:36PM -0700, Matthew > >>>>Brost wrote: > >>>> > >On Wed, Jun 01, 2022 at 05:25:49PM +0300, Lionel Landwerlin > >>>> wrote: > >>>> > >> On 17/05/2022 21:32, Niranjana Vishwanathapura wrote: > >>>> > >> > +VM_BIND/UNBIND ioctl will immediately start > >>>> binding/unbinding > >>>> > the mapping in an > >>>> > >> > +async worker. The binding and unbinding will > >>>>work like a > >>>> special > >>>> > GPU engine. > >>>> > >> > +The binding and unbinding operations are serialized > and > >>>> will > >>>> > wait on specified > >>>> > >> > +input fences before the operation and will signal the > >>>> output > >>>> > fences upon the > >>>> > >> > +completion of the operation. Due to serialization, > >>>> completion of > >>>> > an operation > >>>> > >> > +will also indicate that all previous operations > >>>>are also > >>>> > complete. > >>>> > >> > >>>> > >> I guess we should avoid saying "will immediately start > >>>> > binding/unbinding" if > >>>> > >> there are fences involved. > >>>> > >> > >>>> > >> And the fact that it's happening in an async > >>>>worker seem to > >>>> imply > >>>> > it's not > >>>> > >> immediate. > >>>> > >> > >>>> > > >>>> > Ok, will fix. > >>>> > This was added because in earlier design binding was > deferred > >>>> until > >>>> > next execbuff. > >>>> > But now it is non-deferred (immediate in that sense). > >>>>But yah, > >>>> this is > >>>> > confusing > >>>> > and will fix it. > >>>> > > >>>> > >> > >>>> > >> I have a question on the behavior of the bind > >>>>operation when > >>>> no > >>>> > input fence > >>>> > >> is provided. Let say I do : > >>>> > >> > >>>> > >> VM_BIND (out_fence=fence1) > >>>> > >> > >>>> > >> VM_BIND (out_fence=fence2) > >>>> > >> > >>>> > >> VM_BIND (out_fence=fence3) > >>>> > >> > >>>> > >> > >>>> > >> In what order are the fences going to be signaled? > >>>> > >> > >>>> > >> In the order of VM_BIND ioctls? Or out of order? > >>>> > >> > >>>> > >> Because you wrote "serialized I assume it's : in order > >>>> > >> > >>>> > > >>>> > Yes, in the order of VM_BIND/UNBIND ioctls. Note that > >>>>bind and > >>>> unbind > >>>> > will use > >>>> > the same queue and hence are ordered. > >>>> > > >>>> > >> > >>>> > >> One thing I didn't realize is that because we only get > one > >>>> > "VM_BIND" engine, > >>>> > >> there is a disconnect from the Vulkan specification. > >>>> > >> > >>>> > >> In Vulkan VM_BIND operations are serialized but > >>>>per engine. 
> >>>> > >> > >>>> > >> So you could have something like this : > >>>> > >> > >>>> > >> VM_BIND (engine=rcs0, in_fence=fence1, out_fence=fence2) > >>>> > >> > >>>> > >> VM_BIND (engine=ccs0, in_fence=fence3, out_fence=fence4) > >>>> > >> > >>>> > >> > >>>> > >> fence1 is not signaled > >>>> > >> > >>>> > >> fence3 is signaled > >>>> > >> > >>>> > >> So the second VM_BIND will proceed before the > >>>>first VM_BIND. > >>>> > >> > >>>> > >> > >>>> > >> I guess we can deal with that scenario in > >>>>userspace by doing > >>>> the > >>>> > wait > >>>> > >> ourselves in one thread per engines. > >>>> > >> > >>>> > >> But then it makes the VM_BIND input fences useless. > >>>> > >> > >>>> > >> > >>>> > >> Daniel : what do you think? Should be rework this or just > >>>> deal with > >>>> > wait > >>>> > >> fences in userspace? > >>>> > >> > >>>> > > > >>>> > >My opinion is rework this but make the ordering via > >>>>an engine > >>>> param > >>>> > optional. > >>>> > > > >>>> > >e.g. A VM can be configured so all binds are ordered > >>>>within the > >>>> VM > >>>> > > > >>>> > >e.g. A VM can be configured so all binds accept an engine > >>>> argument > >>>> > (in > >>>> > >the case of the i915 likely this is a gem context > >>>>handle) and > >>>> binds > >>>> > >ordered with respect to that engine. > >>>> > > > >>>> > >This gives UMDs options as the later likely consumes > >>>>more KMD > >>>> > resources > >>>> > >so if a different UMD can live with binds being > >>>>ordered within > >>>> the VM > >>>> > >they can use a mode consuming less resources. > >>>> > > > >>>> > > >>>> > I think we need to be careful here if we are looking for > some > >>>> out of > >>>> > (submission) order completion of vm_bind/unbind. > >>>> > In-order completion means, in a batch of binds and > >>>>unbinds to be > >>>> > completed in-order, user only needs to specify > >>>>in-fence for the > >>>> > first bind/unbind call and the our-fence for the last > >>>> bind/unbind > >>>> > call. Also, the VA released by an unbind call can be > >>>>re-used by > >>>> > any subsequent bind call in that in-order batch. > >>>> > > >>>> > These things will break if binding/unbinding were to > >>>>be allowed > >>>> to > >>>> > go out of order (of submission) and user need to be extra > >>>> careful > >>>> > not to run into pre-mature triggereing of out-fence and bind > >>>> failing > >>>> > as VA is still in use etc. > >>>> > > >>>> > Also, VM_BIND binds the provided mapping on the specified > >>>> address > >>>> > space > >>>> > (VM). So, the uapi is not engine/context specific. > >>>> > > >>>> > We can however add a 'queue' to the uapi which can be > >>>>one from > >>>> the > >>>> > pre-defined queues, > >>>> > I915_VM_BIND_QUEUE_0 > >>>> > I915_VM_BIND_QUEUE_1 > >>>> > ... > >>>> > I915_VM_BIND_QUEUE_(N-1) > >>>> > > >>>> > KMD will spawn an async work queue for each queue which will > >>>> only > >>>> > bind the mappings on that queue in the order of submission. > >>>> > User can assign the queue to per engine or anything > >>>>like that. > >>>> > > >>>> > But again here, user need to be careful and not > >>>>deadlock these > >>>> > queues with circular dependency of fences. > >>>> > > >>>> > I prefer adding this later an as extension based on > >>>>whether it > >>>> > is really helping with the implementation. > >>>> > > >>>> > I can tell you right now that having everything on a single > >>>> in-order > >>>> > queue will not get us the perf we want. What vulkan > >>>>really wants > >>>> is one > >>>> > of two things: > >>>> > 1. 
No implicit ordering of VM_BIND ops. They just happen in > >>>> whatever > >>>> > their dependencies are resolved and we ensure ordering > >>>>ourselves > >>>> by > >>>> > having a syncobj in the VkQueue. > >>>> > 2. The ability to create multiple VM_BIND queues. We need at > >>>> least 2 > >>>> > but I don't see why there needs to be a limit besides > >>>>the limits > >>>> the > >>>> > i915 API already has on the number of engines. Vulkan could > >>>> expose > >>>> > multiple sparse binding queues to the client if it's not > >>>> arbitrarily > >>>> > limited. > >>>> > >>>> Thanks Jason, Lionel. > >>>> > >>>> Jason, what are you referring to when you say "limits the i915 API > >>>> already > >>>> has on the number of engines"? I am not sure if there is such an > uapi > >>>> today. > >>>> > >>>> There's a limit of something like 64 total engines today based on the > >>>> number of bits we can cram into the exec flags in execbuffer2. I > think > >>>> someone had an extended version that allowed more but I ripped it out > >>>> because no one was using it. Of course, execbuffer3 might not > >>>>have that > >>>> problem at all. > >>>> > >>> > >>>Thanks Jason. > >>>Ok, I am not sure which exec flag is that, but yah, execbuffer3 probably > >>>will not have this limiation. So, we need to define a VM_BIND_MAX_QUEUE > >>>and somehow export it to user (I am thinking of embedding it in > >>>I915_PARAM_HAS_VM_BIND. bits[0]->HAS_VM_BIND, bits[1-3]->'n' meaning 2^n > >>>queues. > >> > >>Ah, I think you are waking about I915_EXEC_RING_MASK (0x3f) which > execbuf3 > Yup! That's exactly the limit I was talking about. > >>will also have. So, we can simply define in vm_bind/unbind structures, > >> > >>#define I915_VM_BIND_MAX_QUEUE 64 > >> __u32 queue; > >> > >>I think that will keep things simple. > > > >Hmmm? What does execbuf2 limit has to do with how many engines > >hardware can have? I suggest not to do that. > > > >Change with added this: > > > > if (set.num_engines > I915_EXEC_RING_MASK + 1) > > return -EINVAL; > > > >To context creation needs to be undone and so let users create engine > >maps with all hardware engines, and let execbuf3 access them all. > > > > Earlier plan was to carry I915_EXEC_RING_MAP (0x3f) to execbuff3 also. > Hence, I was using the same limit for VM_BIND queues (64, or 65 if we > make it N+1). > But, as discussed in other thread of this RFC series, we are planning > to drop this I915_EXEC_RING_MAP in execbuff3. So, there won't be > any uapi that limits the number of engines (and hence the vm_bind queues > need to be supported). > > If we leave the number of vm_bind queues to be arbitrarily large > (__u32 queue_idx) then, we need to have a hashmap for queue (a wq, > work_item and a linked list) lookup from the user specified queue index. > Other option is to just put some hard limit (say 64 or 65) and use > an array of queues in VM (each created upon first use). I prefer this. > I don't get why a VM_BIND queue is any different from any other queue or userspace-visible kernel object. But I'll leave those details up to danvet or whoever else might be reviewing the implementation. --Jason > > Niranjana > > >Regards, > > > >Tvrtko > > > >> > >>Niranjana > >> > >>> > >>>> I am trying to see how many queues we need and don't want it to be > >>>> arbitrarily > >>>> large and unduely blow up memory usage and complexity in i915 > driver. > >>>> > >>>> I expect a Vulkan driver to use at most 2 in the vast majority > >>>>of cases. 
> >>>>   I am trying to see how many queues we need and don't want it to be arbitrarily large and unduly blow up memory usage and complexity in the i915 driver.
> >>>>
> >>>> I expect a Vulkan driver to use at most 2 in the vast majority of cases. I could imagine a client wanting to create more than 1 sparse queue, in which case it'll be N+1, but that's unlikely. As far as complexity goes, once you allow two, I don't think the complexity goes up by allowing N. As for memory usage, creating more queues means more memory. That's a trade-off that userspace can make. Again, the expected number here is 1 or 2 in the vast majority of cases, so I don't think you need to worry.
> >>>
> >>>Ok, will start with n=3, meaning 8 queues.
> >>>That would require us to create 8 workqueues.
> >>>We can change 'n' later if required.
> >>>
> >>>Niranjana
> >>>
> >>>> > Why? Because Vulkan has two basic kinds of bind operations and we don't want any dependencies between them:
> >>>> > 1. Immediate. These happen right after BO creation or maybe as part of vkBindImageMemory() or vkBindBufferMemory(). These don't happen on a queue and we don't want them serialized with anything. To synchronize with submit, we'll have a syncobj in the VkDevice which is signaled by all immediate bind operations, and make submits wait on it.
> >>>> > 2. Queued (sparse): These happen on a VkQueue which may be the same as a render/compute queue or may be its own queue. It's up to us what we want to advertise. From the Vulkan API PoV, this is like any other queue. Operations on it wait on and signal semaphores. If we have a VM_BIND engine, we'd provide syncobjs to wait on and signal just like we do in execbuf().
> >>>> >
> >>>> > The important thing is that we don't want one type of operation to block on the other. If immediate binds are blocking on sparse binds, it's going to cause over-synchronization issues.
> >>>> >
> >>>> > In terms of the internal implementation, I know that there's going to be a lock on the VM and that we can't actually do these things in parallel. That's fine. Once the dma_fences have signaled and we're
> >>>>
> >>>>   That's correct. It is like a single VM_BIND engine with multiple queues feeding into it.
> >>>>
> >>>> Right. As long as the queues themselves are independent and can block on dma_fences without holding up other queues, I think we're fine.
> >>>>
> >>>> > unblocked to do the bind operation, I don't care if there's a bit of synchronization due to locking. That's expected. What we can't afford to have is an immediate bind operation suddenly blocking on a sparse operation which is blocked on a compute job that's going to run for another 5ms.
> >>>>
> >>>>   As the VM_BIND queue is per VM, VM_BIND on one VM doesn't block the VM_BIND on other VMs. I am not sure about the use cases here, but just wanted to clarify.
> >>>>
> >>>> Yes, that's what I would expect.
> >>>> --Jason
> >>>>
> >>>>   Niranjana
> >>>>
> >>>> > For reference, Windows solves this by allowing arbitrarily many paging queues (what they call a VM_BIND engine/queue). That design works pretty well and solves the problems in question. Again, we could just make everything out-of-order and require using syncobjs to order things as userspace wants. That'd be fine too.
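For the "single VM_BIND engine with multiple queues feeding into it" model above, with a fixed array of per-VM queues created on first use (n=3, i.e. 8 queues), a KMD-side sketch might look roughly like the following. All names and the layout are illustrative assumptions, not the actual i915 implementation.

#include <linux/kernel.h>
#include <linux/errno.h>
#include <linux/mutex.h>
#include <linux/workqueue.h>

#define SKETCH_VM_BIND_MAX_QUEUE 8	/* n = 3, i.e. 2^3 queues as discussed */

struct sketch_vm {
	struct mutex lock;	/* serializes the actual page-table updates */
	struct workqueue_struct *bind_wq[SKETCH_VM_BIND_MAX_QUEUE];
};

struct sketch_bind_work {
	struct work_struct base;
	struct sketch_vm *vm;
	/* mapping details, in-fences to wait on, out-fence to signal, ... */
};

static void sketch_bind_fn(struct work_struct *w)
{
	struct sketch_bind_work *work =
		container_of(w, struct sketch_bind_work, base);

	/* wait for the in-fences here, then take the shared VM lock */
	mutex_lock(&work->vm->lock);
	/* program the page tables for this mapping */
	mutex_unlock(&work->vm->lock);
	/* signal the out-fence */
}

static int sketch_queue_bind(struct sketch_vm *vm,
			     struct sketch_bind_work *work, u32 queue)
{
	if (queue >= SKETCH_VM_BIND_MAX_QUEUE)
		return -EINVAL;

	/* Created on first use; lookup is a plain array index, no hashmap.
	 * (Creation would need locking in real code; elided for brevity.) */
	if (!vm->bind_wq[queue]) {
		vm->bind_wq[queue] =
			alloc_ordered_workqueue("vm_bind_q%u", 0, queue);
		if (!vm->bind_wq[queue])
			return -ENOMEM;
	}

	INIT_WORK(&work->base, sketch_bind_fn);
	queue_work(vm->bind_wq[queue], &work->base);
	return 0;
}

An ordered workqueue per queue index keeps binds on one queue completing in submission order, while the shared VM mutex is only held for the actual page-table update, so a queue blocked on an in-fence does not stall the others.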
> >>>> >
> >>>> > One more note while I'm here: danvet said something on IRC about VM_BIND queues waiting for syncobjs to materialize. We don't really want/need this. We already have all the machinery in userspace to handle wait-before-signal and waiting for syncobj fences to materialize, and that machinery is on by default. It would actually take MORE work in Mesa to turn it off and take advantage of the kernel being able to wait for syncobjs to materialize. Also, getting that right is ridiculously hard and I really don't want to get it wrong in kernel space. When we do memory fences, wait-before-signal will be a thing. We don't need to try and make it a thing for syncobj.
> >>>> > --Jason
> >>>> >
> >>>> > Thanks Jason,
> >>>> >
> >>>> > I missed the bit in the Vulkan spec that says we're allowed to have a sparse queue that does not implement either graphics or compute operations:
> >>>> >
> >>>> > "While some implementations may include VK_QUEUE_SPARSE_BINDING_BIT support in queue families that also include graphics and compute support, other implementations may only expose a VK_QUEUE_SPARSE_BINDING_BIT-only queue family."
> >>>> >
> >>>> > So it can all be a vm_bind engine that just does bind/unbind operations.
> >>>> >
> >>>> > But yes, we need another engine for the immediate/non-sparse operations.
> >>>> >
> >>>> > -Lionel
> >>>> >
> >>>> > Daniel, any thoughts?
> >>>> >
> >>>> > Niranjana
> >>>> >
> >>>> > >Matt
> >>>> > >
> >>>> > >>
> >>>> > >> Sorry I noticed this late.
> >>>> > >>
> >>>> > >> -Lionel
> >>>> > >>
--000000000000a2837d05e0f6c4c8--