Proposal for a new CS ioctl, kernel pseudo code:

    lock(&global_lock);
    serial = get_next_serial(dev);
    add_wait_command(ring, serial - 1);
    add_exec_cmdbuf(ring, user_cmdbuf);
    add_signal_command(ring, serial);
    *ring->doorbell = FIRE;
    unlock(&global_lock);

See? Just like userspace submit, but in the kernel without concurrency/preemption. Is this now safe enough for dma_fence?
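Fleshed out a little, the handler could look roughly like the sketch below. Every struct and helper name is made up for illustration (this is not real amdgpu code), and it assumes the ring space for the three packets is preallocated, so nothing on this path allocates memory or blocks on another fence:

    /* Sketch only: all names below are placeholders, not real driver symbols. */

    struct submit_ring;                      /* kernel-owned ring + doorbell   */

    struct submit_dev {
            spinlock_t          lock;        /* the single global submit lock  */
            u64                 last_serial; /* last serial handed out         */
            struct submit_ring *ring;
    };

    /* The helpers from the pseudo code above; they only write packets into
     * preallocated ring space and therefore never allocate memory. */
    void add_wait_command(struct submit_ring *ring, u64 serial);
    void add_exec_cmdbuf(struct submit_ring *ring, u64 user_cmdbuf_va);
    void add_signal_command(struct submit_ring *ring, u64 serial);
    void ring_doorbell(struct submit_ring *ring);

    static long submit_cs_ioctl(struct submit_dev *dev, u64 user_cmdbuf_va)
    {
            u64 serial;

            spin_lock(&dev->lock);

            /* Serials are handed out and submitted in the same order because
             * everything happens under one lock; there is no preemption window
             * between "get serial" and "submit". */
            serial = ++dev->last_serial;

            /* Wait for the previous submission, run the user command buffer,
             * then have the GPU signal our serial. The dma_fence for this
             * submission signals when the GPU writes back "serial". */
            add_wait_command(dev->ring, serial - 1);
            add_exec_cmdbuf(dev->ring, user_cmdbuf_va);
            add_signal_command(dev->ring, serial);

            /* Tell the hw scheduler there is new work. */
            ring_doorbell(dev->ring);

            spin_unlock(&dev->lock);
            return 0;
    }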
Marek

On Mon, May 3, 2021 at 4:36 PM Marek Olšák wrote:
>
> What about direct submit from the kernel where the process still has write access to the GPU ring buffer but doesn't use it? I think that solves your preemption example, but leaves a potential backdoor for a process to overwrite the signal commands, which shouldn't be a problem since we are OK with timeouts.
>
> Marek
>
> On Mon, May 3, 2021 at 11:23 AM Jason Ekstrand wrote:
>
>> On Mon, May 3, 2021 at 10:16 AM Bas Nieuwenhuizen wrote:
>> >
>> > On Mon, May 3, 2021 at 5:00 PM Jason Ekstrand wrote:
>> > >
>> > > Sorry for the top-post but there's no good thing to reply to here...
>> > >
>> > > One of the things pointed out to me recently by Daniel Vetter that I didn't fully understand before is that dma_buf has a very subtle second requirement beyond finite time completion: Nothing required for signaling a dma-fence can allocate memory. Why? Because the act of allocating memory may wait on your dma-fence. This, as it turns out, is a massively more strict requirement than finite time completion and, I think, throws out all of the proposals we have so far.
>> > >
>> > > Take, for instance, Marek's proposal for userspace involvement with dma-fence by asking the kernel for a next serial and the kernel trusting userspace to signal it. That doesn't work at all if allocating memory to trigger a dma-fence can blow up. There's simply no way for the kernel to trust userspace to not do ANYTHING which might allocate memory. I don't even think there's a way userspace can trust itself there. It also blows up my plan of moving the fences to transition boundaries.
>> > >
>> > > Not sure where that leaves us.
>> >
>> > Honestly, the more I look at things, the more I think userspace-signalable fences with a timeout sound like a valid solution for these issues. Especially since (as has been mentioned countless times in this email thread) userspace already has a lot of ways to cause timeouts and/or GPU hangs through GPU work.
>> >
>> > Adding a timeout on the signaling side of a dma_fence would ensure:
>> >
>> > - The dma_fence signals in finite time
>> > - If the timeout case does not allocate memory, then memory allocation is not a blocker for signaling.
>> >
>> > Of course you lose the full dependency graph, and we need to make sure garbage collection of fences works correctly when we have cycles. However, the latter sounds very doable and the former sounds like it is to some extent inevitable.
>> >
>> > I feel like I'm missing some requirement here, given that we immediately went to much more complicated things, but I can't find it. Thoughts?
>>
>> Timeouts are sufficient to protect the kernel but they make the fences unpredictable and unreliable from a userspace PoV. One of the big problems we face is that, once we expose a dma_fence to userspace, we've allowed for some pretty crazy potential dependencies that neither userspace nor the kernel can sort out. Say you have Marek's "next serial, please" proposal and a multi-threaded application. Between the time you ask the kernel for a serial and get a dma_fence, and the time you submit the work to signal that serial, your process may get preempted, something else shoved in which allocates memory, and then we end up blocking on that dma_fence. There's no way userspace can predict and defend itself from that.
>>
>> So I think where that leaves us is that there is no safe place to create a dma_fence except for inside the ioctl which submits the work, and only after any necessary memory has been allocated. That's a pretty stiff requirement. We may still be able to interact with userspace a bit more explicitly, but I think it throws any notion of userspace direct submit out the window.
>>
>> --Jason
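To make that constraint concrete: the only submission-ioctl ordering that seems to survive it is roughly the sketch below, where everything that can allocate memory or block happens before the fence exists, and nothing after the fence is published is allowed to allocate. This is purely illustrative; every helper name is invented.

    /* Illustrative ordering only; every helper below is a made-up placeholder. */
    static long submit_ioctl(struct ctx *ctx, struct submit_args *args)
    {
            struct dma_fence *fence;
            int ret;

            /* 1. Everything that may allocate memory or wait on other fences
             *    happens first: validating/pinning BOs, reserving ring space,
             *    and so on. */
            ret = reserve_everything(ctx, args);
            if (ret)
                    return ret;

            /* 2. Only now create the fence and make it visible to the rest of
             *    the world (resv objects, sync files, other processes). */
            fence = create_submit_fence(ctx);
            if (IS_ERR(fence)) {
                    unreserve_everything(ctx, args);
                    return PTR_ERR(fence);
            }
            publish_fence(ctx, args, fence);

            /* 3. From here until the hardware signals the fence, nothing on
             *    this path is allowed to allocate memory, because an
             *    allocation may wait on the very fence we just published. */
            kick_off_hardware(ctx, args, fence);
            return 0;
    }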
>> > - Bas
>> > >
>> > > --Jason
>> > >
>> > > On Mon, May 3, 2021 at 9:42 AM Alex Deucher wrote:
>> > > >
>> > > > On Sat, May 1, 2021 at 6:27 PM Marek Olšák wrote:
>> > > > >
>> > > > > On Wed, Apr 28, 2021 at 5:07 AM Michel Dänzer wrote:
>> > > > >>
>> > > > >> On 2021-04-28 8:59 a.m., Christian König wrote:
>> > > > >> > Hi Dave,
>> > > > >> >
>> > > > >> > On 27.04.21 at 21:23, Marek Olšák wrote:
>> > > > >> >> Supporting interop with any device is always possible. It depends on which drivers we need to interoperate with and update them. We've already found the path forward for amdgpu. We just need to find out how many other drivers need to be updated and evaluate the cost/benefit aspect.
>> > > > >> >>
>> > > > >> >> Marek
>> > > > >> >>
>> > > > >> >> On Tue, Apr 27, 2021 at 2:38 PM Dave Airlie <airlied@gmail.com> wrote:
>> > > > >> >>
>> > > > >> >> On Tue, 27 Apr 2021 at 22:06, Christian König <ckoenig.leichtzumerken@gmail.com> wrote:
>> > > > >> >> >
>> > > > >> >> > Correct, we wouldn't have synchronization between devices with and without user queues any more.
>> > > > >> >> >
>> > > > >> >> > That could only be a problem for A+I laptops.
>> > > > >> >>
>> > > > >> >> Since I think you mentioned you'd only be enabling this on newer chipsets, won't it be a problem for A+A where one A is a generation behind the other?
>> > > > >> >
>> > > > >> > Crap, that is a good point as well.
>> > > > >> >
>> > > > >> >> I'm not really liking where this is going btw; it seems like an ill-thought-out concept. If AMD is really going down the road of designing hw that is currently Linux incompatible, you are going to have to accept a big part of the burden in bringing this support in to more than just amd drivers for upcoming generations of gpu.
>> > > > >> >
>> > > > >> > Well, we don't really like that either, but we have no other option as far as I can see.
>> > > > >>
>> > > > >> I don't really understand what "future hw may remove support for kernel queues" means exactly. While the per-context queues can be mapped to userspace directly, they don't *have* to be, do they? I.e. the kernel driver should be able to either intercept userspace access to the queues, or in the worst case do it all itself, and provide the existing synchronization semantics as needed?
>> > > > >>
>> > > > >> Surely there are resource limits for the per-context queues, so the kernel driver needs to do some kind of virtualization / multiplexing anyway, or we'll get sad user faces when there's no queue available for .
>> > > > >>
>> > > > >> I'm probably missing something though, awaiting enlightenment. :)
>> > > > >
>> > > > > The hw interface for userspace is that the ring buffer is mapped to the process address space alongside a doorbell aperture (4K page) that isn't real memory, but when the CPU writes into it, it tells the hw scheduler that there are new GPU commands in the ring buffer. Userspace inserts all the wait, draw, and signal commands into the ring buffer and then "rings" the doorbell. It's my understanding that the ring buffer and the doorbell are always mapped in the same GPU address space as the process, which makes it very difficult to emulate the current protected ring buffers in the kernel. The VMID of the ring buffer is also not changeable.
>> > > >
>> > > > The doorbell does not have to be mapped into the process's GPU virtual address space. The CPU could write to it directly. Mapping it into the GPU's virtual address space would allow you to have a device kick off work, however, rather than the CPU. E.g., the GPU could kick off its own work, or multiple devices could kick off work without CPU involvement.
>> > > >
>> > > > Alex
>> > > >
>> > > > > The hw scheduler doesn't do any synchronization and it doesn't see any dependencies. It only chooses which queue to execute, so it's really just a simple queue manager handling the virtualization aspect and not much else.
>> > > > >
>> > > > > Marek
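For reference, on the CPU side the user-queue model described in the quoted mails boils down to something like the sketch below. The ring layout, write-pointer handling and doorbell semantics are invented for illustration; only the overall shape matters: write wait/draw/signal packets into the mapped ring, then write the doorbell.

    #include <stdint.h>

    /* Illustrative userspace view of a user-mode queue: a ring buffer and a
     * doorbell page, both mapped into the process. All layout details here
     * are invented for the example. */
    struct user_queue {
            uint32_t          *ring;      /* mapped ring buffer                 */
            uint32_t           ring_dw;   /* ring size in dwords (power of two) */
            uint32_t           wptr;      /* CPU-side write pointer, in dwords  */
            volatile uint32_t *doorbell;  /* mapped doorbell page, not real RAM */
    };

    /* Copy wait/draw/signal packets into the ring, then "ring" the doorbell
     * so the hardware scheduler picks the queue up. No kernel involvement. */
    static void queue_submit(struct user_queue *q,
                             const uint32_t *packets, uint32_t count_dw)
    {
            for (uint32_t i = 0; i < count_dw; i++)
                    q->ring[(q->wptr + i) & (q->ring_dw - 1)] = packets[i];
            q->wptr += count_dw;

            /* Make the packet writes visible before the doorbell write. */
            __atomic_thread_fence(__ATOMIC_RELEASE);

            /* The doorbell write tells the hw scheduler there is new work. */
            *q->doorbell = q->wptr;
    }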