On Thu., Jun. 3, 2021, 15:18 Daniel Vetter wrote:
> On Thu, Jun 3, 2021 at 7:53 PM Marek Olšák wrote:
> >
> > Daniel, I think what you are suggesting is that we need to enable user queues with the drm scheduler and dma_fence first, and once that works, we can investigate how much of that kernel logic can be moved to the hw. Would that work? In theory it shouldn't matter whether the kernel does it or the hw does it. It's the same code, just in a different place.
>
> Yeah I guess that's another way to look at it. Maybe in practice you'll just move it from the kernel to userspace, which then programs the hw waits directly into its IB. That's at least how I'd do it on i915, assuming I'd have such hw. So these fences that userspace programs directly (to sync within itself) won't even show up as dependencies in the kernel.
>
> And then yes on the other side you can lift work from the drm/scheduler wrt dependencies you get in the kernel (whether explicit sync with sync_file, or implicit sync fished out of dma_resv) and program the hw directly that way. That would mean that userspace won't fill the ringbuffer directly, but the kernel would do that, so that you have space to stuff in the additional waits. Again assuming the i915 hw model; maybe it works differently on amd. Iirc we have some of that already in the i915 scheduler, but I'd need to recheck how much it really uses the hw semaphores.

I was thinking we would pass per-process syncobj handles and buffer handles into commands in the user queue, or something equivalent. We do have a large degree of programmability in the hw, so we can do something like that. The question is whether this high-level user->hw interface would have any advantage over trivial polling on memory, etc. My impression is no, because the kernel would be robust enough that it wouldn't matter what userspace does, but I don't know.

Anyway, all we need is user queues, and what you proposed seems totally sufficient.

Marek

> -Daniel
>
> > Thanks,
> > Marek
> >
> > On Thu, Jun 3, 2021 at 7:22 AM Daniel Vetter wrote:
> >>
> >> On Thu, Jun 3, 2021 at 12:55 PM Marek Olšák wrote:
> >> >
> >> > On Thu., Jun. 3, 2021, 06:03 Daniel Vetter wrote:
> >> >>
> >> >> On Thu, Jun 03, 2021 at 04:20:18AM -0400, Marek Olšák wrote:
> >> >> > On Thu, Jun 3, 2021 at 3:47 AM Daniel Vetter wrote:
> >> >> >
> >> >> > > On Wed, Jun 02, 2021 at 11:16:39PM -0400, Marek Olšák wrote:
> >> >> > > > On Wed, Jun 2, 2021 at 2:48 PM Daniel Vetter wrote:
> >> >> > > >
> >> >> > > > > On Wed, Jun 02, 2021 at 05:38:51AM -0400, Marek Olšák wrote:
> >> >> > > > > > On Wed, Jun 2, 2021 at 5:34 AM Marek Olšák <maraeo@gmail.com> wrote:
> >> >> > > > > >
> >> >> > > > > > > Yes, we can't break anything because we don't want to complicate things for us. It's pretty much all NAK'd already. We are trying to gather more knowledge and then make better decisions.
> >> >> > > > > > >
> >> >> > > > > > > The idea we are considering is that we'll expose memory-based sync objects to userspace for read only, and the kernel or hw will strictly control the memory writes to those sync objects. The hole in that idea is that userspace can decide not to signal a job, so even if userspace can't overwrite memory-based sync object states arbitrarily, it can still decide not to signal them, and then a future fence is born.
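[Editor's note: the memory-based sync object idea above can be pictured with a small sketch. This is only an illustration of the concept under discussion, not an existing amdgpu or i915 interface; the names are made up, and the assumption is a 64-bit monotonic counter that the kernel/hw writes and that userspace can only map read-only and poll.]

/* Illustrative only: a 64-bit monotonic sync counter that the kernel or hw
 * writes and userspace maps read-only.  Userspace may poll it but cannot
 * forge a signal; the timeout is what turns a "future fence" that never
 * signals into a detectable hang instead of an unbounded wait. */
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>
#include <time.h>

struct memory_syncobj {
        _Atomic uint64_t *value;   /* read-only mapping from userspace's side */
};

static uint64_t now_ns(void)
{
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return (uint64_t)ts.tv_sec * 1000000000ull + ts.tv_nsec;
}

/* Wait until *value >= wait_point, or give up after timeout_ns. */
static bool memory_syncobj_wait(struct memory_syncobj *obj,
                                uint64_t wait_point, uint64_t timeout_ns)
{
        uint64_t deadline = now_ns() + timeout_ns;

        while (atomic_load_explicit(obj->value, memory_order_acquire) < wait_point) {
                if (now_ns() >= deadline)
                        return false;   /* unsignalled: treat as a hang of the producer */
        }
        return true;                    /* signalled */
}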
> >> >> > > > > > >
> >> >> > > > > >
> >> >> > > > > > This would actually be treated as a GPU hang caused by that context, so it should be fine.
> >> >> > > > >
> >> >> > > > > This is practically what I proposed already, except you're not doing it with dma_fence. And on the memory fence side this also doesn't actually give what you want for that compute model.
> >> >> > > > >
> >> >> > > > > This seems like a bit of a worst-of-both-worlds approach to me? Tons of work in the kernel to hide these not-dma_fence-but-almost, and still pain to actually drive the hardware like it should be for compute or direct display.
> >> >> > > > >
> >> >> > > > > Also maybe I've missed it, but I didn't see any replies to my suggestion how to fake the entire dma_fence stuff on top of new hw. Would be interesting to know what doesn't work there instead of amd folks going off into internal again and then coming back with another rfc from out of nowhere :-)
> >> >> > > >
> >> >> > > > Going internal again is probably a good idea to spare you the long discussions and not waste your time, but we haven't talked about the dma_fence stuff internally other than acknowledging that it can be solved.
> >> >> > > >
> >> >> > > > The compute use case already uses the hw as-is with no inter-process sharing, which mostly keeps the kernel out of the picture. It uses glFinish to sync with GL.
> >> >> > > >
> >> >> > > > The gfx use case needs new hardware logic to support implicit and explicit sync. When we propose a solution, it's usually torn apart the next day by ourselves.
> >> >> > > >
> >> >> > > > Since we are talking about next hw or next next hw, preemption should be better.
> >> >> > > >
> >> >> > > > user queue = user-mapped ring buffer
> >> >> > > >
> >> >> > > > For implicit sync, we will only let userspace lock access to a buffer via a user queue, which waits for the per-buffer sequence counter in memory to be >= the number assigned by the kernel, and later unlock the access with another command, which increments the per-buffer sequence counter in memory with atomic_inc regardless of the number assigned by the kernel. The kernel counter and the counter in memory can be out of sync, and I'll explain why that's OK. If a process increments the kernel counter but not the memory counter, that's its problem and it's the same as a GPU hang caused by that process. If a process increments the memory counter but not the kernel counter, the ">=" condition alongside atomic_inc guarantees that signaling n will signal n+1, so it will never deadlock, but it will also effectively disable synchronization. This method of disabling synchronization is similar to a process corrupting the buffer, which should be fine. Can you find any flaw in it? I can't find any.
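[Editor's note: a rough model of the per-buffer sequence-counter scheme described above, written as CPU code purely for clarity; on the real hw the wait and the increment would be packets executed from the user queue, and the structure and function names here are invented for the illustration.]

#include <stdatomic.h>
#include <stdint.h>

struct buffer_sync {
        _Atomic uint64_t seq;   /* per-buffer sequence counter living in memory */
};

/* "Lock" packet: stall this queue until the buffer was released by every
 * earlier user.  wait_point is the number assigned by the kernel at submit. */
static void buffer_lock(struct buffer_sync *b, uint64_t wait_point)
{
        while (atomic_load_explicit(&b->seq, memory_order_acquire) < wait_point)
                ;       /* the hw would stall the queue here, not spin on a CPU */
}

/* "Unlock" packet: signal the next user.  It increments whatever value is in
 * memory instead of writing the kernel's number, which is why a skipped or
 * extra signal can only disable synchronization (the ">=" wait still passes),
 * never wind the counter back or deadlock waiters. */
static void buffer_unlock(struct buffer_sync *b)
{
        atomic_fetch_add_explicit(&b->seq, 1, memory_order_release);
}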
> >> >> > >
> >> >> > > Hm maybe I misunderstood what exactly you wanted to do earlier. That kind of "we let userspace free-wheel whatever it wants, kernel ensures correctness of the resulting chain of dma_fence by resetting the entire context" is what I proposed too.
> >> >> > >
> >> >> > > Like you say, userspace is allowed to render garbage already.
> >> >> > >
> >> >> > > > The explicit submit can be done by userspace (if there is no synchronization), but we plan to use the kernel to do it for implicit sync. Essentially, the kernel will receive a buffer list and addresses of wait commands in the user queue. It will assign new sequence numbers to all buffers and write those numbers into the wait commands, and ring the hw doorbell to start execution of that queue.
> >> >> > >
> >> >> > > Yeah for implicit sync I think doing it in the kernel and using drm/scheduler to sort out the dma_fence dependencies is probably best. Since you can filter out which dma_fence you hand to the scheduler for dependency tracking, you can filter out your own ones and let the hw handle those directly (depending on how much your hw can do and all that). On i915 we might do that to be able to use MI_SEMAPHORE_WAIT/SIGNAL functionality in the hw and fw scheduler.
> >> >> > >
> >> >> > > For buffer tracking with implicit sync I think cleanest is probably to still keep them wrapped as dma_fence and stuffed into dma_resv, but conceptually it's the same. If we let every driver reinvent their own buffer tracking just because the hw works a bit differently, it'll be a mess.
> >> >> > >
> >> >> > > Wrt wait commands: I'm honestly not sure why you'd do that. Userspace gets to keep the pieces if it gets it wrong. You do still need to handle external dma_fence though, hence the drm/scheduler frontend to sort these out.
> >> >> >
> >> >> > The reason is to prevent a lower-privileged process from deadlocking/hanging a higher-privileged process where the kernel can't tell who did it. If the implicit-sync sequence counter is read-only to userspace and only incrementable by the unlock-signal command after the lock-wait command appeared in the same queue (both together forming a critical section), userspace can't manipulate it arbitrarily and we get almost the exact same behavior as implicit sync has today. That means any implicitly-sync'd buffer from any process can be fully trusted by a compositor to signal in a finite time, and possibly even trusted by the kernel. The only thing that's different is that a malicious process can disable implicit sync for a buffer in all processes/kernel, but it can't hang other processes/kernel (it can only hang itself and the kernel will be notified). So I'm a happy panda now. :)
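[Editor's note: as a companion to the description a few paragraphs up of the kernel writing sequence numbers into the wait commands and ringing the doorbell, here is a hypothetical sketch of that submit step. None of these structures or helpers exist in amdgpu; they are stand-ins to make the flow concrete, and details such as read vs. write access or multiple waits per buffer are ignored.]

#include <stdint.h>

/* Hypothetical per-buffer bookkeeping: where the user queue's wait command
 * reads its wait point from, and the kernel's own copy of the counter. */
struct implicit_sync_buffer {
        uint64_t *wait_cmd_operand;     /* operand slot of the lock/wait packet */
        uint64_t  kernel_seq;           /* kernel-tracked sequence number */
};

/* Kernel-side implicit-sync submit: for each buffer in the submitted list,
 * patch in the point the queue must wait for before touching the buffer,
 * advance the kernel-side counter to account for the release that the
 * queue's unlock command will perform, then ring the doorbell so the hw
 * starts executing the queue, waits included. */
static void submit_user_queue(struct implicit_sync_buffer **bufs, unsigned int count,
                              void (*ring_doorbell)(void))
{
        for (unsigned int i = 0; i < count; i++) {
                *bufs[i]->wait_cmd_operand = bufs[i]->kernel_seq;  /* wait for the prior release */
                bufs[i]->kernel_seq++;                             /* point this job will signal */
        }
        ring_doorbell();
}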
> >> >>
> >> >> Yeah I think that's not going to work too well, and is too many piled-up hacks. Within a drm_file fd you can do whatever you feel like, since it's just one client.
> >> >>
> >> >> But once implicit sync kicks in I think you need to go with dma_fence and drm/scheduler to handle the dependencies, and tdr kicking in. With the dma_fence you do know who's the offender - you might not know why, but that doesn't matter, you just shred the entire context and let that userspace figure out the details.
> >> >>
> >> >> I think trying to make memory fences work as implicit sync directly, without wrapping them in a dma_fence and assorted guarantees, will just not work.
> >> >>
> >> >> And once you do wrap them in dma_fence, then all the other problems go away: cross-driver sync, syncfiles, ... So I really don't see the benefit of this half-way approach.
> >> >>
> >> >> Yes there's going to be a tad bit of overhead, but that's already there in the current model. And it can't hurt to have a bit of motivation for compositors to switch over to userspace memory fences properly.
> >> >
> >> > Well, Christian thinks that we need a high-level synchronization primitive in hw. I don't know myself and you may be right. A software scheduler with user queues might be one option. My part is only to find out how much of the scheduler logic can be moved to the hardware.
> >> >
> >> > We plan to have memory timeline semaphores, or simply monotonic counters, and a fence will be represented by the counter address and a constant sequence number for the <= comparison. One counter can represent up to 2^64 different fences. Giving any process write access to a fence is the same as giving it the power to manipulate the signalled state of a sequence of up to 2^64 fences. That could mess up a lot of things. However, if the hardware had a high-level synchronization primitive with access rights and a limited set of clearly defined operations such that we can formally prove whether it's safe for everybody, we could have a solution where we don't have to involve the software scheduler and just let the hardware do everything.
> >>
> >> I don't think hw access rights control on memory fences makes sense. There's two cases:
> >>
> >> - brave new world of native userspace memory fences. Currently that's compute, maybe direct display vk, hopefully/eventually compositors and desktops too. If you get an untrusted fence, you need to have fallback logic no matter what, and by design. vk is explicit in stating that if things hang, you get to keep all the pieces. So the compositor needs to _always_ treat userspace memory fences as hostile, wrap them in a timeout, and have a fallback frame/scene in its rendering path. Probably the same for the kernel on the display side, maybe down to the display hw picking the "right" frame depending upon the fence value right before scanout, as we discussed earlier. There's no point in hw access rights because by design, even if no one tampers with your fence, it might never signal. So you have to cope with a hostile fence from untrusted sources anyway (and within an app it's trusted and you just die, as in stuck in an endless loop until someone sends a SIGKILL, when you deadlock or get it wrong some other way).
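[Editor's note: the shape of such a userspace memory timeline fence, and the "always treat it as hostile" pattern from the first case above, might look roughly like the sketch below. The types and helpers are invented for illustration and do not correspond to any existing API.]

#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* A fence is just a counter address plus a constant sequence number; it is
 * signalled once seq <= *counter (the counter only ever moves forward). */
struct mem_fence {
        _Atomic uint64_t *counter;      /* monotonic timeline in memory */
        uint64_t seq;
};

static bool mem_fence_signalled(const struct mem_fence *f)
{
        return f->seq <= atomic_load_explicit(f->counter, memory_order_acquire);
}

/* Compositor-style fallback: right before the presentation deadline, scan out
 * the new frame only if its fence made it in time, otherwise keep showing the
 * last known-good frame.  The fence may simply never signal, so this fallback
 * has to exist by design, independent of any hw access rights. */
static const void *pick_frame_for_scanout(const struct mem_fence *new_frame_fence,
                                          const void *new_frame,
                                          const void *prev_frame)
{
        return mem_fence_signalled(new_frame_fence) ? new_frame : prev_frame;
}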
> >> - old compat mode where we need to use dma_fence, otherwise we end up with another round of "amdgpu redefines implicit sync in incompatible ways", and Christian&me don't even know yet how to fix the current round without breaking use-cases badly. So it has to be dma_fence, and it has to be the same rules as on old hw, or it's just not going to work. This means you need to force in-order retiring of fences in the kernel, and you need to enforce timeouts. None of this needs hw access rights control, since once more it's just software constructs in the kernel. As a first approximation, drm/scheduler + the fence chain we already have in syncobj is probably good enough for this. E.g. if userspace rolls the fence backwards then the kernel just ignores that, because its internal dma_fence has signalled, and dma_fences never unsignal (it's a bit in struct dma_fence; once it's set we stop checking the hw). And if it doesn't signal in time, then tdr jumps in and fixes the mess. Hw access control doesn't fix anything here, because you have to deal with timeout and ordering already anyway, or the dma_fence contract is broken.
> >>
> >> So in both cases hw access control gains you nothing (at least I'm not seeing anything), it just makes the design more tricky. "Userspace can manipulate the fences" is _intentionally_ how these things work; we need a design that works with that hw design, not one that works against it and somehow tries to get us back to the old world, but only halfway (i.e. not useful at all, since old userspace needs us to go all the way back to dma_fence, and new userspace wants to fully exploit userspace memory fences without artificial limitations for no reason).
> >> -Daniel
> >> --
> >> Daniel Vetter
> >> Software Engineer, Intel Corporation
> >> http://blog.ffwll.ch
> >
>
> --
> Daniel Vetter
> Software Engineer, Intel Corporation
> http://blog.ffwll.ch
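[Editor's note: to make the "dma_fences never unsignal" point in the compat-mode case above concrete, a kernel-side wrapper can latch the signalled state the first time the memory counter reaches the fence's point, so a counter rolled backwards afterwards is simply ignored. This is only an illustrative sketch, not the real struct dma_fence or its locking; a timeout handler (the tdr mentioned above) would then be the only thing resolving a fence that never reaches this point.]

#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical wrapper around a userspace memory fence that gives it
 * dma_fence-like semantics on the kernel side: once signalled, always
 * signalled, regardless of what later happens to the counter in memory. */
struct wrapped_fence {
        _Atomic uint64_t *counter;
        uint64_t seq;
        bool signalled;         /* latched; the counter is never consulted again once set */
};

static bool wrapped_fence_signalled(struct wrapped_fence *f)
{
        if (!f->signalled &&
            atomic_load_explicit(f->counter, memory_order_acquire) >= f->seq)
                f->signalled = true;    /* userspace rolling the counter back is now irrelevant */
        return f->signalled;
}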