On Thu., Jun. 3, 2021, 15:18 Daniel Vetter wrote:
> On Thu, Jun 3, 2021 at 7:53 PM Marek Olšák wrote:
> >
> > Daniel, I think what you are suggesting is that we need to enable user queues with the drm scheduler and dma_fence first, and once that works, we can investigate how much of that kernel logic can be moved to the hw. Would that work? In theory it shouldn't matter whether the kernel does it or the hw does it. It's the same code, just in a different place.
>
> Yeah I guess that's another way to look at it. Maybe in practice you'll just move it from the kernel to userspace, which then programs the hw waits directly into its IB. That's at least how I'd do it on i915, assuming I'd have such hw. So these fences that userspace programs directly (to sync within itself) won't even show up as dependencies in the kernel.
>
> And then yes on the other side you can lift work from the drm/scheduler wrt dependencies you get in the kernel (whether explicit sync with sync_file, or implicit sync fished out of dma_resv) and program the hw directly that way. That would mean that userspace won't fill the ringbuffer directly, but the kernel would do that, so that you have space to stuff in the additional waits. Again assuming the i915 hw model; maybe it works differently on amd. Iirc we have some of that already in the i915 scheduler, but I'd need to recheck how much it really uses the hw semaphores.

I was thinking we would pass per-process syncobj handles and buffer handles into commands in the user queue, or something equivalent. We do have a large degree of programmability in the hw, so we can do something like that. The question is whether this high-level user->hw interface would have any advantage over trivial polling on memory, etc. My impression is no, because the kernel would be robust enough that it wouldn't matter what userspace does, but I don't know.

Anyway, all we need is user queues, and what you proposed seems totally sufficient.

Marek

> -Daniel
>
> > Thanks,
> > Marek
> >
> > On Thu, Jun 3, 2021 at 7:22 AM Daniel Vetter wrote:
> >>
> >> On Thu, Jun 3, 2021 at 12:55 PM Marek Olšák wrote:
> >> >
> >> > On Thu., Jun. 3, 2021, 06:03 Daniel Vetter wrote:
> >> >>
> >> >> On Thu, Jun 03, 2021 at 04:20:18AM -0400, Marek Olšák wrote:
> >> >> > On Thu, Jun 3, 2021 at 3:47 AM Daniel Vetter wrote:
> >> >> >
> >> >> > > On Wed, Jun 02, 2021 at 11:16:39PM -0400, Marek Olšák wrote:
> >> >> > > > On Wed, Jun 2, 2021 at 2:48 PM Daniel Vetter wrote:
> >> >> > > >
> >> >> > > > > On Wed, Jun 02, 2021 at 05:38:51AM -0400, Marek Olšák wrote:
> >> >> > > > > > On Wed, Jun 2, 2021 at 5:34 AM Marek Olšák <maraeo@gmail.com> wrote:
> >> >> > > > > >
> >> >> > > > > > > Yes, we can't break anything because we don't want to complicate things for us. It's pretty much all NAK'd already. We are trying to gather more knowledge and then make better decisions.
> >> >> > > > > > >
> >> >> > > > > > > The idea we are considering is that we'll expose memory-based sync objects to userspace for read only, and the kernel or hw will strictly control the memory writes to those sync objects. The hole in that idea is that userspace can decide not to signal a job, so even if userspace can't overwrite memory-based sync object states arbitrarily, it can still decide not to signal them, and then a future fence is born.
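[Editor's note: the memory-based sync object idea above can be pictured with a small sketch. This is only an illustration of the concept under discussion, not an existing amdgpu or i915 interface; the names are made up, and the assumption is a 64-bit monotonic counter that the kernel/hw writes and that userspace can only map read-only and poll.]

/* Illustrative only: a 64-bit monotonic sync counter that the kernel or hw
 * writes and userspace maps read-only.  Userspace may poll it but cannot
 * forge a signal; the timeout is what turns a "future fence" that never
 * signals into a detectable hang instead of an unbounded wait. */
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>
#include <time.h>

struct memory_syncobj {
        _Atomic uint64_t *value;   /* read-only mapping from userspace's side */
};

static uint64_t now_ns(void)
{
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return (uint64_t)ts.tv_sec * 1000000000ull + ts.tv_nsec;
}

/* Wait until *value >= wait_point, or give up after timeout_ns. */
static bool memory_syncobj_wait(struct memory_syncobj *obj,
                                uint64_t wait_point, uint64_t timeout_ns)
{
        uint64_t deadline = now_ns() + timeout_ns;

        while (atomic_load_explicit(obj->value, memory_order_acquire) < wait_point) {
                if (now_ns() >= deadline)
                        return false;   /* unsignalled: treat as a hang of the producer */
        }
        return true;                    /* signalled */
}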
> >> >> > > > > > >
> >> >> > > > > >
> >> >> > > > > > This would actually be treated as a GPU hang caused by that context, so it should be fine.
> >> >> > > > >
> >> >> > > > > This is practically what I proposed already, except you're not doing it with dma_fence. And on the memory fence side this also doesn't actually give what you want for that compute model.
> >> >> > > > >
> >> >> > > > > This seems like a bit of a worst-of-both-worlds approach to me? Tons of work in the kernel to hide these not-dma_fence-but-almost, and still pain to actually drive the hardware like it should be for compute or direct display.
> >> >> > > > >
> >> >> > > > > Also maybe I've missed it, but I didn't see any replies to my suggestion how to fake the entire dma_fence stuff on top of new hw. Would be interesting to know what doesn't work there instead of amd folks going off into internal again and then coming back with another rfc from out of nowhere :-)
> >> >> > > >
> >> >> > > > Going internal again is probably a good idea to spare you the long discussions and not waste your time, but we haven't talked about the dma_fence stuff internally other than acknowledging that it can be solved.
> >> >> > > >
> >> >> > > > The compute use case already uses the hw as-is with no inter-process sharing, which mostly keeps the kernel out of the picture. It uses glFinish to sync with GL.
> >> >> > > >
> >> >> > > > The gfx use case needs new hardware logic to support implicit and explicit sync. When we propose a solution, it's usually torn apart the next day by ourselves.
> >> >> > > >
> >> >> > > > Since we are talking about next hw or next next hw, preemption should be better.
> >> >> > > >
> >> >> > > > user queue = user-mapped ring buffer
> >> >> > > >
> >> >> > > > For implicit sync, we will only let userspace lock access to a buffer via a user queue, which waits for the per-buffer sequence counter in memory to be >= the number assigned by the kernel, and later unlock the access with another command, which increments the per-buffer sequence counter in memory with atomic_inc regardless of the number assigned by the kernel. The kernel counter and the counter in memory can be out of sync, and I'll explain why that's OK. If a process increments the kernel counter but not the memory counter, that's its problem and it's the same as a GPU hang caused by that process. If a process increments the memory counter but not the kernel counter, the ">=" condition alongside atomic_inc guarantees that signaling n will signal n+1, so it will never deadlock, but it will also effectively disable synchronization. This method of disabling synchronization is similar to a process corrupting the buffer, which should be fine. Can you find any flaw in it? I can't find any.
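[Editor's note: a rough model of the per-buffer sequence-counter scheme described above, written as CPU code purely for clarity; on the real hw the wait and the increment would be packets executed from the user queue, and the structure and function names here are invented for the illustration.]

#include <stdatomic.h>
#include <stdint.h>

struct buffer_sync {
        _Atomic uint64_t seq;   /* per-buffer sequence counter living in memory */
};

/* "Lock" packet: stall this queue until the buffer was released by every
 * earlier user.  wait_point is the number assigned by the kernel at submit. */
static void buffer_lock(struct buffer_sync *b, uint64_t wait_point)
{
        while (atomic_load_explicit(&b->seq, memory_order_acquire) < wait_point)
                ;       /* the hw would stall the queue here, not spin on a CPU */
}

/* "Unlock" packet: signal the next user.  It increments whatever value is in
 * memory instead of writing the kernel's number, which is why a skipped or
 * extra signal can only disable synchronization (the ">=" wait still passes),
 * never wind the counter back or deadlock waiters. */
static void buffer_unlock(struct buffer_sync *b)
{
        atomic_fetch_add_explicit(&b->seq, 1, memory_order_release);
}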
> >> >> > >
> >> >> > > Hm maybe I misunderstood what exactly you wanted to do earlier. That kind of "we let userspace free-wheel whatever it wants, kernel ensures correctness of the resulting chain of dma_fence by resetting the entire context" is what I proposed too.
> >> >> > >
> >> >> > > Like you say, userspace is allowed to render garbage already.
> >> >> > >
> >> >> > > > The explicit submit can be done by userspace (if there is no synchronization), but we plan to use the kernel to do it for implicit sync. Essentially, the kernel will receive a buffer list and addresses of wait commands in the user queue. It will assign new sequence numbers to all buffers and write those numbers into the wait commands, and ring the hw doorbell to start execution of that queue.
> >> >> > >
> >> >> > > Yeah for implicit sync I think doing it in the kernel and using drm/scheduler to sort out the dma_fence dependencies is probably best. Since you can filter out which dma_fence you hand to the scheduler for dependency tracking, you can filter out your own ones and let the hw handle those directly (depending on how much your hw can do and all that). On i915 we might do that to be able to use MI_SEMAPHORE_WAIT/SIGNAL functionality in the hw and fw scheduler.
> >> >> > >
> >> >> > > For buffer tracking with implicit sync I think cleanest is probably to still keep them wrapped as dma_fence and stuffed into dma_resv, but conceptually it's the same. If we let every driver reinvent their own buffer tracking just because the hw works a bit differently, it'll be a mess.
> >> >> > >
> >> >> > > Wrt wait commands: I'm honestly not sure why you'd do that. Userspace gets to keep the pieces if it gets it wrong. You do still need to handle external dma_fence though, hence the drm/scheduler frontend to sort these out.
> >> >> >
> >> >> > The reason is to prevent a lower-privileged process from deadlocking/hanging a higher-privileged process where the kernel can't tell who did it. If the implicit-sync sequence counter is read-only to userspace and only incrementable by the unlock-signal command after the lock-wait command appeared in the same queue (both together forming a critical section), userspace can't manipulate it arbitrarily and we get almost the exact same behavior as implicit sync has today. That means any implicitly-sync'd buffer from any process can be fully trusted by a compositor to signal in a finite time, and possibly even trusted by the kernel. The only thing that's different is that a malicious process can disable implicit sync for a buffer in all processes/kernel, but it can't hang other processes/kernel (it can only hang itself and the kernel will be notified). So I'm a happy panda now. :)
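[Editor's note: as a companion to the description a few paragraphs up of the kernel writing sequence numbers into the wait commands and ringing the doorbell, here is a hypothetical sketch of that submit step. None of these structures or helpers exist in amdgpu; they are stand-ins to make the flow concrete, and details such as read vs. write access or multiple waits per buffer are ignored.]

#include <stdint.h>

/* Hypothetical per-buffer bookkeeping: where the user queue's wait command
 * reads its wait point from, and the kernel's own copy of the counter. */
struct implicit_sync_buffer {
        uint64_t *wait_cmd_operand;     /* operand slot of the lock/wait packet */
        uint64_t  kernel_seq;           /* kernel-tracked sequence number */
};

/* Kernel-side implicit-sync submit: for each buffer in the submitted list,
 * patch in the point the queue must wait for before touching the buffer,
 * advance the kernel-side counter to account for the release that the
 * queue's unlock command will perform, then ring the doorbell so the hw
 * starts executing the queue, waits included. */
static void submit_user_queue(struct implicit_sync_buffer **bufs, unsigned int count,
                              void (*ring_doorbell)(void))
{
        for (unsigned int i = 0; i < count; i++) {
                *bufs[i]->wait_cmd_operand = bufs[i]->kernel_seq;  /* wait for the prior release */
                bufs[i]->kernel_seq++;                             /* point this job will signal */
        }
        ring_doorbell();
}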
> >> >>
> >> >> Yeah I think that's not going to work too well, and is too many piled-up hacks. Within a drm_file fd you can do whatever you feel like, since it's just one client.
> >> >>
> >> >> But once implicit sync kicks in I think you need to go with dma_fence and drm/scheduler to handle the dependencies, and tdr kicking in. With the dma_fence you do know who's the offender - you might not know why, but that doesn't matter, you just shred the entire context and let that userspace figure out the details.
> >> >>
> >> >> I think trying to make memory fences work as implicit sync directly, without wrapping them in a dma_fence and assorted guarantees, will just not work.
> >> >>
> >> >> And once you do wrap them in dma_fence, then all the other problems go away: cross-driver sync, syncfiles, ... So I really don't see the benefit of this half-way approach.
> >> >>
> >> >> Yes there's going to be a tad bit of overhead, but that's already there in the current model. And it can't hurt to have a bit of motivation for compositors to switch over to userspace memory fences properly.
> >> >
> >> > Well, Christian thinks that we need a high-level synchronization primitive in hw. I don't know myself and you may be right. A software scheduler with user queues might be one option. My part is only to find out how much of the scheduler logic can be moved to the hardware.
> >> >
> >> > We plan to have memory timeline semaphores, or simply monotonic counters, and a fence will be represented by the counter address and a constant sequence number for the <= comparison. One counter can represent up to 2^64 different fences. Giving any process write access to a fence is the same as giving it the power to manipulate the signalled state of a sequence of up to 2^64 fences. That could mess up a lot of things. However, if the hardware had a high-level synchronization primitive with access rights and a limited set of clearly defined operations such that we can formally prove whether it's safe for everybody, we could have a solution where we don't have to involve the software scheduler and just let the hardware do everything.
> >>
> >> I don't think hw access rights control on memory fences makes sense. There's two cases:
> >>
> >> - brave new world of native userspace memory fences. Currently that's compute, maybe direct display vk, hopefully/eventually compositors and desktops too. If you get an untrusted fence, you need to have fallback logic no matter what, and by design. vk is explicit in stating that if things hang, you get to keep all the pieces. So the compositor needs to _always_ treat userspace memory fences as hostile, wrap them in a timeout, and have a fallback frame/scene in its rendering path. Probably the same for the kernel on the display side, maybe down to the display hw picking the "right" frame depending upon the fence value right before scanout, as we discussed earlier. There's no point in hw access rights because by design, even if no one tampers with your fence, it might never signal. So you have to cope with a hostile fence from untrusted sources anyway (and within an app it's trusted and you just die, as in stuck in an endless loop until someone sends a SIGKILL, when you deadlock or get it wrong some other way).
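[Editor's note: the shape of such a userspace memory timeline fence, and the "always treat it as hostile" pattern from the first case above, might look roughly like the sketch below. The types and helpers are invented for illustration and do not correspond to any existing API.]

#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* A fence is just a counter address plus a constant sequence number; it is
 * signalled once seq <= *counter (the counter only ever moves forward). */
struct mem_fence {
        _Atomic uint64_t *counter;      /* monotonic timeline in memory */
        uint64_t seq;
};

static bool mem_fence_signalled(const struct mem_fence *f)
{
        return f->seq <= atomic_load_explicit(f->counter, memory_order_acquire);
}

/* Compositor-style fallback: right before the presentation deadline, scan out
 * the new frame only if its fence made it in time, otherwise keep showing the
 * last known-good frame.  The fence may simply never signal, so this fallback
 * has to exist by design, independent of any hw access rights. */
static const void *pick_frame_for_scanout(const struct mem_fence *new_frame_fence,
                                          const void *new_frame,
                                          const void *prev_frame)
{
        return mem_fence_signalled(new_frame_fence) ? new_frame : prev_frame;
}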
> >> - old compat mode where we need to use dma_fence, otherwise we end up with another round of "amdgpu redefines implicit sync in incompatible ways", and Christian&me don't even know yet how to fix the current round without breaking use-cases badly. So it has to be dma_fence, and it has to be the same rules as on old hw, or it's just not going to work. This means you need to force in-order retiring of fences in the kernel, and you need to enforce timeouts. None of this needs hw access rights control, since once more it's just software constructs in the kernel. As a first approximation, drm/scheduler + the fence chain we already have in syncobj is probably good enough for this. E.g. if userspace rolls the fence backwards then the kernel just ignores that, because its internal dma_fence has signalled, and dma_fences never unsignal (it's a bit in struct dma_fence; once it's set we stop checking the hw). And if it doesn't signal in time, then tdr jumps in and fixes the mess. Hw access control doesn't fix anything here, because you have to deal with timeout and ordering already anyway, or the dma_fence contract is broken.
> >>
> >> So in both cases hw access control gains you nothing (at least I'm not seeing anything), it just makes the design more tricky. "Userspace can manipulate the fences" is _intentionally_ how these things work; we need a design that works with that hw design, not one that works against it and somehow tries to get us back to the old world, but only halfway (i.e. not useful at all, since old userspace needs us to go all the way back to dma_fence, and new userspace wants to fully exploit userspace memory fences without artificial limitations for no reason).
> >> -Daniel
> >> --
> >> Daniel Vetter
> >> Software Engineer, Intel Corporation
> >> http://blog.ffwll.ch
> >
>
> --
> Daniel Vetter
> Software Engineer, Intel Corporation
> http://blog.ffwll.ch
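[Editor's note: to make the "dma_fences never unsignal" point in the compat-mode case above concrete, a kernel-side wrapper can latch the signalled state the first time the memory counter reaches the fence's point, so a counter rolled backwards afterwards is simply ignored. This is only an illustrative sketch, not the real struct dma_fence or its locking; a timeout handler (the tdr mentioned above) would then be the only thing resolving a fence that never reaches this point.]

#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical wrapper around a userspace memory fence that gives it
 * dma_fence-like semantics on the kernel side: once signalled, always
 * signalled, regardless of what later happens to the counter in memory. */
struct wrapped_fence {
        _Atomic uint64_t *counter;
        uint64_t seq;
        bool signalled;         /* latched; the counter is never consulted again once set */
};

static bool wrapped_fence_signalled(struct wrapped_fence *f)
{
        if (!f->signalled &&
            atomic_load_explicit(f->counter, memory_order_acquire) >= f->seq)
                f->signalled = true;    /* userspace rolling the counter back is now irrelevant */
        return f->signalled;
}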