We're happy to bear the pain of being the ones setting
strict and unreasonable expectations. To me, this 'present
ioctl' falls into the uncanny valley of the kernel trying to
bear too much of the weight to be tractable, whilst not
bearing enough of the weight to be useful for winsys.
So here are my principles for a counter-strawman:
Remove the 'return fence'. Burn it with fire, do not look
back. Modern presentation pipelines are not necessarily 1:1,
they are not necessarily FIFO (as opposed to mailbox), and
they are not necessarily round-robin either. The current
proposal provides no tangible benefits to modern userspace,
and fixing that requires either hobbling userspace to remove
capability and flexibility (ironic given that the motivation
for this is all about userspace flexibility?), or pushing so
much complexity into the kernel that we break it forever
(you can't compile Mutter's per-frame decision tree into
eBPF).
Give us a primitive representing work completion, so we
can keep optimistically pipelining operations. We're happy
to pass around explicit-synchronisation tokens (dma_fence,
drm_syncobj, drm_newthing, whatever it is): plumbing through
a sync token to synchronise compositor operations against
client operations in both directions is just a matter of
boring typing.
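Just to show how boring that typing is, here's a rough
sketch using libdrm's existing syncobj helpers (the
function names are real xf86drm.h entrypoints; drm_fd,
deadline_ns and all the error handling are elided):

    #include <xf86drm.h>

    /* Client side: wrap GPU-work completion in an FD we
     * can pass over the protocol. */
    uint32_t handle;
    int sync_fd;
    drmSyncobjCreate(drm_fd, 0, &handle);
    /* ... execbuf is told to signal 'handle' ... */
    drmSyncobjHandleToFD(drm_fd, handle, &sync_fd);
    /* sync_fd goes over the wire to the compositor */

    /* Compositor side: import the token and wait before
     * touching the buffer; the timeout is an absolute
     * CLOCK_MONOTONIC deadline in nanoseconds. */
    uint32_t imported;
    drmSyncobjFDToHandle(drm_fd, sync_fd, &imported);
    drmSyncobjWait(drm_fd, &imported, 1, deadline_ns,
                   DRM_SYNCOBJ_WAIT_FLAGS_WAIT_FOR_SUBMIT,
                   NULL);

And the same dance in reverse for the compositor telling
the client when it's safe to reuse a buffer.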
Make that primitive something that is every bit as usable
across subsystems as it is across processes. It should be a
lowest common denominator for middleware that ultimately
provokes GPU execbuf, KMS commit, and media codec ops;
currently that would be both wait and signal for all
of VkSemaphore, EGLSyncKHR, KMS fence, V4L (D)QBUF, and
VA-API {en,de}code ops. It must be exportable to and
importable from an FD, which can be poll()ed on and read().
GPU-side visibility for late binding is nice, but not at all
essential.
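As an existence proof for the wait side, today's sync_file
FDs can already be handed across two of those APIs. A
sketch, assuming sync_fd carries a sync_file, device/sem/
dpy are already set up, and the extension entrypoints have
been resolved via their respective GetProcAddress:

    #include <unistd.h>
    #include <vulkan/vulkan.h>
    #include <EGL/egl.h>
    #include <EGL/eglext.h>

    /* Import the FD as a VkSemaphore payload; sync FDs
     * must use temporary import, and ownership of the FD
     * passes to the driver. */
    VkImportSemaphoreFdInfoKHR import = {
        .sType = VK_STRUCTURE_TYPE_IMPORT_SEMAPHORE_FD_INFO_KHR,
        .semaphore = sem,
        .flags = VK_SEMAPHORE_IMPORT_TEMPORARY_BIT,
        .handleType = VK_EXTERNAL_SEMAPHORE_HANDLE_TYPE_SYNC_FD_BIT,
        .fd = sync_fd,
    };
    vkImportSemaphoreFdKHR(device, &import);

    /* Or the same FD as an EGLSyncKHR for the GPU to
     * wait on. */
    EGLint attribs[] = {
        EGL_SYNC_NATIVE_FENCE_FD_ANDROID, dup(sync_fd),
        EGL_NONE,
    };
    EGLSyncKHR sync = eglCreateSyncKHR(dpy,
        EGL_SYNC_NATIVE_FENCE_ANDROID, attribs);
    eglWaitSyncKHR(dpy, sync, 0);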
Make that primitive complete in 'reasonable' time, no
matter what. There will always be failures in extremis,
whatever the design: absent hard-realtime principles from
hardware all the way up to userspace, something will always
be able to fail somewhere: non-terminating GPU work, an
actual GPU hang/reset, or a DoSed GPU queue, CPU scheduler,
or I/O. As long as the general case is bounded-time
completion, each of these can be mitigated separately,
provided userspace has enough visibility into the
underlying mechanics and cares enough to take meaningful
action on it.
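'Enough visibility' can be as cheap as a bounded wait on
the FD. A compositor-side sketch, where the 16ms budget and
reuse_previous_buffer() are invented policy rather than any
real API:

    #include <poll.h>

    /* Give the client's fence one frame's grace; POLLIN
     * means the underlying fence has signalled. */
    struct pollfd pfd = { .fd = sync_fd, .events = POLLIN };
    int ret = poll(&pfd, 1, 16 /* ms */);
    if (ret == 0) {
        /* Not done in time: leave this client out of the
         * frame and show its previous buffer again, rather
         * than stalling the whole repaint cycle. */
        reuse_previous_buffer(client);
    }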
And something more concrete:
dma_fence.
This already has all of the properties described above.
Kernel-wise, it already devolves to CPU-side signalling
when it crosses device boundaries. We need to support it
roughly forever, since it's been plumbed so far and so
wide. Any primitive which is acceptable for winsys-like
usage, crossing so many device/subsystem/process/security
boundaries, has to meet the same requirements. So why
reinvent something which looks so similar, still requires
the kernel to babysit completion, and provides little to no
benefit for the difference?
It's not usable for complex usecases, as we've
established, but winsys is not that usecase. We can draw a
hard boundary between the two worlds. For example, a client
could submit an infinitely deep CS -> VS/FS/etc job chain
with potentially-infinite completion, with the FS output
being passed to the winsys for composition. Draw the line
post-FS: export a dma_fence against FS completion. But
instead of this being based on monitoring the _fence_ per
se, base it on monitoring the job: if the final job doesn't
retire in reasonable time, signal the fence and signal the
client too (like, SIGKILL, or just tear down the context
and permanently return -EIO, whatever). Maybe for future
hardware that would be the same thing - the kernel setting a
timeout and comparing a read on a particular address against
a particular value - but the 'present fence' proposal seems
like it requires exactly this anyway.
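In driver terms that could be a dumb watchdog armed at
submission of the final job. A hypothetical sketch: my_job,
my_context and my_context_ban() are invented, whilst the
dma_fence and workqueue calls are the real kernel API:

    #include <linux/dma-fence.h>
    #include <linux/workqueue.h>

    struct my_job {
        struct dma_fence *out_fence; /* exported to winsys */
        struct my_context *ctx;
        struct delayed_work timeout_work;
    };

    static void job_timeout_work(struct work_struct *work)
    {
        struct my_job *job = container_of(work,
            struct my_job, timeout_work.work);

        /* The final job didn't retire in time: complete
         * the winsys fence with an error, then ban the
         * context so every later submission gets -EIO. */
        dma_fence_set_error(job->out_fence, -EIO);
        dma_fence_signal(job->out_fence);
        my_context_ban(job->ctx);
    }

    /* At submission of the post-FS job: */
    INIT_DELAYED_WORK(&job->timeout_work, job_timeout_work);
    schedule_delayed_work(&job->timeout_work,
                          msecs_to_jiffies(500));
    /* ... cancel_delayed_work_sync() on normal retire. */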
That to me is the best compromise. We allow clients
completely arbitrary flexibility, but as soon as they
vkQueuePresentKHR, they're crossing a boundary out of happy
fun GPU land and into strange hostile winsys land. We've got
a lot of practice at being the bad guys who hate users and
are always trying to ruin their dreams, so we'll happily
wear the impact of continuing to do that. In doing so, we
collectively don't have to invent a third new
synchronisation primitive (to add to dma_fence and
drm_syncobj) and a third new synchronisation model (implicit
sync, explicit-but-bounded sync,
explicit-and-maybe-unbounded sync) to support this, and we
don't have to do an NT4 where GDI was shoved into the
kernel.
It doesn't help with the goal of ridding dma_fence from
the kernel, but it does very clearly segregate the two
worlds. Drawing that hard boundary would allow drivers to
hyperoptimise for clients which want to be extremely clever
and agile and quick because they're sailing so close to the
wind that they cannot bear the overhead of dma_fence, whilst
also providing the guarantees we need when crossing
isolation boundaries. In the latter case, the overhead of
bouncing into a less-optimised primitive is totally
acceptable because it's not even measurable:
vkQueuePresentKHR requires client CPU activity -> kernel
IPC -> compositor CPU activity -> wait for repaint
cycle -> prepare scene -> composition, against which
dma_fence overhead isn't and will never be measurable (even
if it doesn't cross device/subsystem boundaries, which it
probably does). And the converse for vkAcquireNextImageKHR.
tl;dr: we don't need to move winsys into the kernel,
winsys and compute don't need to share sync primitives, the
client/winsys boundary does need a sync primitive with
strong and onerous guarantees, and that transition can be
several orders of magnitude less efficient than
intra-client sync primitives.
Shoot me down. :)
Cheers,
Daniel