* Plumbing explicit synchronization through the Linux ecosystem
@ 2020-03-11 17:31 ` Jason Ekstrand
  0 siblings, 0 replies; 101+ messages in thread
From: Jason Ekstrand @ 2020-03-11 17:31 UTC (permalink / raw)
  To: ML mesa-dev, Discussion of the development of and with GStreamer,
	wayland-devel @ lists . freedesktop . org, xorg-devel,
	Maling list - DRI developers, linux-media, Dave Airlie,
	Daniel Vetter, Bas Nieuwenhuizen, Daniel Stone

All,

Sorry for casting such a broad net with this one. I'm sure most people
who reply will get at least one mailing list rejection.  However, this
is an issue that affects a LOT of components and that's why it's
thorny to begin with.  Please pardon the length of this e-mail as
well; I promise there's a concrete point/proposal at the end.


Explicit synchronization is the future of graphics and media.  At
least, that seems to be the consensus among all the graphics people
I've talked to.  I had a chat with one of the lead Android graphics
engineers recently who told me that doing explicit sync from the start
was one of the best engineering decisions Android ever made.  It's
also the direction being taken by more modern APIs such as Vulkan.


## What are implicit and explicit synchronization?

For those who aren't familiar with this space: GPUs, media encoders,
etc. are massively parallel, and some form of synchronization is
required to ensure that everything happens in the right order and
data races are avoided.  Implicit synchronization is when bits of work
(3D, compute, video encode, etc.) are implicitly ordered based on the
absolute CPU-time order in which the API calls occur.  Explicit
synchronization is
when the client (whatever that means in any given context) provides
the dependency graph explicitly via some sort of synchronization
primitives.  If you're still confused, consider the following
examples:

With OpenGL and EGL, almost everything is implicit sync.  Say you have
two OpenGL contexts sharing an image where one writes to it and the
other textures from it.  The way the OpenGL spec works, the client has
to make the API calls to render to the image before (in CPU time) it
makes the API calls which texture from the image.  As long as it does
this (and maybe inserts a glFlush?), the driver will ensure that the
rendering completes before the texturing happens and you get correct
contents.
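
To make that concrete, here's a minimal sketch of the two-context case
(assuming two EGL contexts ctx_a and ctx_b sharing a texture
shared_tex, and a framebuffer object fbo_a with that texture attached;
all of these names are made up for illustration):

    /* Context A: render into the shared texture. */
    eglMakeCurrent(dpy, surf_a, surf_a, ctx_a);
    glBindFramebuffer(GL_FRAMEBUFFER, fbo_a);   /* fbo_a has shared_tex attached */
    glClearColor(1.0f, 0.0f, 0.0f, 1.0f);
    glClear(GL_COLOR_BUFFER_BIT);
    glFlush();                                  /* make the work visible to other contexts */

    /* Context B: texture from it.  Because these calls happen later in CPU
     * time, the driver guarantees the clear above completes first. */
    eglMakeCurrent(dpy, surf_b, surf_b, ctx_b);
    glBindTexture(GL_TEXTURE_2D, shared_tex);
    glDrawArrays(GL_TRIANGLES, 0, 3);           /* samples shared_tex */
    eglSwapBuffers(dpy, surf_b);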

Implicit synchronization can also happen across processes.  Wayland,
for instance, is currently built on implicit sync: the client does its
rendering and then does a hand-off (via wl_surface::commit) to tell
the compositor it's done, at which point the compositor can texture
from the surface.  The hand-off ensures that the client's OpenGL API
calls happen before the server's OpenGL API calls.
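
In client code, the hand-off boils down to something like this (a
sketch for a hand-rolled wl_buffer client; a GL or Vulkan client gets
the same effect from eglSwapBuffers()/vkQueuePresentKHR()):

    /* Draw into 'buffer' first (the kernel tracks that work via implicit
     * fences on the underlying dma-buf), then hand it to the compositor. */
    wl_surface_attach(surface, buffer, 0, 0);
    wl_surface_damage(surface, 0, 0, width, height);
    wl_surface_commit(surface);   /* after this, the compositor may texture from 'buffer' */
    wl_display_flush(display);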

A good example of explicit synchronization is the Vulkan API.  There,
a client (or multiple clients) can simultaneously build command
buffers in different threads, where one of those command buffers
renders to an image and the other textures from it, and then submit
both of them at the same time along with instructions telling the
driver what order to execute them in.  The execution order is
described via the VkSemaphore primitive.  With the new
VK_KHR_timeline_semaphore extension, you can even submit the work
which does the texturing BEFORE the work which does the rendering and
the driver will sort it out.
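
As a sketch with a plain (binary) VkSemaphore, assuming render_cmd
writes the image and sample_cmd textures from it (with a timeline
semaphore the same idea works even if the second submit happens
first):

    VkSemaphore sem;   /* created earlier with vkCreateSemaphore() */

    /* Submit the rendering work and signal 'sem' when it finishes. */
    VkSubmitInfo render_submit = {
        .sType = VK_STRUCTURE_TYPE_SUBMIT_INFO,
        .commandBufferCount = 1, .pCommandBuffers = &render_cmd,
        .signalSemaphoreCount = 1, .pSignalSemaphores = &sem,
    };
    vkQueueSubmit(queue, 1, &render_submit, VK_NULL_HANDLE);

    /* Submit the texturing work; the GPU waits on 'sem' before running it,
     * so the two command buffers can be built on any threads in any order. */
    VkPipelineStageFlags wait_stage = VK_PIPELINE_STAGE_FRAGMENT_SHADER_BIT;
    VkSubmitInfo sample_submit = {
        .sType = VK_STRUCTURE_TYPE_SUBMIT_INFO,
        .waitSemaphoreCount = 1, .pWaitSemaphores = &sem,
        .pWaitDstStageMask = &wait_stage,
        .commandBufferCount = 1, .pCommandBuffers = &sample_cmd,
    };
    vkQueueSubmit(queue, 1, &sample_submit, VK_NULL_HANDLE);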

The #1 problem with implicit synchronization (which explicit solves)
is that it leads to a lot of over-synchronization both in client space
and in driver/device space.  The client has to synchronize a lot more
because it has to ensure that the API calls happen in a particular
order.  The driver/device have to synchronize a lot more because they
never know what is going to end up being a synchronization point as an
API call on another thread/process may occur at any time.  As we move
to more and more multi-threaded programming, this synchronization
(especially on the client side) becomes more and more painful.


## Current status in Linux

Implicit synchronization in Linux works via the kernel's internal
dma_buf and dma_fence data structures.  A dma_fence is a tiny object
which represents the "done" status for some bit of work.  Typically,
dma_fences are created as a by-product of someone submitting some bit
of work (say, 3D rendering) to the kernel.  The dma_buf object has a
set of dma_fences on it representing shared (read) and exclusive
(write) access to the object.  When work is submitted which, for
instance, renders to the dma_buf, it's queued waiting on all the fences
on the dma_buf, and a dma_fence is created representing the end of
said rendering work and it's installed as the dma_buf's exclusive
fence.  This way, the kernel can manage all its internal queues (3D
rendering, display, video encode, etc.) and know which things to
submit in what order.
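
Schematically, a driver's submit path does something like the
following (heavily simplified; job_add_dependencies(),
driver_create_job_fence(), and driver_queue_job() are made-up
placeholders for whatever the driver actually does, and real drivers
also deal with shared fences, the GPU scheduler, etc.):

    /* "Submit a job that writes to this dma_buf", in pseudo-kernel-C. */
    dma_resv_lock(dmabuf->resv, NULL);

    /* Make the new job wait on every fence already attached to the buffer... */
    job_add_dependencies(job, dmabuf->resv);

    /* ...create a fence that signals when the new job finishes... */
    struct dma_fence *done = driver_create_job_fence(job);

    /* ...and install it as the buffer's exclusive (write) fence so that
     * later readers and writers wait for us. */
    dma_resv_add_excl_fence(dmabuf->resv, done);

    dma_resv_unlock(dmabuf->resv);
    driver_queue_job(job);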

For the last few years, we've had sync_file in the kernel and it's
plumbed into some drivers.  A sync_file is just a wrapper around a
single dma_fence.  A sync_file is typically created as a by-product of
submitting work (3D, compute, etc.) to the kernel and is signaled when
that work completes.  When a sync_file is created, it is guaranteed by
the kernel that it will become signaled in finite time and, once it's
signaled, it remains signaled for the rest of time.  A sync_file is
represented in UAPIs as a file descriptor and can be used with normal
file APIs such as dup().  It can be passed into another UAPI which
does some bit of queued work and the submitted work will wait for the
sync_file to be triggered before executing.  A sync_file also supports
poll() if you want to wait on it manually.
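
For example, given a sync_file fd handed back by some submission
ioctl, waiting on it from userspace is just a poll() (a minimal
sketch):

    #include <errno.h>
    #include <poll.h>

    /* Block until the work behind 'sync_fd' has completed. */
    static int wait_sync_file(int sync_fd)
    {
        struct pollfd pfd = { .fd = sync_fd, .events = POLLIN };
        int ret;

        do {
            ret = poll(&pfd, 1, -1 /* no timeout */);
        } while (ret == -1 && errno == EINTR);

        return ret == 1 ? 0 : -1;
    }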

Unfortunately, sync_file is not broadly used and not all kernel GPU
drivers support it.  Here's a very quick overview of my understanding
of the status of various components (I don't know the status of
anything in the media world):

 - Vulkan: Explicit synchronization all the way, but we have to go
implicit as soon as we interact with a window system.  Vulkan has APIs
to import/export sync_files to/from its VkSemaphore and VkFence
synchronization primitives.
 - OpenGL: Implicit all the way.  There are some EGL extensions to
enable some forms of explicit sync via sync_file (see the sketch after
this list) but OpenGL itself is still implicit.
 - Wayland: Currently depends on implicit sync in the kernel (accessed
via EGL/OpenGL).  There is an unstable extension to allow passing
sync_files around but it's questionable how useful it is right now
(more on that later).
 - X11: With the Present extension, it has these "explicit" fence
objects, but they're always a shmfence, which lets the X server and
client do a userspace CPU-side hand-off without going over the socket
(and round-tripping through the kernel).  However, the only thing that
fence does is order the OpenGL API calls in the client and server; the
real synchronization is still implicit.
 - linux/i915/gem: Fully supports using sync_file or syncobj for explicit sync.
 - linux/amdgpu: Supports sync_file and syncobj but it still
implicitly syncs sometimes due to its internal memory residency
handling which can lead to over-synchronization.
 - KMS: Implicit sync all the way.  There are no KMS APIs which take
explicit sync primitives.
 - v4l: ???
 - gstreamer: ???
 - Media APIs such as vaapi etc.:  ???
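
As an example of the EGL plumbing mentioned in the OpenGL bullet
above, EGL_ANDROID_native_fence_sync lets you pull a sync_file out of
a GL context; a sketch (in real code the extension entry points come
from eglGetProcAddress()):

    /* Create an EGL sync tied to the GL commands submitted so far... */
    EGLSyncKHR sync = eglCreateSyncKHR(dpy, EGL_SYNC_NATIVE_FENCE_ANDROID, NULL);
    glFlush();   /* the fence isn't guaranteed to exist until the commands are flushed */

    /* ...and extract it as a sync_file fd other APIs or processes can wait on. */
    int sync_fd = eglDupNativeFenceFDANDROID(dpy, sync);
    eglDestroySyncKHR(dpy, sync);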


## Chicken and egg problems

Ok, this is where it starts getting depressing.  I made the claim
above that Wayland has an explicit synchronization protocol that's of
questionable usefulness.  I would claim that basically any bit of
plumbing we do through window systems is currently of questionable
usefulness.  Why?

From my perspective, as a Vulkan driver developer, I have to deal with
the fact that Vulkan is an explicit sync API but Wayland and X11
aren't.  Unfortunately, the Wayland extension solves zero problems for
me because I can't really use it unless it's implemented in all of the
compositors.  Until every Wayland compositor I care about my users
being able to use (which is basically all of them) supports the
extension, I have to continue carrying around my pile of hacks to keep
implicit sync and Vulkan working nicely together.

From the perspective of a Wayland compositor (I used to play in this
space), they'd love to implement the new explicit sync extension but
can't.  Sure, they could wire up the extension, but the moment they go
to flip a client buffer to the screen directly, they discover that KMS
doesn't support any explicit sync APIs.  So, yes, they can technically
implement the extension assuming the EGL stack they're running on has
the sync_file extensions but any client buffers which come in using
the explicit sync Wayland extension have to be composited and can't be
scanned out directly.  As a 3D driver developer, I absolutely don't
want compositors doing that because my users will complain about
performance issues due to the extra blit.

Ok, so let's say we get KMS wired up with explicit sync.  That solves
all our problems, right?  It does, right up until someone decides that
they want to screen capture their Wayland session via some hardware
media encoder that doesn't support explicit sync.  Now we have to
plumb it all the way through the media stack, gstreamer, etc.  Great,
so let's do that!  Oh, but gstreamer won't want to plumb it through
until they're guaranteed that they can use explicit sync when
displaying on X11 or Wayland.  Are you seeing the problem?

To make matters worse, since most things are doing implicit
synchronization today, it's really easy to get your explicit
synchronization wrong and never notice.  If you forget to pass a
sync_file into one place (say you never notice KMS doesn't support
them), it will probably work anyway thanks to all the implicit sync
that's going on elsewhere.

So, clearly, we all need to go write piles of code that we can't
actually properly test until everyone else has written their piece and
then we use explicit sync if and only if all components support it.
Really?  We're going to do multiple years of development and then just
hope it works when we finally flip the switch?  That doesn't sound
like a good plan to me.


## A proposal: Implicit and explicit sync together

How to solve all these chicken-and-egg problems is something I've been
giving quite a bit of thought (and talking with many others about) in
the last couple of years.  One motivation for this is that we have to
deal with the implicit/explicit mismatch in Vulkan's window-system
integration.  Another motivation is that I'm
becoming increasingly unhappy with the way that synchronization,
memory residency, and command submission are inherently intertwined in
i915 and would like to break things apart.  Towards that end, I have
an actual proposal.

A couple of weeks ago, I sent a series of patches to the dri-devel
mailing list which add a pair of new dma-buf ioctls that let userspace
manually export a sync_file from, or import one into, a dma-buf.
The idea is that something like a Wayland compositor can switch to
100% explicit sync internally once the ioctl is available.  If it gets
buffers in from a client that doesn't use the explicit sync extension,
it can pull a sync_file from the dma-buf and use that exactly as it
would a sync_file passed via the explicit sync extension.  When it
goes to scan out a user buffer and discovers that KMS doesn't accept
sync_files (or if it tries to use that pesky media encoder no one has
converted), it can take its sync_file for display and stuff it into
the dma-buf before handing it to KMS.
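
To give a feel for it, usage would look roughly like this (a sketch;
the struct and ioctl names here are illustrative stand-ins, see the
RFC linked below for the actual proposed interface):

    /* Export: pull the buffer's current fences out as a sync_file fd that
     * can be sent over the explicit-sync Wayland extension. */
    struct dma_buf_sync_file args = { .flags = DMA_BUF_SYNC_READ, .fd = -1 };
    ioctl(dmabuf_fd, DMA_BUF_IOCTL_EXPORT_SYNC_FILE, &args);
    int fence_fd = args.fd;

    /* Import: stuff an explicit fence back into the dma-buf so that
     * implicit-sync consumers (KMS, unconverted media encoders, ...)
     * wait on it without knowing anything about sync_files. */
    struct dma_buf_sync_file args2 = { .flags = DMA_BUF_SYNC_WRITE, .fd = fence_fd };
    ioctl(dmabuf_fd, DMA_BUF_IOCTL_IMPORT_SYNC_FILE, &args2);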

Along with the kernel patches, I've also implemented support for this
in the Vulkan WSI code used by ANV and RADV.  With those patches, the
only requirement on the Vulkan drivers is that you be able to export
any VkSemaphore as a sync_file and temporarily import a sync_file into
any VkFence or VkSemaphore.  As long as that works, the core Vulkan
driver only ever sees explicit synchronization via sync_file.  The WSI
code uses these new ioctls to translate the implicit sync of X11 and
Wayland to the explicit sync the Vulkan driver wants.
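
Concretely, that requirement is just the standard
VK_KHR_external_semaphore_fd paths with the SYNC_FD handle type;
roughly (a sketch, error handling omitted):

    /* Export: turn a semaphore the client will signal into a sync_file fd
     * that the WSI code can hand to the window system / dma-buf. */
    const VkSemaphoreGetFdInfoKHR get_info = {
        .sType = VK_STRUCTURE_TYPE_SEMAPHORE_GET_FD_INFO_KHR,
        .semaphore = render_done_sem,
        .handleType = VK_EXTERNAL_SEMAPHORE_HANDLE_TYPE_SYNC_FD_BIT,
    };
    int fd;
    vkGetSemaphoreFdKHR(device, &get_info, &fd);

    /* Import: temporarily stuff a sync_file from the window system into a
     * semaphore the client is about to wait on (TEMPORARY = one-shot). */
    const VkImportSemaphoreFdInfoKHR import_info = {
        .sType = VK_STRUCTURE_TYPE_IMPORT_SEMAPHORE_FD_INFO_KHR,
        .semaphore = acquire_sem,
        .flags = VK_SEMAPHORE_IMPORT_TEMPORARY_BIT,
        .handleType = VK_EXTERNAL_SEMAPHORE_HANDLE_TYPE_SYNC_FD_BIT,
        .fd = sync_fd,
    };
    vkImportSemaphoreFdKHR(device, &import_info);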

I'm hoping (and here's where I want a sanity check) that a simple API
like this will allow us to finally start moving the Linux ecosystem
over to explicit synchronization one piece at a time in a way that's
actually correct.  (No Wayland explicit sync with compositors hoping
KMS magically works even though it doesn't have a sync_file API.)
Once some pieces in the ecosystem start moving, there will be
motivation to start moving others and maybe we can actually build the
momentum to get most everything converted.

For reference, you can find the kernel RFC patches and mesa MR here:

https://lists.freedesktop.org/archives/dri-devel/2020-March/258833.html

https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/4037

At this point, I welcome your thoughts, comments, objections, and
maybe even help/review. :-)

--Jason Ekstrand


* Re: Plumbing explicit synchronization through the Linux ecosystem
  2020-03-11 17:31 ` Jason Ekstrand
@ 2020-03-11 19:21   ` Jason Ekstrand
  -1 siblings, 0 replies; 101+ messages in thread
From: Jason Ekstrand @ 2020-03-11 19:21 UTC (permalink / raw)
  To: ML mesa-dev, Discussion of the development of and with GStreamer,
	wayland-devel @ lists . freedesktop . org, xorg-devel,
	Maling list - DRI developers, linux-media, Dave Airlie,
	Daniel Vetter, Bas Nieuwenhuizen, Daniel Stone

On Wed, Mar 11, 2020 at 12:31 PM Jason Ekstrand <jason@jlekstrand.net> wrote:
>
> All,
>
> Sorry for casting such a broad net with this one. I'm sure most people
> who reply will get at least one mailing list rejection.  However, this
> is an issue that affects a LOT of components and that's why it's
> thorny to begin with.  Please pardon the length of this e-mail as
> well; I promise there's a concrete point/proposal at the end.
>
>
> Explicit synchronization is the future of graphics and media.  At
> least, that seems to be the consensus among all the graphics people
> I've talked to.  I had a chat with one of the lead Android graphics
> engineers recently who told me that doing explicit sync from the start
> was one of the best engineering decisions Android ever made.  It's
> also the direction being taken by more modern APIs such as Vulkan.
>
>
> ## What are implicit and explicit synchronization?
>
> For those that aren't familiar with this space, GPUs, media encoders,
> etc. are massively parallel and synchronization of some form is
> required to ensure that everything happens in the right order and
> avoid data races.  Implicit synchronization is when bits of work (3D,
> compute, video encode, etc.) are implicitly based on the absolute
> CPU-time order in which API calls occur.  Explicit synchronization is
> when the client (whatever that means in any given context) provides
> the dependency graph explicitly via some sort of synchronization
> primitives.  If you're still confused, consider the following
> examples:
>
> With OpenGL and EGL, almost everything is implicit sync.  Say you have
> two OpenGL contexts sharing an image where one writes to it and the
> other textures from it.  The way the OpenGL spec works, the client has
> to make the API calls to render to the image before (in CPU time) it
> makes the API calls which texture from the image.  As long as it does
> this (and maybe inserts a glFlush?), the driver will ensure that the
> rendering completes before the texturing happens and you get correct
> contents.
>
> Implicit synchronization can also happen across processes.  Wayland,
> for instance, is currently built on implicit sync where the client
> does their rendering and then does a hand-off (via wl_surface::commit)
> to tell the compositor it's done at which point the compositor can now
> texture from the surface.  The hand-off ensures that the client's
> OpenGL API calls happen before the server's OpenGL API calls.
>
> A good example of explicit synchronization is the Vulkan API.  There,
> a client (or multiple clients) can simultaneously build command
> buffers in different threads where one of those command buffers
> renders to an image and the other textures from it and then submit
> both of them at the same time with instructions to the driver for
> which order to execute them in.  The execution order is described via
> the VkSemaphore primitive.  With the new VK_KHR_timeline_semaphore
> extension, you can even submit the work which does the texturing
> BEFORE the work which does the rendering and the driver will sort it
> out.
>
> The #1 problem with implicit synchronization (which explicit solves)
> is that it leads to a lot of over-synchronization both in client space
> and in driver/device space.  The client has to synchronize a lot more
> because it has to ensure that the API calls happen in a particular
> order.  The driver/device have to synchronize a lot more because they
> never know what is going to end up being a synchronization point as an
> API call on another thread/process may occur at any time.  As we move
> to more and more multi-threaded programming this synchronization (on
> the client-side especially) becomes more and more painful.
>
>
> ## Current status in Linux
>
> Implicit synchronization in Linux works via the kernel's internal
> dma_buf and dma_fence data structures.  A dma_fence is a tiny object
> which represents the "done" status for some bit of work.  Typically,
> dma_fences are created as a by-product of someone submitting some bit
> of work (say, 3D rendering) to the kernel.  The dma_buf object has a
> set of dma_fences on it representing shared (read) and exclusive
> (write) access to the object.  When work is submitted which, for
> instance, renders to the dma_buf, it's queued waiting on all the fences
> on the dma_buf, and a dma_fence is created representing the end of
> said rendering work and it's installed as the dma_buf's exclusive
> fence.  This way, the kernel can manage all its internal queues (3D
> rendering, display, video encode, etc.) and know which things to
> submit in what order.
>
> For the last few years, we've had sync_file in the kernel and it's
> plumbed into some drivers.  A sync_file is just a wrapper around a
> single dma_fence.  A sync_file is typically created as a by-product of
> submitting work (3D, compute, etc.) to the kernel and is signaled when
> that work completes.  When a sync_file is created, it is guaranteed by
> the kernel that it will become signaled in finite time and, once it's
> signaled, it remains signaled for the rest of time.  A sync_file is
> represented in UAPIs as a file descriptor and can be used with normal
> file APIs such as dup().  It can be passed into another UAPI which
> does some bit of queued work and the submitted work will wait for the
> sync_file to be triggered before executing.  A sync_file also supports
> poll() if you want to wait on it manually.
>
> Unfortunately, sync_file is not broadly used and not all kernel GPU
> drivers support it.  Here's a very quick overview of my understanding
> of the status of various components (I don't know the status of
> anything in the media world):
>
>  - Vulkan: Explicit synchronization all the way but we have to go
> implicit as soon as we interact with a window-system.  Vulkan has APIs
> to import/export sync_files to/from its VkSemaphore and VkFence
> synchronization primitives.
>  - OpenGL: Implicit all the way.  There are some EGL extensions to
> enable some forms of explicit sync via sync_file but OpenGL itself is
> still implicit.
>  - Wayland: Currently depends on implicit sync in the kernel (accessed
> via EGL/OpenGL).  There is an unstable extension to allow passing
> sync_files around but it's questionable how useful it is right now
> (more on that later).
>  - X11: With present, it has these "explicit" fence objects but
> they're always a shmfence which lets the X server and client do a
> userspace CPU-side hand-off without going over the socket (and
> round-tripping through the kernel).  However, the only thing that
> fence does is order the OpenGL API calls in the client and server and
> the real synchronization is still implicit.
>  - linux/i915/gem: Fully supports using sync_file or syncobj for explicit sync.
>  - linux/amdgpu: Supports sync_file and syncobj but it still
> implicitly syncs sometimes due to its internal memory residency
> handling which can lead to over-synchronization.
>  - KMS: Implicit sync all the way.  There are no KMS APIs which take
> explicit sync primitives.

Correction:  Apparently, I missed some things.  If you use the atomic
API, KMS does have explicit in- and out-fences.  Non-atomic users
(e.g. X11) are still in trouble, but most Wayland compositors use
atomic these days.
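
For the record, with atomic that looks something like this (a sketch;
plane_id/crtc_id and the property IDs for "IN_FENCE_FD" and
"OUT_FENCE_PTR" are assumed to have been looked up via
drmModeObjectGetProperties()):

    /* Make scanout of the new buffer wait for our rendering... */
    drmModeAtomicReq *req = drmModeAtomicAlloc();
    drmModeAtomicAddProperty(req, plane_id, in_fence_fd_prop, render_sync_fd);

    /* ...and get a fence back that signals once the frame is actually on screen. */
    int out_fence = -1;
    drmModeAtomicAddProperty(req, crtc_id, out_fence_ptr_prop,
                             (uint64_t)(uintptr_t)&out_fence);

    drmModeAtomicCommit(drm_fd, req, DRM_MODE_ATOMIC_NONBLOCK, NULL);
    drmModeAtomicFree(req);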

>  - v4l: ???
>  - gstreamer: ???
>  - Media APIs such as vaapi etc.:  ???
>
>
> ## Chicken and egg problems
>
> Ok, this is where it starts getting depressing.  I made the claim
> above that Wayland has an explicit synchronization protocol that's of
> questionable usefulness.  I would claim that basically any bit of
> plumbing we do through window systems is currently of questionable
> usefulness.  Why?
>
> From my perspective, as a Vulkan driver developer, I have to deal with
> the fact that Vulkan is an explicit sync API but Wayland and X11
> aren't.  Unfortunately, the Wayland extension solves zero problems for
> me because I can't really use it unless it's implemented in all of the
> compositors.  Until every Wayland compositor I care about my users
> being able to use (which is basically all of them) supports the
> extension, I have to continue carrying around my pile of hacks to keep
> implicit sync and Vulkan working nicely together.
>
> From the perspective of a Wayland compositor (I used to play in this
> space), they'd love to implement the new explicit sync extension but
> can't.  Sure, they could wire up the extension, but the moment they go
> to flip a client buffer to the screen directly, they discover that KMS
> doesn't support any explicit sync APIs.

As per the above correction, Wayland compositors aren't nearly as bad
off as I initially thought.  There may still be weird screen capture
cases but the normal cases of compositing and displaying via
KMS/atomic should be in reasonably good shape.

> So, yes, they can technically
> implement the extension assuming the EGL stack they're running on has
> the sync_file extensions but any client buffers which come in using
> the explicit sync Wayland extension have to be composited and can't be
> scanned out directly.  As a 3D driver developer, I absolutely don't
> want compositors doing that because my users will complain about
> performance issues due to the extra blit.
>
> Ok, so let's say we get KMS wired up with explicit sync.  That solves
> all our problems, right?  It does, right up until someone decides that
> they want to screen capture their Wayland session via some hardware
> media encoder that doesn't support explicit sync.  Now we have to
> plumb it all the way through the media stack, gstreamer, etc.  Great,
> so let's do that!  Oh, but gstreamer won't want to plumb it through
> until they're guaranteed that they can use explicit sync when
> displaying on X11 or Wayland.  Are you seeing the problem?
>
> To make matters worse, since most things are doing implicit
> synchronization today, it's really easy to get your explicit
> synchronization wrong and never notice.  If you forget to pass a
> sync_file into one place (say you never notice KMS doesn't support
> them), it will probably work anyway thanks to all the implicit sync
> that's going on elsewhere.
>
> So, clearly, we all need to go write piles of code that we can't
> actually properly test until everyone else has written their piece and
> then we use explicit sync if and only if all components support it.
> Really?  We're going to do multiple years of development and then just
> hope it works when we finally flip the switch?  That doesn't sound
> like a good plan to me.
>
>
> ## A proposal: Implicit and explicit sync together
>
> How to solve all these chicken-and-egg problems is something I've been
> giving quite a bit of thought (and talking with many others about) in
> the last couple of years.  One motivation for this is that we have to
> deal with a mismatch in Vulkan.  Another motivation is that I'm
> becoming increasingly unhappy with the way that synchronization,
> memory residency, and command submission are inherently intertwined in
> i915 and would like to break things apart.  Towards that end, I have
> an actual proposal.
>
> A couple weeks ago, I sent a series of patches to the dri-devel
> mailing list which adds a pair of new ioctls to dma-buf which allow
> userspace to manually import or export a sync_file from a dma-buf.
> The idea is that something like a Wayland compositor can switch to
> 100% explicit sync internally once the ioctl is available.  If it gets
> buffers in from a client that doesn't use the explicit sync extension,
> it can pull a sync_file from the dma-buf and use that exactly as it
> would a sync_file passed via the explicit sync extension.  When it
> goes to scan out a user buffer and discovers that KMS doesn't accept
> sync_files (or if it tries to use that pesky media encoder no one has
> converted), it can take its sync_file for display and stuff it into
> the dma-buf before handing it to KMS.
>
> Along with the kernel patches, I've also implemented support for this
> in the Vulkan WSI code used by ANV and RADV.  With those patches, the
> only requirement on the Vulkan drivers is that you be able to export
> any VkSemaphore as a sync_file and temporarily import a sync_file into
> any VkFence or VkSemaphore.  As long as that works, the core Vulkan
> driver only ever sees explicit synchronization via sync_file.  The WSI
> code uses these new ioctls to translate the implicit sync of X11 and
> Wayland to the explicit sync the Vulkan driver wants.
>
> I'm hoping (and here's where I want a sanity check) that a simple API
> like this will allow us to finally start moving the Linux ecosystem
> over to explicit synchronization one piece at a time in a way that's
> actually correct.  (No Wayland explicit sync with compositors hoping
> KMS magically works even though it doesn't have a sync_file API.)
> Once some pieces in the ecosystem start moving, there will be
> motivation to start moving others and maybe we can actually build the
> momentum to get most everything converted.
>
> For reference, you can find the kernel RFC patches and mesa MR here:
>
> https://lists.freedesktop.org/archives/dri-devel/2020-March/258833.html
>
> https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/4037
>
> At this point, I welcome your thoughts, comments, objections, and
> maybe even help/review. :-)
>
> --Jason Ekstrand


* Re: Plumbing explicit synchronization through the Linux ecosystem
  2020-03-11 19:21   ` Jason Ekstrand
@ 2020-03-11 20:18     ` Nicolas Dufresne
  -1 siblings, 0 replies; 101+ messages in thread
From: Nicolas Dufresne @ 2020-03-11 20:18 UTC (permalink / raw)
  To: Jason Ekstrand, ML mesa-dev,
	Discussion of the development of and with GStreamer,
	wayland-devel @ lists . freedesktop . org, xorg-devel,
	Maling list - DRI developers, linux-media, Dave Airlie,
	Daniel Vetter, Bas Nieuwenhuizen, Daniel Stone

(I know I'm going to be spammed by so many mailing lists ...)

On Wednesday, March 11, 2020 at 14:21 -0500, Jason Ekstrand wrote:
> On Wed, Mar 11, 2020 at 12:31 PM Jason Ekstrand <jason@jlekstrand.net> wrote:
> > All,
> > 
> > Sorry for casting such a broad net with this one. I'm sure most people
> > who reply will get at least one mailing list rejection.  However, this
> > is an issue that affects a LOT of components and that's why it's
> > thorny to begin with.  Please pardon the length of this e-mail as
> > well; I promise there's a concrete point/proposal at the end.
> > 
> > 
> > Explicit synchronization is the future of graphics and media.  At
> > least, that seems to be the consensus among all the graphics people
> > I've talked to.  I had a chat with one of the lead Android graphics
> > engineers recently who told me that doing explicit sync from the start
> > was one of the best engineering decisions Android ever made.  It's
> > also the direction being taken by more modern APIs such as Vulkan.
> > 
> > 
> > ## What are implicit and explicit synchronization?
> > 
> > For those that aren't familiar with this space, GPUs, media encoders,
> > etc. are massively parallel and synchronization of some form is
> > required to ensure that everything happens in the right order and
> > avoid data races.  Implicit synchronization is when bits of work (3D,
> > compute, video encode, etc.) are implicitly based on the absolute
> > CPU-time order in which API calls occur.  Explicit synchronization is
> > when the client (whatever that means in any given context) provides
> > the dependency graph explicitly via some sort of synchronization
> > primitives.  If you're still confused, consider the following
> > examples:
> > 
> > With OpenGL and EGL, almost everything is implicit sync.  Say you have
> > two OpenGL contexts sharing an image where one writes to it and the
> > other textures from it.  The way the OpenGL spec works, the client has
> > to make the API calls to render to the image before (in CPU time) it
> > makes the API calls which texture from the image.  As long as it does
> > this (and maybe inserts a glFlush?), the driver will ensure that the
> > rendering completes before the texturing happens and you get correct
> > contents.
> > 
> > Implicit synchronization can also happen across processes.  Wayland,
> > for instance, is currently built on implicit sync where the client
> > does their rendering and then does a hand-off (via wl_surface::commit)
> > to tell the compositor it's done at which point the compositor can now
> > texture from the surface.  The hand-off ensures that the client's
> > OpenGL API calls happen before the server's OpenGL API calls.
> > 
> > A good example of explicit synchronization is the Vulkan API.  There,
> > a client (or multiple clients) can simultaneously build command
> > buffers in different threads where one of those command buffers
> > renders to an image and the other textures from it and then submit
> > both of them at the same time with instructions to the driver for
> > which order to execute them in.  The execution order is described via
> > the VkSemaphore primitive.  With the new VK_KHR_timeline_semaphore
> > extension, you can even submit the work which does the texturing
> > BEFORE the work which does the rendering and the driver will sort it
> > out.
> > 
> > The #1 problem with implicit synchronization (which explicit solves)
> > is that it leads to a lot of over-synchronization both in client space
> > and in driver/device space.  The client has to synchronize a lot more
> > because it has to ensure that the API calls happen in a particular
> > order.  The driver/device have to synchronize a lot more because they
> > never know what is going to end up being a synchronization point as an
> > API call on another thread/process may occur at any time.  As we move
> > to more and more multi-threaded programming this synchronization (on
> > the client-side especially) becomes more and more painful.
> > 
> > 
> > ## Current status in Linux
> > 
> > Implicit synchronization in Linux works via the kernel's internal
> > dma_buf and dma_fence data structures.  A dma_fence is a tiny object
> > which represents the "done" status for some bit of work.  Typically,
> > dma_fences are created as a by-product of someone submitting some bit
> > of work (say, 3D rendering) to the kernel.  The dma_buf object has a
> > set of dma_fences on it representing shared (read) and exclusive
> > (write) access to the object.  When work is submitted which, for
> > instance, renders to the dma_buf, it's queued waiting on all the fences
> > on the dma_buf, and a dma_fence is created representing the end of
> > said rendering work and it's installed as the dma_buf's exclusive
> > fence.  This way, the kernel can manage all its internal queues (3D
> > rendering, display, video encode, etc.) and know which things to
> > submit in what order.
> > 
> > For the last few years, we've had sync_file in the kernel and it's
> > plumbed into some drivers.  A sync_file is just a wrapper around a
> > single dma_fence.  A sync_file is typically created as a by-product of
> > submitting work (3D, compute, etc.) to the kernel and is signaled when
> > that work completes.  When a sync_file is created, it is guaranteed by
> > the kernel that it will become signaled in finite time and, once it's
> > signaled, it remains signaled for the rest of time.  A sync_file is
> > represented in UAPIs as a file descriptor and can be used with normal
> > file APIs such as dup().  It can be passed into another UAPI which
> > does some bit of queue'd work and the submitted work will wait for the
> > sync_file to be triggered before executing.  A sync_file also supports
> > poll() if  you want to wait on it manually.
> > 
> > Unfortunately, sync_file is not broadly used and not all kernel GPU
> > drivers support it.  Here's a very quick overview of my understanding
> > of the status of various components (I don't know the status of
> > anything in the media world):
> > 
> >  - Vulkan: Explicit synchronization all the way but we have to go
> > implicit as soon as we interact with a window-system.  Vulkan has APIs
> > to import/export sync_files to/from its VkSemaphore and VkFence
> > synchronization primitives.
> >  - OpenGL: Implicit all the way.  There are some EGL extensions to
> > enable some forms of explicit sync via sync_file but OpenGL itself is
> > still implicit.
> >  - Wayland: Currently depends on implicit sync in the kernel (accessed
> > via EGL/OpenGL).  There is an unstable extension to allow passing
> > sync_files around but it's questionable how useful it is right now
> > (more on that later).
> >  - X11: With present, it has these "explicit" fence objects but
> > they're always a shmfence which lets the X server and client do a
> > userspace CPU-side hand-off without going over the socket (and
> > round-tripping through the kernel).  However, the only thing that
> > fence does is order the OpenGL API calls in the client and server and
> > the real synchronization is still implicit.
> >  - linux/i915/gem: Fully supports using sync_file or syncobj for explicit
> > sync.
> >  - linux/amdgpu: Supports sync_file and syncobj but it still
> > implicitly syncs sometimes due to its internal memory residency
> > handling which can lead to over-synchronization.
> >  - KMS: Implicit sync all the way.  There are no KMS APIs which take
> > explicit sync primitives.
> 
> Correction:  Apparently, I missed some things.  If you use atomic, KMS
> does have explicit in- and out-fences.  Non-atomic users (e.g. X11)
> are still in trouble but most Wayland compositors use atomic these
> days
> 
> >  - v4l: ???
> >  - gstreamer: ???
> >  - Media APIs such as vaapi etc.:  ???

GStreamer is a consumer for V4L2, VAAPI and other stuff. Using asynchronous buffer
synchronisation is something we do already with GL (even if limited). We place a
GLSync object in the pipeline and attach it to the related GstBuffer. We wait on
these GLSync objects as late as possible (or supersede the sync if we queue more
work into the same GL context). That requires a special mode of operation, of
course. We don't usually like making lazy blocking calls implicit, as that tends
to cause random issues. If we need to wait, we think it's better to wait in the
module that is responsible, so in general we try to negotiate and fall back
locally (it's plugin-based, so this can get really messy otherwise).
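
For the curious, the late-wait idea boils down to plain GL sync objects; here is
a minimal sketch (the GstGLSyncMeta helpers in gst-plugins-base wrap roughly this
pattern, if memory serves; the producer/consumer function names here are
illustrative, not actual element code):

    #include <GLES3/gl3.h>

    /* Producer: after submitting the rendering that fills the shared
     * texture, create a fence and hand it along with the buffer (GStreamer
     * attaches it to the GstBuffer). */
    GLsync producer_finish(void)
    {
        GLsync sync = glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0);
        glFlush(); /* make sure the fence actually reaches the GPU */
        return sync;
    }

    /* Consumer: defer the wait until the texture is really sampled, and
     * wait on the GPU (server-side), not the CPU, so the pipeline stays
     * asynchronous. */
    void consumer_use(GLsync sync)
    {
        glWaitSync(sync, 0, GL_TIMEOUT_IGNORED);
        /* ... draw calls that texture from the shared image ... */
        glDeleteSync(sync);
    }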

So basically this problem needs to be solved in V4L2, VAAPI and the other
lower-level APIs first. We need an API that provides us these fences (in or out),
and then we can consider using them. For V4L2, there was an attempt, but it was a
bit of a misfit. Your proposal could work (it would need to be tested, I guess),
but it does not solve some of the other issues that were discussed. Notably, for
camera capture, the HW timestamp is captured at about the same time the frame is
ready, but the timestamp is not part of the payload, so you need an entire API to
asynchronously deliver that metadata. That's the biggest pain point I've found:
such an API would be quite invasive or, if made really generic, might just never
be adopted widely enough.

There are other elements that would implement fencing, notably kmssink, but no
one has actually dared porting it to atomic KMS, so clearly there is very little
community interest. glimagesink could clearly benefit. Right now, if we import a
DMABuf and that DMABuf is used for rendering, an implicit fence is attached which
we are unaware of. Philipp Zabel is working on a patch so that V4L2 QBUF would
wait, but waiting in QBUF is not allowed if O_NONBLOCK was set (which GStreamer
uses), so the operation will just fail where it worked before (breaking
userspace). If it were an explicit fence, we could handle that cleanly in
GStreamer, as we do for new APIs.
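
To illustrate what "handle that cleanly" could look like: if the buffer carried
an explicit fence fd (a sync_file supports poll()), the element could wait itself
instead of being blocked inside QBUF. Rough sketch; where the fence fd comes from
is hypothetical here:

    #include <errno.h>
    #include <poll.h>
    #include <sys/ioctl.h>
    #include <linux/videodev2.h>

    /* 'fence_fd' is a sync_file obtained from whatever explicit-sync API
     * ends up existing.  Wait on it ourselves so O_NONBLOCK on the V4L2
     * fd keeps working as before. */
    int queue_when_ready(int v4l2_fd, struct v4l2_buffer *buf, int fence_fd)
    {
        struct pollfd pfd = { .fd = fence_fd, .events = POLLIN };
        int ret;

        do {            /* a real element would hook this into its own loop */
            ret = poll(&pfd, 1, -1);
        } while (ret < 0 && errno == EINTR);
        if (ret < 0)
            return -errno;

        return ioctl(v4l2_fd, VIDIOC_QBUF, buf) < 0 ? -errno : 0;
    }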

> > 
> > 
> > ## Chicken and egg problems
> > 
> > Ok, this is where it starts getting depressing.  I made the claim
> > above that Wayland has an explicit synchronization protocol that's of
> > questionable usefulness.  I would claim that basically any bit of
> > plumbing we do through window systems is currently of questionable
> > usefulness.  Why?
> > 
> > From my perspective, as a Vulkan driver developer, I have to deal with
> > the fact that Vulkan is an explicit sync API but Wayland and X11
> > aren't.  Unfortunately, the Wayland extension solves zero problems for
> > me because I can't really use it unless it's implemented in all of the
> > compositors.  Until every Wayland compositor I care about my users
> > being able to use (which is basically all of them) supports the
> > extension, I have to continue to carry around my pile of hacks to keep
> > implicit sync and Vulkan working nicely together.
> > 
> > From the perspective of a Wayland compositor (I used to play in this
> > space), they'd love to implement the new explicit sync extension but
> > can't.  Sure, they could wire up the extension, but the moment they go
> > to flip a client buffer to the screen directly, they discover that KMS
> > doesn't support any explicit sync APIs.
> 
> As per the above correction, Wayland compositors aren't nearly as bad
> off as I initially thought.  There may still be weird screen capture
> cases but the normal cases of compositing and displaying via
> KMS/atomic should be in reasonably good shape.
> 
> > So, yes, they can technically
> > implement the extension assuming the EGL stack they're running on has
> > the sync_file extensions but any client buffers which come in using
> > the explicit sync Wayland extension have to be composited and can't be
> > scanned out directly.  As a 3D driver developer, I absolutely don't
> > want compositors doing that because my users will complain about
> > performance issues due to the extra blit.
> > 
> > Ok, so let's say we get KMS wired up with explicit sync.  That solves
> > all our problems, right?  It does, right up until someone decides that
> > they want to screen capture their Wayland session via some hardware
> > media encoder that doesn't support explicit sync.  Now we have to
> > plumb it all the way through the media stack, gstreamer, etc.  Great,
> > so let's do that!  Oh, but gstreamer won't want to plumb it through
> > until they're guaranteed that they can use explicit sync when
> > displaying on X11 or Wayland.  Are you seeing the problem?
> > 
> > To make matters worse, since most things are doing implicit
> > synchronization today, it's really easy to get your explicit
> > synchronization wrong and never notice.  If you forget to pass a
> > sync_file into one place (say you never notice KMS doesn't support
> > them), it will probably work anyway thanks to all the implicit sync
> > that's going on elsewhere.
> > 
> > So, clearly, we all need to go write piles of code that we can't
> > actually properly test until everyone else has written their piece and
> > then we use explicit sync if and only if all components support it.
> > Really?  We're going to do multiple years of development and then just
> > hope it works when we finally flip the switch?  That doesn't sound
> > like a good plan to me.
> > 
> > 
> > ## A proposal: Implicit and explicit sync together
> > 
> > How to solve all these chicken-and-egg problems is something I've been
> > giving quite a bit of thought (and talking with many others about) in
> > the last couple of years.  One motivation for this is that we have to
> > deal with a mismatch in Vulkan.  Another motivation is that I'm
> > becoming increasingly unhappy with the way that synchronization,
> > memory residency, and command submission are inherently intertwined in
> > i915 and would like to break things apart.  Towards that end, I have
> > an actual proposal.
> > 
> > A couple weeks ago, I sent a series of patches to the dri-devel
> > mailing list which adds a pair of new ioctls to dma-buf which allow
> > userspace to manually import or export a sync_file from a dma-buf.
> > The idea is that something like a Wayland compositor can switch to
> > 100% explicit sync internally once the ioctl is available.  If it gets
> > buffers in from a client that doesn't use the explicit sync extension,
> > it can pull a sync_file from the dma-buf and use that exactly as it
> > would a sync_file passed via the explicit sync extension.  When it
> > goes to scan out a user buffer and discovers that KMS doesn't accept
> > sync_files (or if it tries to use that pesky media encoder no one has
> > converted), it can take its sync_file for display and stuff it into
> > the dma-buf before handing it to KMS.
> > 
> > Along with the kernel patches, I've also implemented support for this
> > in the Vulkan WSI code used by ANV and RADV.  With those patches, the
> > only requirement on the Vulkan drivers is that you be able to export
> > any VkSemaphore as a sync_file and temporarily import a sync_file into
> > any VkFence or VkSemaphore.  As long as that works, the core Vulkan
> > driver only ever sees explicit synchronization via sync_file.  The WSI
> > code uses these new ioctls to translate the implicit sync of X11 and
> > Wayland to the explicit sync the Vulkan driver wants.
> > 
> > I'm hoping (and here's where I want a sanity check) that a simple API
> > like this will allow us to finally start moving the Linux ecosystem
> > over to explicit synchronization one piece at a time in a way that's
> > actually correct.  (No Wayland explicit sync with compositors hoping
> > KMS magically works even though it doesn't have a sync_file API.)
> > Once some pieces in the ecosystem start moving, there will be
> > motivation to start moving others and maybe we can actually build the
> > momentum to get most everything converted.
> > 
> > For reference, you can find the kernel RFC patches and mesa MR here:
> > 
> > https://lists.freedesktop.org/archives/dri-devel/2020-March/258833.html
> > 
> > https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/4037
> > 
> > At this point, I welcome your thoughts, comments, objections, and
> > maybe even help/review. :-)
> > 
> > --Jason Ekstrand


^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: Plumbing explicit synchronization through the Linux ecosystem
  2020-03-11 17:31 ` Jason Ekstrand
@ 2020-03-11 23:02   ` Adam Jackson
  -1 siblings, 0 replies; 101+ messages in thread
From: Adam Jackson @ 2020-03-11 23:02 UTC (permalink / raw)
  To: Jason Ekstrand, ML mesa-dev,
	Discussion of the development of and with GStreamer,
	wayland-devel @ lists . freedesktop . org, xorg-devel,
	Maling list - DRI developers, linux-media, Dave Airlie,
	Daniel Vetter, Bas Nieuwenhuizen, Daniel Stone

On Wed, 2020-03-11 at 12:31 -0500, Jason Ekstrand wrote:

>  - X11: With present, it has these "explicit" fence objects but
> they're always a shmfence which lets the X server and client do a
> userspace CPU-side hand-off without going over the socket (and
> round-tripping through the kernel).  However, the only thing that
> fence does is order the OpenGL API calls in the client and server and
> the real synchronization is still implicit.

I'm pretty sure "the only thing that fence does" is an implementation
detail. PresentPixmap blocks until the wait-fence signals, but when and
how it signals are properties of the fence itself. You could have drm
give the client back a fence fd, pass that to xserver to create a fence
object, and name that in the PresentPixmap request, and then drm can do
whatever it wants to signal the fence.
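
Something along these lines, I assume (sketch only; error handling and the full
PresentPixmap argument list are elided, and where the fence fd comes from is
left entirely to the driver):

    #include <stdint.h>
    #include <xcb/xcb.h>
    #include <xcb/dri3.h>

    /* 'fence_fd' is whatever fd the DRM driver handed back for this
     * submission (today it would be a futex-backed shmfence; the point of
     * the suggestion is that it need not be).  xcb takes ownership of
     * the fd. */
    void present_with_wait_fence(xcb_connection_t *c, xcb_window_t win,
                                 xcb_pixmap_t pixmap, int fence_fd)
    {
        uint32_t fence = xcb_generate_id(c);

        /* Wrap the fd in an X11 Sync fence object on the server side. */
        xcb_dri3_fence_from_fd(c, win, fence, 0 /* not yet triggered */,
                               fence_fd);

        /* Then name 'fence' as the wait_fence in the xcb_present_pixmap()
         * request for 'pixmap'; the server must not present until the
         * fence signals.  (Remaining arguments elided.) */
        (void)pixmap;
    }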

> From my perspective, as a Vulkan driver developer, I have to deal with
> the fact that Vulkan is an explicit sync API but Wayland and X11
> aren't.

I'm quite sure we can give you an explicit-sync X11 API. I think you
may already have one.

- ajax


^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: Plumbing explicit synchronization through the Linux ecosystem
  2020-03-11 23:02   ` Adam Jackson
@ 2020-03-12 15:46     ` Jason Ekstrand
  -1 siblings, 0 replies; 101+ messages in thread
From: Jason Ekstrand @ 2020-03-12 15:46 UTC (permalink / raw)
  To: Adam Jackson
  Cc: ML mesa-dev, Discussion of the development of and with GStreamer,
	wayland-devel @ lists . freedesktop . org, xorg-devel,
	Maling list - DRI developers, linux-media, Dave Airlie,
	Daniel Vetter, Bas Nieuwenhuizen, Daniel Stone

It seems I may not have set the tone I intended with this e-mail... My
intention was never to stomp on anyone's favorite window system (Adam
isn't the only one who's seemed a bit miffed).  My intention was to
try and solve some very real problems that we have with Vulkan and I
had the hope that a solution there could be helpful for others.

The problem we have in Vulkan is that we have an inherently explicit
sync graphics API and we're trying to strap it onto some inherently
implicit sync window systems and kernel interfaces.  Our mechanisms
for doing so have evolved over the course of the last 4-5 years and
it's way better now than it was when we started but it's still pretty
bad and very invasive to the driver.  My objective is to completely
remove the concept of implicit sync from the Vulkan driver eventually.
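
For reference, the explicit side of that translation leans on
VK_KHR_external_semaphore_fd with the SYNC_FD handle type; a minimal sketch
(the helper names are mine, not the driver's):

    #include <vulkan/vulkan.h>

    /* Export the payload of a VkSemaphore as a sync_file fd. */
    int export_semaphore_sync_file(VkDevice dev, VkSemaphore sem,
                                   PFN_vkGetSemaphoreFdKHR get_fd)
    {
        const VkSemaphoreGetFdInfoKHR info = {
            .sType = VK_STRUCTURE_TYPE_SEMAPHORE_GET_FD_INFO_KHR,
            .semaphore = sem,
            .handleType = VK_EXTERNAL_SEMAPHORE_HANDLE_TYPE_SYNC_FD_BIT,
        };
        int fd = -1;
        return get_fd(dev, &info, &fd) == VK_SUCCESS ? fd : -1;
    }

    /* Temporarily import a sync_file so the next wait on the semaphore
     * waits for that fence (the fd is consumed on success). */
    VkResult import_semaphore_sync_file(VkDevice dev, VkSemaphore sem, int fd,
                                        PFN_vkImportSemaphoreFdKHR import_fd)
    {
        const VkImportSemaphoreFdInfoKHR info = {
            .sType = VK_STRUCTURE_TYPE_IMPORT_SEMAPHORE_FD_INFO_KHR,
            .semaphore = sem,
            .flags = VK_SEMAPHORE_IMPORT_TEMPORARY_BIT,
            .handleType = VK_EXTERNAL_SEMAPHORE_HANDLE_TYPE_SYNC_FD_BIT,
            .fd = fd,
        };
        return import_fd(dev, &info);
    }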

Also (and this is going further down the rabbit hole), I would like to
begin cleaning up our i915 UAPI to better separate memory residency
handling, command submission, and synchronization.  Eventually (and
this may sound crazy to some), I'd like to get to the point where i915
doesn't own any of the synchronization primitives except what it needs
to handle memory management internally.  Linux graphics UAPI is about
10 years behind Windows in terms of design (roughly equivalent to
Win7) and I think it's costing us in terms of latency and CPU
overhead.  Some of that may just be implementation problems in i915;
some of it may be core API design.  It's a bit unclear.

Why am I bringing up kernel APIs?  Because one of the biggest problems
in evolving things is the fact that our kernel APIs are tied to
implicit sync on dma-buf.  We can't detangle that until we can remove
implicit dma-buf signaling from the command execution APIs.  This
means that we either need to get rid of ALL implicit synchronization
from window-system APIs far enough back in time that we don't run the
risk of "breaking userspace" or else we need a plan which lets the
kernel driver not support implicit sync but make implicit sync work
anyway.  What I'm proposing with dma-buf sync_file import/export is
one such plan.
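
In userspace terms the proposal is roughly this shape (ioctl and struct names as
in the RFC linked earlier; treat the exact UAPI as subject to review):

    #include <sys/ioctl.h>
    #include <linux/dma-buf.h>

    /* Pull the fences currently attached to a dma-buf out as a sync_file,
     * e.g. to treat a buffer from an implicit-sync client explicitly. */
    int dmabuf_export_fence(int dmabuf_fd)
    {
        struct dma_buf_export_sync_file arg = { .flags = DMA_BUF_SYNC_RW };

        if (ioctl(dmabuf_fd, DMA_BUF_IOCTL_EXPORT_SYNC_FILE, &arg) < 0)
            return -1;
        return arg.fd;  /* a sync_file fd; wait on it or pass it onwards */
    }

    /* Stuff an explicit fence back in as the dma-buf's implicit fence,
     * e.g. before handing the buffer to something that only understands
     * implicit sync. */
    int dmabuf_import_fence(int dmabuf_fd, int sync_file_fd)
    {
        struct dma_buf_import_sync_file arg = {
            .flags = DMA_BUF_SYNC_WRITE,
            .fd    = sync_file_fd,
        };

        return ioctl(dmabuf_fd, DMA_BUF_IOCTL_IMPORT_SYNC_FILE, &arg);
    }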

So, while this may not solve any problems for Wayland compositors as I
previously thought (KMS/atomic supports sync_file.  Yay!), we still
have a very real problem in Vulkan.  It's great that Wayland has an
explicit sync API but until all compositors have supported it for at
least 2 years, I can't assume its existence and start deleting my old
code paths.  Currently, it's only implemented in Weston and the
ChromeOS compositor; gnome-shell, kwin, and sway are all still 100%
implicit sync AFAIK.  We also have to deal with X11.

For those who are asking the question in the back of their minds:
Yes, I'm trying to solve a userspace problem with kernel code and, no,
I don't think that's necessarily the wrong way around.  Don't get me
wrong; I very much want to solve the problem "properly" but unless
we're very sure we can get it solved properly everywhere quickly, a
solution which lets us improve our driver kernel APIs independently of
misc. Wayland compositors seems advantageous.

On Wed, Mar 11, 2020 at 6:02 PM Adam Jackson <ajax@redhat.com> wrote:
>
> On Wed, 2020-03-11 at 12:31 -0500, Jason Ekstrand wrote:
>
> >  - X11: With present, it has these "explicit" fence objects but
> > they're always a shmfence which lets the X server and client do a
> > userspace CPU-side hand-off without going over the socket (and
> > round-tripping through the kernel).  However, the only thing that
> > fence does is order the OpenGL API calls in the client and server and
> > the real synchronization is still implicit.
>
> I'm pretty sure "the only thing that fence does" is an implementation
> detail.

So I've been told, many times.

> PresentPixmap blocks until the wait-fence signals, but when and
> how it signals are properties of the fence itself. You could have drm
> give the client back a fence fd, pass that to xserver to create a fence
> object, and name that in the PresentPixmap request, and then drm can do
> whatever it wants to signal the fence.

Poking around at things, X11 may not be quite as bad as I thought
here.  It's not really set up for sync_file for a couple reasons:

 1. It only passes the file descriptor in once at
xcb_dri3_fence_from_fd rather than re-creating every frame from a new
sync_file
 2. It only takes a fence on present and doesn't return one in the
PRESENT_COMPLETE event

That said, plumbing syncobj in as an extension looks like a real
possibility.  A syncobj is just a container that holds a pointer to a
dma_fence and it has roughly the same CPU signal/reset behavior that's
exposed by the SyncFenceFuncsRec struct (see the sketch after the list
below).  There are a few things I'm not sure how to handle:

 1. The Sync extension has these trigger funcs which get called when
the fence is signalled.  I'm not sure how to handle that with syncobj
without a thread polling on them somehow.
 2. Not all kernel GPU drivers support syncobj; currently it's just
i915, amdgpu, and maybe freedreno AFAIK.  How do we handle cases such
as Intel+Nvidia?
 3. I have no idea what kinds of issues we'd run into with plumbing it
all through.  Hopefully, X is sufficiently abstracted but I really
don't know.
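
For anyone who hasn't poked at syncobj: in userspace terms the container
behaviour is just this (libdrm helpers; an X11 Sync binding would sit on top of
something shaped like this):

    #include <stdint.h>
    #include <xf86drm.h>

    /* A syncobj is a refcounted container holding a dma_fence pointer, so
     * converting to/from a sync_file is a pair of calls.  'drm_fd' is any
     * DRM render node the process has open. */
    int sync_file_round_trip(int drm_fd, int sync_file_in, int *sync_file_out)
    {
        uint32_t handle;
        int ret = drmSyncobjCreate(drm_fd, 0, &handle);

        if (ret)
            return ret;

        /* Point the container at the sync_file's fence... */
        ret = drmSyncobjImportSyncFile(drm_fd, handle, sync_file_in);
        if (!ret)
            /* ...and snapshot whatever fence it currently holds back out. */
            ret = drmSyncobjExportSyncFile(drm_fd, handle, sync_file_out);

        drmSyncobjDestroy(drm_fd, handle);
        return ret;
    }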

Please excuse my trepidation but I've got a bit of PTSD from
modifiers.  That was the last time I tried to solve a problem with
someone writing X11 patches and it's been 2-3 years and it's still not
shipping in distros.  If said syncobj extension suffers the same fate,
it isn't a real solution.

> > From my perspective, as a Vulkan driver developer, I have to deal with
> > the fact that Vulkan is an explicit sync API but Wayland and X11
> > aren't.
>
> I'm quite sure we can give you an explicit-sync X11 API. I think you
> may already have one.

It looks like we at least have a bunch of pieces which can probably be
used to build one.

--Jason

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: Plumbing explicit synchronization through the Linux ecosystem
  2020-03-11 17:31 ` Jason Ekstrand
@ 2020-03-13  1:37   ` Alexander E. Patrakov
  -1 siblings, 0 replies; 101+ messages in thread
From: Alexander E. Patrakov @ 2020-03-13  1:37 UTC (permalink / raw)
  To: Jason Ekstrand
  Cc: ML mesa-dev, Discussion of the development of and with GStreamer,
	wayland-devel @ lists . freedesktop . org, xorg-devel,
	Maling list - DRI developers, Linux Media Mailing List,
	Dave Airlie, Daniel Vetter, Bas Nieuwenhuizen, Daniel Stone

On Thu, Mar 12, 2020 at 6:36 PM Jason Ekstrand <jason@jlekstrand.net> wrote:
> From the perspective of a Wayland compositor (I used to play in this
> space), they'd love to implement the new explicit sync extension but
> can't.  Sure, they could wire up the extension, but the moment they go
> to flip a client buffer to the screen directly, they discover that KMS
> doesn't support any explicit sync APIs.  So, yes, they can technically
> implement the extension assuming the EGL stack they're running on has
> the sync_file extensions but any client buffers which come in using
> the explicit sync Wayland extension have to be composited and can't be
> scanned out directly.  As a 3D driver developer, I absolutely don't
> want compositors doing that because my users will complain about
> performance issues due to the extra blit.

<troll>
Maybe this is something for the Marketing Department to solve? Sell
the extra processing that can be done during such an extra blit as a
feature?

As a former user of a wide-gamut monitor that has no sRGB mode, and a
gamer, I would definitely accept the extra step (color conversion, not
"just a blit"!) between the application and the actual output. In
fact, I have set up compicc just for this purpose. Games with
poisonous oversaturated colors (because none of the game authors care
about wide-gamut monitors) are worse than the same games affected by
the very small performance penalty due to the conversion.

We just need a Marketing Person to come up with a huge list of other
cases where such a compositing step is required for correctness, and
declare that direct scanout is something that makes no sense in the
present day, except possibly on embedded devices.
</troll>

Of course, the above trolling does not solve the problem related to the
inability to be sure about correct API usage.

-- 
Alexander E. Patrakov
CV: http://pc.cd/PLz7

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [Mesa-dev] Plumbing explicit synchronization through the Linux ecosystem
  2020-03-11 17:31 ` Jason Ekstrand
                   ` (3 preceding siblings ...)
  (?)
@ 2020-03-14  2:02 ` Marek Olšák
  2020-03-16  2:49   ` Jason Ekstrand
  -1 siblings, 1 reply; 101+ messages in thread
From: Marek Olšák @ 2020-03-14  2:02 UTC (permalink / raw)
  To: Jason Ekstrand
  Cc: Daniel Vetter, xorg-devel, Maling list - DRI developers,
	wayland-devel @ lists . freedesktop . org,
	Discussion of the development of and with GStreamer, ML mesa-dev,
	linux-media



There is no synchronization between processes (e.g. 3D app and compositor)
within X on AMD hw. It works because of some hacks in Mesa.

Marek

On Wed, Mar 11, 2020 at 1:31 PM Jason Ekstrand <jason@jlekstrand.net> wrote:

> All,
>
> Sorry for casting such a broad net with this one. I'm sure most people
> who reply will get at least one mailing list rejection.  However, this
> is an issue that affects a LOT of components and that's why it's
> thorny to begin with.  Please pardon the length of this e-mail as
> well; I promise there's a concrete point/proposal at the end.
>
>
> Explicit synchronization is the future of graphics and media.  At
> least, that seems to be the consensus among all the graphics people
> I've talked to.  I had a chat with one of the lead Android graphics
> engineers recently who told me that doing explicit sync from the start
> was one of the best engineering decisions Android ever made.  It's
> also the direction being taken by more modern APIs such as Vulkan.
>
>
> ## What are implicit and explicit synchronization?
>
> For those that aren't familiar with this space, GPUs, media encoders,
> etc. are massively parallel and synchronization of some form is
> required to ensure that everything happens in the right order and
> avoid data races.  Implicit synchronization is when bits of work (3D,
> compute, video encode, etc.) are implicitly based on the absolute
> CPU-time order in which API calls occur.  Explicit synchronization is
> when the client (whatever that means in any given context) provides
> the dependency graph explicitly via some sort of synchronization
> primitives.  If you're still confused, consider the following
> examples:
>
> With OpenGL and EGL, almost everything is implicit sync.  Say you have
> two OpenGL contexts sharing an image where one writes to it and the
> other textures from it.  The way the OpenGL spec works, the client has
> to make the API calls to render to the image before (in CPU time) it
> makes the API calls which texture from the image.  As long as it does
> this (and maybe inserts a glFlush?), the driver will ensure that the
> rendering completes before the texturing happens and you get correct
> contents.
>
> Implicit synchronization can also happen across processes.  Wayland,
> for instance, is currently built on implicit sync where the client
> does their rendering and then does a hand-off (via wl_surface::commit)
> to tell the compositor it's done at which point the compositor can now
> texture from the surface.  The hand-off ensures that the client's
> OpenGL API calls happen before the server's OpenGL API calls.
>
> A good example of explicit synchronization is the Vulkan API.  There,
> a client (or multiple clients) can simultaneously build command
> buffers in different threads where one of those command buffers
> renders to an image and the other textures from it and then submit
> both of them at the same time with instructions to the driver for
> which order to execute them in.  The execution order is described via
> the VkSemaphore primitive.  With the new VK_KHR_timeline_semaphore
> extension, you can even submit the work which does the texturing
> BEFORE the work which does the rendering and the driver will sort it
> out.
>
> The #1 problem with implicit synchronization (which explicit solves)
> is that it leads to a lot of over-synchronization both in client space
> and in driver/device space.  The client has to synchronize a lot more
> because it has to ensure that the API calls happen in a particular
> order.  The driver/device have to synchronize a lot more because they
> never know what is going to end up being a synchronization point as an
> API call on another thread/process may occur at any time.  As we move
> to more and more multi-threaded programming this synchronization (on
> the client-side especially) becomes more and more painful.
>
>
> ## Current status in Linux
>
> Implicit synchronization in Linux works via a the kernel's internal
> dma_buf and dma_fence data structures.  A dma_fence is a tiny object
> which represents the "done" status for some bit of work.  Typically,
> dma_fences are created as a by-product of someone submitting some bit
> of work (say, 3D rendering) to the kernel.  The dma_buf object has a
> set of dma_fences on it representing shared (read) and exclusive
> (write) access to the object.  When work is submitted which, for
> instance, renders to the dma_buf, it's queued waiting on all the fences
> on the dma_buf, and a dma_fence is created representing the end of
> said rendering work and it's installed as the dma_buf's exclusive
> fence.  This way, the kernel can manage all its internal queues (3D
> rendering, display, video encode, etc.) and know which things to
> submit in what order.
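
To make that concrete, here is a rough kernel-side sketch of how a driver
might publish a rendering fence on a dma-buf's reservation object.  It
assumes the dma_resv API of recent kernels; the publish_write_fence()
helper and the simplified locking are illustrative, not taken from any
real driver.

#include <linux/dma-buf.h>
#include <linux/dma-resv.h>
#include <linux/dma-fence.h>

/* Publish "job_done" as the new exclusive (write) fence on buf, so that
 * later readers and writers implicitly wait for this rendering job. */
static int publish_write_fence(struct dma_buf *buf, struct dma_fence *job_done)
{
        struct dma_resv *resv = buf->resv;
        int ret;

        ret = dma_resv_lock(resv, NULL);
        if (ret)
                return ret;

        /* By this point the scheduler has already queued the job to wait
         * on every fence currently attached to resv. */
        dma_resv_add_excl_fence(resv, job_done);

        dma_resv_unlock(resv);
        return 0;
}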
>
> For the last few years, we've had sync_file in the kernel and it's
> plumbed into some drivers.  A sync_file is just a wrapper around a
> single dma_fence.  A sync_file is typically created as a by-product of
> submitting work (3D, compute, etc.) to the kernel and is signaled when
> that work completes.  When a sync_file is created, it is guaranteed by
> the kernel that it will become signaled in finite time and, once it's
> signaled, it remains signaled for the rest of time.  A sync_file is
> represented in UAPIs as a file descriptor and can be used with normal
> file APIs such as dup().  It can be passed into another UAPI which
> does some bit of queued work and the submitted work will wait for the
> sync_file to be triggered before executing.  A sync_file also supports
> poll() if you want to wait on it manually.
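
For illustration, a minimal user-space sketch of waiting on a sync_file
with nothing but standard file APIs (the wait_sync_file() helper is
hypothetical, not an existing function):

#include <errno.h>
#include <poll.h>
#include <unistd.h>

/* Block until the sync_file behind sync_fd signals, or timeout_ms elapses.
 * Returns 0 when signaled, -1 on timeout or error. */
static int wait_sync_file(int sync_fd, int timeout_ms)
{
        struct pollfd pfd = { .fd = sync_fd, .events = POLLIN };
        int ret;

        do {
                ret = poll(&pfd, 1, timeout_ms);
        } while (ret < 0 && errno == EINTR);

        return ret == 1 ? 0 : -1;
}

/* The same fd can be dup()'d or passed to any UAPI that accepts one. */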
>
> Unfortunately, sync_file is not broadly used and not all kernel GPU
> drivers support it.  Here's a very quick overview of my understanding
> of the status of various components (I don't know the status of
> anything in the media world):
>
>  - Vulkan: Explicit synchronization all the way but we have to go
> implicit as soon as we interact with a window-system.  Vulkan has APIs
> to import/export sync_files to/from its VkSemaphore and VkFence
> synchronization primitives.
>  - OpenGL: Implicit all the way.  There are some EGL extensions to
> enable some forms of explicit sync via sync_file but OpenGL itself is
> still implicit.
>  - Wayland: Currently depends on implicit sync in the kernel (accessed
> via EGL/OpenGL).  There is an unstable extension to allow passing
> sync_files around but it's questionable how useful it is right now
> (more on that later).
>  - X11: With present, it has these "explicit" fence objects but
> they're always a shmfence which lets the X server and client do a
> userspace CPU-side hand-off without going over the socket (and
> round-tripping through the kernel).  However, the only thing that
> fence does is order the OpenGL API calls in the client and server and
> the real synchronization is still implicit.
>  - linux/i915/gem: Fully supports using sync_file or syncobj for explicit
> sync.
>  - linux/amdgpu: Supports sync_file and syncobj but it still
> implicitly syncs sometimes due to its internal memory residency
> handling which can lead to over-synchronization.
>  - KMS: Implicit sync all the way.  There are no KMS APIs which take
> explicit sync primitives.
>  - v4l: ???
>  - gstreamer: ???
>  - Media APIs such as vaapi etc.:  ???
>
>
> ## Chicken and egg problems
>
> Ok, this is where it starts getting depressing.  I made the claim
> above that Wayland has an explicit synchronization protocol that's of
> questionable usefulness.  I would claim that basically any bit of
> plumbing we do through window systems is currently of questionable
> usefulness.  Why?
>
> From my perspective, as a Vulkan driver developer, I have to deal with
> the fact that Vulkan is an explicit sync API but Wayland and X11
> aren't.  Unfortunately, the Wayland extension solves zero problems for
> me because I can't really use it unless it's implemented in all of the
> compositors.  Until every Wayland compositor I care about my users
> being able to use (which is basically all of them) supports the
> extension, I have to continue to carry around my pile of hacks to keep
> implicit sync and Vulkan working nicely together.
>
> From the perspective of a Wayland compositor (I used to play in this
> space), they'd love to implement the new explicit sync extension but
> can't.  Sure, they could wire up the extension, but the moment they go
> to flip a client buffer to the screen directly, they discover that KMS
> doesn't support any explicit sync APIs.  So, yes, they can technically
> implement the extension assuming the EGL stack they're running on has
> the sync_file extensions but any client buffers which come in using
> the explicit sync Wayland extension have to be composited and can't be
> scanned out directly.  As a 3D driver developer, I absolutely don't
> want compositors doing that because my users will complain about
> performance issues due to the extra blit.
>
> Ok, so let's say we get KMS wired up with explicit sync.  That solves
> all our problems, right?  It does, right up until someone decides that
> they want to screen capture their Wayland session via some hardware
> media encoder that doesn't support explicit sync.  Now we have to
> plumb it all the way through the media stack, gstreamer, etc.  Great,
> so let's do that!  Oh, but gstreamer won't want to plumb it through
> until they're guaranteed that they can use explicit sync when
> displaying on X11 or Wayland.  Are you seeing the problem?
>
> To make matters worse, since most things are doing implicit
> synchronization today, it's really easy to get your explicit
> synchronization wrong and never notice.  If you forget to pass a
> sync_file into one place (say you never notice KMS doesn't support
> them), it will probably work anyway thanks to all the implicit sync
> that's going on elsewhere.
>
> So, clearly, we all need to go write piles of code that we can't
> actually properly test until everyone else has written their piece and
> then we use explicit sync if and only if all components support it.
> Really?  We're going to do multiple years of development and then just
> hope it works when we finally flip the switch?  That doesn't sound
> like a good plan to me.
>
>
> ## A proposal: Implicit and explicit sync together
>
> How to solve all these chicken-and-egg problems is something I've been
> giving quite a bit of thought to (and talking with many others about)
> over the last couple of years.  One motivation for this is that we have
> to deal with the implicit/explicit mismatch in Vulkan window-system
> integration.  Another motivation is that I'm
> becoming increasingly unhappy with the way that synchronization,
> memory residency, and command submission are inherently intertwined in
> i915 and would like to break things apart.  Towards that end, I have
> an actual proposal.
>
> A couple weeks ago, I sent a series of patches to the dri-devel
> mailing list which adds a pair of new ioctls to dma-buf which allow
> userspace to manually import a sync_file into, or export a sync_file
> from, a dma-buf.  The idea is that something like a Wayland compositor
> can switch to 100% explicit sync internally once the ioctls are
> available.  If it gets
> buffers in from a client that doesn't use the explicit sync extension,
> it can pull a sync_file from the dma-buf and use that exactly as it
> would a sync_file passed via the explicit sync extension.  When it
> goes to scan out a user buffer and discovers that KMS doesn't accept
> sync_files (or if it tries to use that pesky media encoder no one has
> converted), it can take its sync_file for display and stuff it into
> the dma-buf before handing it to KMS.
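
As a sketch of what a compositor could do with such a pair of ioctls (the
exact uapi below -- struct names, ioctl numbers and flags -- is an
assumption matching the description above, not necessarily what the RFC
ended up with):

#include <stdint.h>
#include <sys/ioctl.h>

/* Assumed uapi mirroring the proposal. */
struct dma_buf_export_sync_file {
        uint32_t flags;   /* which implicit fences to snapshot (read/write) */
        int32_t  fd;      /* out: sync_file fd */
};
struct dma_buf_import_sync_file {
        uint32_t flags;
        int32_t  fd;      /* in: sync_file fd to attach to the dma-buf */
};
#define DMA_BUF_IOCTL_EXPORT_SYNC_FILE _IOWR('b', 2, struct dma_buf_export_sync_file)
#define DMA_BUF_IOCTL_IMPORT_SYNC_FILE _IOW('b', 3, struct dma_buf_import_sync_file)

/* Client sent no explicit fence: pull one out of its dma-buf instead. */
static int fence_from_dmabuf(int dmabuf_fd, uint32_t flags)
{
        struct dma_buf_export_sync_file arg = { .flags = flags, .fd = -1 };

        if (ioctl(dmabuf_fd, DMA_BUF_IOCTL_EXPORT_SYNC_FILE, &arg) < 0)
                return -1;
        return arg.fd;
}

/* About to hand the buffer to an implicit-only consumer (legacy KMS, an
 * unconverted encoder): stuff the compositor's fence back into the dma-buf. */
static int fence_into_dmabuf(int dmabuf_fd, int sync_fd, uint32_t flags)
{
        struct dma_buf_import_sync_file arg = { .flags = flags, .fd = sync_fd };

        return ioctl(dmabuf_fd, DMA_BUF_IOCTL_IMPORT_SYNC_FILE, &arg);
}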
>
> Along with the kernel patches, I've also implemented support for this
> in the Vulkan WSI code used by ANV and RADV.  With those patches, the
> only requirement on the Vulkan drivers is that you be able to export
> any VkSemaphore as a sync_file and temporarily import a sync_file into
> any VkFence or VkSemaphore.  As long as that works, the core Vulkan
> driver only ever sees explicit synchronization via sync_file.  The WSI
> code uses these new ioctls to translate the implicit sync of X11 and
> Wayland to the explicit sync the Vulkan driver wants.
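
On the Vulkan side, that translation leans on the existing
VK_KHR_external_semaphore_fd machinery.  A condensed sketch (error
handling omitted; the extension entry points are assumed to have been
resolved with vkGetDeviceProcAddr, and the semaphore created with sync-fd
export/import enabled):

#include <vulkan/vulkan.h>

/* Export: turn the semaphore the client's rendering will signal into a
 * sync_file fd that can be handed to a window system or a dma-buf. */
static int semaphore_to_sync_file(VkDevice dev, VkSemaphore sem,
                                  PFN_vkGetSemaphoreFdKHR get_fd)
{
        const VkSemaphoreGetFdInfoKHR info = {
                .sType = VK_STRUCTURE_TYPE_SEMAPHORE_GET_FD_INFO_KHR,
                .semaphore = sem,
                .handleType = VK_EXTERNAL_SEMAPHORE_HANDLE_TYPE_SYNC_FD_BIT,
        };
        int fd = -1;

        get_fd(dev, &info, &fd);
        return fd;
}

/* Import: temporarily bind a sync_file (e.g. one exported from a dma-buf)
 * to a semaphore that the next submit will wait on. */
static void sync_file_to_semaphore(VkDevice dev, VkSemaphore sem, int fd,
                                   PFN_vkImportSemaphoreFdKHR import_fd)
{
        const VkImportSemaphoreFdInfoKHR info = {
                .sType = VK_STRUCTURE_TYPE_IMPORT_SEMAPHORE_FD_INFO_KHR,
                .semaphore = sem,
                .flags = VK_SEMAPHORE_IMPORT_TEMPORARY_BIT,
                .handleType = VK_EXTERNAL_SEMAPHORE_HANDLE_TYPE_SYNC_FD_BIT,
                .fd = fd,
        };

        import_fd(dev, &info);
}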
>
> I'm hoping (and here's where I want a sanity check) that a simple API
> like this will allow us to finally start moving the Linux ecosystem
> over to explicit synchronization one piece at a time in a way that's
> actually correct.  (No Wayland explicit sync with compositors hoping
> KMS magically works even though it doesn't have a sync_file API.)
> Once some pieces in the ecosystem start moving, there will be
> motivation to start moving others and maybe we can actually build the
> momentum to get most everything converted.
>
> For reference, you can find the kernel RFC patches and mesa MR here:
>
> https://lists.freedesktop.org/archives/dri-devel/2020-March/258833.html
>
> https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/4037
>
> At this point, I welcome your thoughts, comments, objections, and
> maybe even help/review. :-)
>
> --Jason Ekstrand
> _______________________________________________
> mesa-dev mailing list
> mesa-dev@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/mesa-dev
>

[-- Attachment #1.2: Type: text/html, Size: 15023 bytes --]

[-- Attachment #2: Type: text/plain, Size: 160 bytes --]

_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [Mesa-dev] Plumbing explicit synchronization through the Linux ecosystem
  2020-03-14  2:02 ` [Mesa-dev] " Marek Olšák
@ 2020-03-16  2:49   ` Jason Ekstrand
  2020-03-16  3:50     ` Marek Olšák
  0 siblings, 1 reply; 101+ messages in thread
From: Jason Ekstrand @ 2020-03-16  2:49 UTC (permalink / raw)
  To: Marek Olšák
  Cc: Daniel Vetter, xorg-devel, Maling list - DRI developers,
	wayland-devel @ lists . freedesktop . org,
	Discussion of the development of and with GStreamer, ML mesa-dev,
	linux-media


[-- Attachment #1.1: Type: text/plain, Size: 13609 bytes --]

Could you elaborate? If there's something missing from my mental model of 
how implicit sync works, I'd like to have it corrected. People continue 
claiming that AMD is somehow special but I have yet to grasp what makes it 
so.  (Not that anyone has bothered to try all that hard to explain it.)


--Jason

On March 13, 2020 21:03:21 Marek Olšák <maraeo@gmail.com> wrote:
> There is no synchronization between processes (e.g. 3D app and compositor) 
> within X on AMD hw. It works because of some hacks in Mesa.
>
> Marek
>
> On Wed, Mar 11, 2020 at 1:31 PM Jason Ekstrand <jason@jlekstrand.net> wrote:
> [original message quoted in full; snipped]


[-- Attachment #1.2: Type: text/html, Size: 16171 bytes --]

[-- Attachment #2: Type: text/plain, Size: 160 bytes --]

_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [Mesa-dev] Plumbing explicit synchronization through the Linux ecosystem
  2020-03-16  2:49   ` Jason Ekstrand
@ 2020-03-16  3:50     ` Marek Olšák
  2020-03-16  9:57         ` Michel Dänzer
  0 siblings, 1 reply; 101+ messages in thread
From: Marek Olšák @ 2020-03-16  3:50 UTC (permalink / raw)
  To: Jason Ekstrand
  Cc: Daniel Vetter, xorg-devel, Maling list - DRI developers,
	wayland-devel @ lists . freedesktop . org,
	Discussion of the development of and with GStreamer, ML mesa-dev,
	linux-media


[-- Attachment #1.1: Type: text/plain, Size: 15522 bytes --]

The synchronization works because the Mesa driver waits for idle (drains
the GFX pipeline) at the end of command buffers and there is only 1
graphics queue, so everything is ordered.

The GFX pipeline runs asynchronously to the command buffer, meaning the
command buffer only starts draws and doesn't wait for completion. If the
Mesa driver didn't wait at the end of the command buffer, the command
buffer would finish and a different process could start execution of its
own command buffer while shaders of the previous process are still running.

If the Mesa driver submits a command buffer internally (because it's full),
it doesn't wait, so the GFX pipeline doesn't notice that a command buffer
ended and a new one started.

The waiting at the end of command buffers happens only when the flush is
external (Swap buffers, glFlush).

It's a performance problem, because the GFX queue is blocked until the GFX
pipeline is drained at the end of every frame at least.

So explicit fences for SwapBuffers would help.

Marek
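
For reference, one existing way to get an explicit fence out of a GL/EGL
frame today is the EGL_ANDROID_native_fence_sync extension, which hands
back a sync_file fd for the pending work instead of relying on a
wait-for-idle.  A rough sketch (the extension entry points are assumed to
have been resolved with eglGetProcAddress):

#include <EGL/egl.h>
#include <EGL/eglext.h>
#include <GLES2/gl2.h>

/* Create a native fence sync after the frame's GL commands and extract a
 * sync_file fd the consumer (compositor, KMS, encoder) could wait on. */
static int frame_done_fence(EGLDisplay dpy,
                            PFNEGLCREATESYNCKHRPROC create_sync,
                            PFNEGLDUPNATIVEFENCEFDANDROIDPROC dup_fence_fd,
                            PFNEGLDESTROYSYNCKHRPROC destroy_sync)
{
        EGLSyncKHR sync = create_sync(dpy, EGL_SYNC_NATIVE_FENCE_ANDROID, NULL);
        int fd;

        glFlush();  /* the fd only becomes valid once the commands are flushed */
        fd = dup_fence_fd(dpy, sync);
        destroy_sync(dpy, sync);
        return fd;  /* EGL_NO_NATIVE_FENCE_FD_ANDROID (-1) on failure */
}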

On Sun., Mar. 15, 2020, 22:49 Jason Ekstrand, <jason@jlekstrand.net> wrote:

> Could you elaborate. If there's something missing from my mental model of
> how implicit sync works, I'd like to have it corrected. People continue
> claiming that AMD is somehow special but I have yet to grasp what makes it
> so.  (Not that anyone has bothered to try all that hard to explain it.)
>
>
> --Jason
>
> On March 13, 2020 21:03:21 Marek Olšák <maraeo@gmail.com> wrote:
>
>> There is no synchronization between processes (e.g. 3D app and
>> compositor) within X on AMD hw. It works because of some hacks in Mesa.
>>
>> Marek
>>
>> On Wed, Mar 11, 2020 at 1:31 PM Jason Ekstrand <jason@jlekstrand.net>
>> wrote:
>>
>>> [original message quoted in full; snipped]

[-- Attachment #1.2: Type: text/html, Size: 17776 bytes --]

[-- Attachment #2: Type: text/plain, Size: 160 bytes --]

_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [Mesa-dev] Plumbing explicit synchronization through the Linux ecosystem
  2020-03-16  3:50     ` Marek Olšák
@ 2020-03-16  9:57         ` Michel Dänzer
  0 siblings, 0 replies; 101+ messages in thread
From: Michel Dänzer @ 2020-03-16  9:57 UTC (permalink / raw)
  To: Marek Olšák, Jason Ekstrand
  Cc: Daniel Vetter, xorg-devel, Maling list - DRI developers,
	wayland-devel @ lists . freedesktop . org,
	Discussion of the development of and with GStreamer, ML mesa-dev,
	linux-media

On 2020-03-16 4:50 a.m., Marek Olšák wrote:
> The synchronization works because the Mesa driver waits for idle (drains
> the GFX pipeline) at the end of command buffers and there is only 1
> graphics queue, so everything is ordered.
> 
> The GFX pipeline runs asynchronously to the command buffer, meaning the
> command buffer only starts draws and doesn't wait for completion. If the
> Mesa driver didn't wait at the end of the command buffer, the command
> buffer would finish and a different process could start execution of its
> own command buffer while shaders of the previous process are still running.
> 
> If the Mesa driver submits a command buffer internally (because it's full),
> it doesn't wait, so the GFX pipeline doesn't notice that a command buffer
> ended and a new one started.
> 
> The waiting at the end of command buffers happens only when the flush is
> external (Swap buffers, glFlush).
> 
> It's a performance problem, because the GFX queue is blocked until the GFX
> pipeline is drained at the end of every frame at least.
> 
> So explicit fences for SwapBuffers would help.

Not sure what difference it would make, since the same thing needs to be
done for explicit fences as well, doesn't it?


-- 
Earthling Michel Dänzer               |               https://redhat.com
Libre software enthusiast             |             Mesa and X developer

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: Plumbing explicit synchronization through the Linux ecosystem
  2020-03-11 20:18     ` Nicolas Dufresne
@ 2020-03-16 10:20       ` Laurent Pinchart
  -1 siblings, 0 replies; 101+ messages in thread
From: Laurent Pinchart @ 2020-03-16 10:20 UTC (permalink / raw)
  To: Nicolas Dufresne
  Cc: Jason Ekstrand, ML mesa-dev,
	Discussion of the development of and with GStreamer,
	wayland-devel @ lists . freedesktop . org, xorg-devel,
	Maling list - DRI developers, linux-media, Dave Airlie,
	Daniel Vetter, Bas Nieuwenhuizen, Daniel Stone

On Wed, Mar 11, 2020 at 04:18:55PM -0400, Nicolas Dufresne wrote:
> (I know I'm going to be spammed by so many mailing lists ...)
> 
> On Wednesday, March 11, 2020 at 14:21 -0500, Jason Ekstrand wrote:
> > On Wed, Mar 11, 2020 at 12:31 PM Jason Ekstrand <jason@jlekstrand.net> wrote:
> > > [...]
> > > 
> > > Unfortunately, sync_file is not broadly used and not all kernel GPU
> > > drivers support it.  Here's a very quick overview of my understanding
> > > of the status of various components (I don't know the status of
> > > anything in the media world):
> > > 
> > >  - Vulkan: Explicit synchronization all the way but we have to go
> > > implicit as soon as we interact with a window-system.  Vulkan has APIs
> > > to import/export sync_files to/from its VkSemaphore and VkFence
> > > synchronization primitives.
> > >  - OpenGL: Implicit all the way.  There are some EGL extensions to
> > > enable some forms of explicit sync via sync_file but OpenGL itself is
> > > still implicit.
> > >  - Wayland: Currently depends on implicit sync in the kernel (accessed
> > > via EGL/OpenGL).  There is an unstable extension to allow passing
> > > sync_files around but it's questionable how useful it is right now
> > > (more on that later).
> > >  - X11: With present, it has these "explicit" fence objects but
> > > they're always a shmfence which lets the X server and client do a
> > > userspace CPU-side hand-off without going over the socket (and
> > > round-tripping through the kernel).  However, the only thing that
> > > fence does is order the OpenGL API calls in the client and server and
> > > the real synchronization is still implicit.
> > >  - linux/i915/gem: Fully supports using sync_file or syncobj for explicit
> > > sync.
> > >  - linux/amdgpu: Supports sync_file and syncobj but it still
> > > implicitly syncs sometimes due to its internal memory residency
> > > handling which can lead to over-synchronization.
> > >  - KMS: Implicit sync all the way.  There are no KMS APIs which take
> > > explicit sync primitives.
> > 
> > Correction:  Apparently, I missed some things.  If you use atomic, KMS
> > does have explicit in- and out-fences.  Non-atomic users (e.g. X11)
> > are still in trouble but most Wayland compositors use atomic these
> > days.
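
For completeness, a rough sketch of those atomic in-/out-fences with
libdrm.  The "IN_FENCE_FD" plane property and "OUT_FENCE_PTR" CRTC
property IDs (prop_in_fence_fd / prop_out_fence_ptr below) are assumed to
have been looked up beforehand via drmModeObjectGetProperties():

#include <stdint.h>
#include <xf86drm.h>
#include <xf86drmMode.h>

/* in_fence_fd: sync_file the display engine must wait on before scanout.
 * out_fence_fd: receives a sync_file that signals when the flip is done. */
static int commit_with_fences(int drm_fd, drmModeAtomicReq *req,
                              uint32_t plane_id, uint32_t prop_in_fence_fd,
                              uint32_t crtc_id, uint32_t prop_out_fence_ptr,
                              int in_fence_fd, int *out_fence_fd)
{
        drmModeAtomicAddProperty(req, plane_id, prop_in_fence_fd, in_fence_fd);
        drmModeAtomicAddProperty(req, crtc_id, prop_out_fence_ptr,
                                 (uint64_t)(uintptr_t)out_fence_fd);

        return drmModeAtomicCommit(drm_fd, req, DRM_MODE_ATOMIC_NONBLOCK, NULL);
}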
> > 
> > >  - v4l: ???
> > >  - gstreamer: ???
> > >  - Media APIs such as vaapi etc.:  ???
> 
> GStreamer is a consumer for V4L2, VAAPI and other stuff. Using asynchronous
> buffer synchronisation is something we already do with GL (even if limited).
> We place a GLSync object in the pipeline and attach it to the related
> GstBuffer. We wait on these GLSync objects as late as possible (or supersede
> the sync if we queue more work into the same GL context). That requires a
> special mode of operation of course. We don't usually like making lazy
> blocking calls implicit, as it tends to cause random issues. If we need to
> wait, we think it's better to wait in the module that is responsible, so in
> general we try to negotiate and fall back locally (it's plugin-based, so
> this can be really messy otherwise).
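
As a concrete illustration of that GLSync-in-GstBuffer pattern, roughly
(a sketch using the GstGLSyncMeta helpers from gst-plugins-base; details
such as where the contexts come from are glossed over):

#include <gst/gl/gl.h>

static void produce_then_consume(GstGLContext *producer_ctx,
                                 GstGLContext *consumer_ctx,
                                 GstBuffer *buffer)
{
        GstGLSyncMeta *sync = gst_buffer_add_gl_sync_meta(producer_ctx, buffer);

        /* Producer: record a sync point after its GL commands, no blocking. */
        gst_gl_sync_meta_set_sync_point(sync, producer_ctx);

        /* ... the buffer travels down the pipeline ... */

        /* Consumer: wait as late as possible, in the context that uses it. */
        gst_gl_sync_meta_wait(sync, consumer_ctx);
}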
> 
> So basically this problem needs to be solved in V4L2, VAAPI and other
> lower-level APIs first. We need an API that provides us these fences (in or
> out), and then we can consider using them. For V4L2, there was an attempt,
> but it was a bit of a misfit. Your proposal could work, it needs to be
> tested I guess, but it does not solve some of the other issues that were
> discussed. Notably for camera capture, where the HW timestamp is captured
> at about the same time the frame is ready. But the timestamp is not part of
> the payload, so you need an entire API to asynchronously deliver that
> metadata. It's the biggest pain point I've found; such an API would be
> quite invasive, or if made really generic, might just never be adopted
> widely enough.

Another issue is that V4L2 doesn't offer any guarantee on job ordering.
When you queue multiple buffers for camera capture for instance, you
don't know until the capture completes in which buffer the frame has been
captured. In the normal case buffers are processed in sequence, but if
an error occurs during capture, they can be recycled internally and put
to the back of the queue. Unless I'm mistaken, this problem also exists
with stateful codecs. And if you don't know in advance which buffer you
will receive from the device, the usefulness of fences becomes very
questionable :-)

> There are other elements that would implement fencing, notably kmssink, but
> no one actually dared porting it to atomic KMS, so clearly there is very
> little community interest. glimagesink could clearly benefit. Right now, if
> we import a DMABuf and that DMABuf is used for rendering, an implicit fence
> is attached, which we are unaware of. Philipp Zabel is working on a patch so
> that V4L2 QBUF would wait, but waiting in QBUF is not allowed if O_NONBLOCK
> was set (which GStreamer uses), so then the operation would just fail where
> it worked before (breaking userspace). If it was an explicit fence, we could
> handle that in GStreamer cleanly as we do for new APIs.
> 
> > > [...]
> > > 
> > > From the perspective of a Wayland compositor (I used to play in this
> > > space), they'd love to implement the new explicit sync extension but
> > > can't.  Sure, they could wire up the extension, but the moment they go
> > > to flip a client buffer to the screen directly, they discover that KMS
> > > doesn't support any explicit sync APIs.
> > 
> > As per the above correction, Wayland compositors aren't nearly as bad
> > off as I initially thought.  There may still be weird screen capture
> > cases but the normal cases of compositing and displaying via
> > KMS/atomic should be in reasonably good shape.
> > 
> > > So, yes, they can technically
> > > implement the extension assuming the EGL stack they're running on has
> > > the sync_file extensions but any client buffers which come in using
> > > the explicit sync Wayland extension have to be composited and can't be
> > > scanned out directly.  As a 3D driver developer, I absolutely don't
> > > want compositors doing that because my users will complain about
> > > performance issues due to the extra blit.
> > > 
> > > Ok, so let's say we get KMS wired up with implicit sync.  That solves
> > > all our problems, right?  It does, right up until someone decides that
> > > they wan to screen capture their Wayland session via some hardware
> > > media encoder that doesn't support explicit sync.  Now we have to
> > > plumb it all the way through the media stack, gstreamer, etc.  Great,
> > > so let's do that!  Oh, but gstreamer won't want to plumb it through
> > > until they're guaranteed that they can use explicit sync when
> > > displaying on X11 or Wayland.  Are you seeing the problem?
> > > 
> > > To make matters worse, since most things are doing implicit
> > > synchronization today, it's really easy to get your explicit
> > > synchronization wrong and never notice.  If you forget to pass a
> > > sync_file into one place (say you never notice KMS doesn't support
> > > them), it will probably work anyway thanks to all the implicit sync
> > > that's going on elsewhere.
> > > 
> > > So, clearly, we all need to go write piles of code that we can't
> > > actually properly test until everyone else has written their piece and
> > > then we use explicit sync if and only if all components support it.
> > > Really?  We're going to do multiple years of development and then just
> > > hope it works when we finally flip the switch?  That doesn't sound
> > > like a good plan to me.
> > > 
> > > 
> > > ## A proposal: Implicit and explicit sync together
> > > 
> > > How to solve all these chicken-and-egg problems is something I've been
> > > giving quite a bit of thought (and talking with many others about) in
> > > the last couple of years.  One motivation for this is that we have to
> > > deal with a mismatch in Vulkan.  Another motivation is that I'm
> > > becoming increasingly unhappy with the way that synchronization,
> > > memory residency, and command submission are inherently intertwined in
> > > i915 and would like to break things apart.  Towards that end, I have
> > > an actual proposal.
> > > 
> > > A couple weeks ago, I sent a series of patches to the dri-devel
> > > mailing list which adds a pair of new ioctls to dma-buf which allow
> > > userspace to manually import or export a sync_file from a dma-buf.
> > > The idea is that something like a Wayland compositor can switch to
> > > 100% explicit sync internally once the ioctl is available.  If it gets
> > > buffers in from a client that doesn't use the explicit sync extension,
> > > it can pull a sync_file from the dma-buf and use that exactly as it
> > > would a sync_file passed via the explicit sync extension.  When it
> > > goes to scan out a user buffer and discovers that KMS doesn't accept
> > > sync_files (or if it tries to use that pesky media encoder no one has
> > > converted), it can take it's sync_file for display and stuff it into
> > > the dma-buf before handing it to KMS.
> > > 
> > > Along with the kernel patches, I've also implemented support for this
> > > in the Vulkan WSI code used by ANV and RADV.  With those patches, the
> > > only requirement on the Vulkan drivers is that you be able to export
> > > any VkSemaphore as a sync_file and temporarily import a sync_file into
> > > any VkFence or VkSemaphore.  As long as that works, the core Vulkan
> > > driver only ever sees explicit synchronization via sync_file.  The WSI
> > > code uses these new ioctls to translate the implicit sync of X11 and
> > > Wayland to the explicit sync the Vulkan driver wants.
> > > 
> > > I'm hoping (and here's where I want a sanity check) that a simple API
> > > like this will allow us to finally start moving the Linux ecosystem
> > > over to explicit synchronization one piece at a time in a way that's
> > > actually correct.  (No Wayland explicit sync with compositors hoping
> > > KMS magically works even though it doesn't have a sync_file API.)
> > > Once some pieces in the ecosystem start moving, there will be
> > > motivation to start moving others and maybe we can actually build the
> > > momentum to get most everything converted.
> > > 
> > > For reference, you can find the kernel RFC patches and mesa MR here:
> > > 
> > > https://lists.freedesktop.org/archives/dri-devel/2020-March/258833.html
> > > 
> > > https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/4037
> > > 
> > > At this point, I welcome your thoughts, comments, objections, and
> > > maybe even help/review. :-)
> > > 
> > > --Jason Ekstrand
> 

-- 
Regards,

Laurent Pinchart

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: Plumbing explicit synchronization through the Linux ecosystem
@ 2020-03-16 10:20       ` Laurent Pinchart
  0 siblings, 0 replies; 101+ messages in thread
From: Laurent Pinchart @ 2020-03-16 10:20 UTC (permalink / raw)
  To: Nicolas Dufresne
  Cc: Daniel Vetter, xorg-devel, Maling list - DRI developers,
	wayland-devel @ lists . freedesktop . org,
	Discussion of the development of and with GStreamer,
	Jason Ekstrand, ML mesa-dev, linux-media

On Wed, Mar 11, 2020 at 04:18:55PM -0400, Nicolas Dufresne wrote:
> (I know I'm going to be spammed by so many mailing list ...)
> 
> Le mercredi 11 mars 2020 à 14:21 -0500, Jason Ekstrand a écrit :
> > On Wed, Mar 11, 2020 at 12:31 PM Jason Ekstrand <jason@jlekstrand.net> wrote:
> > > All,
> > > 
> > > Sorry for casting such a broad net with this one. I'm sure most people
> > > who reply will get at least one mailing list rejection.  However, this
> > > is an issue that affects a LOT of components and that's why it's
> > > thorny to begin with.  Please pardon the length of this e-mail as
> > > well; I promise there's a concrete point/proposal at the end.
> > > 
> > > 
> > > Explicit synchronization is the future of graphics and media.  At
> > > least, that seems to be the consensus among all the graphics people
> > > I've talked to.  I had a chat with one of the lead Android graphics
> > > engineers recently who told me that doing explicit sync from the start
> > > was one of the best engineering decisions Android ever made.  It's
> > > also the direction being taken by more modern APIs such as Vulkan.
> > > 
> > > 
> > > ## What are implicit and explicit synchronization?
> > > 
> > > For those that aren't familiar with this space, GPUs, media encoders,
> > > etc. are massively parallel and synchronization of some form is
> > > required to ensure that everything happens in the right order and
> > > avoid data races.  Implicit synchronization is when bits of work (3D,
> > > compute, video encode, etc.) are implicitly based on the absolute
> > > CPU-time order in which API calls occur.  Explicit synchronization is
> > > when the client (whatever that means in any given context) provides
> > > the dependency graph explicitly via some sort of synchronization
> > > primitives.  If you're still confused, consider the following
> > > examples:
> > > 
> > > With OpenGL and EGL, almost everything is implicit sync.  Say you have
> > > two OpenGL contexts sharing an image where one writes to it and the
> > > other textures from it.  The way the OpenGL spec works, the client has
> > > to make the API calls to render to the image before (in CPU time) it
> > > makes the API calls which texture from the image.  As long as it does
> > > this (and maybe inserts a glFlush?), the driver will ensure that the
> > > rendering completes before the texturing happens and you get correct
> > > contents.
> > > 
> > > Implicit synchronization can also happen across processes.  Wayland,
> > > for instance, is currently built on implicit sync where the client
> > > does their rendering and then does a hand-off (via wl_surface::commit)
> > > to tell the compositor it's done at which point the compositor can now
> > > texture from the surface.  The hand-off ensures that the client's
> > > OpenGL API calls happen before the server's OpenGL API calls.
> > > 
> > > A good example of explicit synchronization is the Vulkan API.  There,
> > > a client (or multiple clients) can simultaneously build command
> > > buffers in different threads where one of those command buffers
> > > renders to an image and the other textures from it and then submit
> > > both of them at the same time with instructions to the driver for
> > > which order to execute them in.  The execution order is described via
> > > the VkSemaphore primitive.  With the new VK_KHR_timeline_semaphore
> > > extension, you can even submit the work which does the texturing
> > > BEFORE the work which does the rendering and the driver will sort it
> > > out.
> > > 
> > > The #1 problem with implicit synchronization (which explicit solves)
> > > is that it leads to a lot of over-synchronization both in client space
> > > and in driver/device space.  The client has to synchronize a lot more
> > > because it has to ensure that the API calls happen in a particular
> > > order.  The driver/device have to synchronize a lot more because they
> > > never know what is going to end up being a synchronization point as an
> > > API call on another thread/process may occur at any time.  As we move
> > > to more and more multi-threaded programming this synchronization (on
> > > the client-side especially) becomes more and more painful.
> > > 
> > > 
> > > ## Current status in Linux
> > > 
> > > Implicit synchronization in Linux works via a the kernel's internal
> > > dma_buf and dma_fence data structures.  A dma_fence is a tiny object
> > > which represents the "done" status for some bit of work.  Typically,
> > > dma_fences are created as a by-product of someone submitting some bit
> > > of work (say, 3D rendering) to the kernel.  The dma_buf object has a
> > > set of dma_fences on it representing shared (read) and exclusive
> > > (write) access to the object.  When work is submitted which, for
> > > instance renders to the dma_buf, it's queued waiting on all the fences
> > > on the dma_buf and and a dma_fence is created representing the end of
> > > said rendering work and it's installed as the dma_buf's exclusive
> > > fence.  This way, the kernel can manage all its internal queues (3D
> > > rendering, display, video encode, etc.) and know which things to
> > > submit in what order.
> > > 
> > > For the last few years, we've had sync_file in the kernel and it's
> > > plumbed into some drivers.  A sync_file is just a wrapper around a
> > > single dma_fence.  A sync_file is typically created as a by-product of
> > > submitting work (3D, compute, etc.) to the kernel and is signaled when
> > > that work completes.  When a sync_file is created, it is guaranteed by
> > > the kernel that it will become signaled in finite time and, once it's
> > > signaled, it remains signaled for the rest of time.  A sync_file is
> > > represented in UAPIs as a file descriptor and can be used with normal
> > > file APIs such as dup().  It can be passed into another UAPI which
> > > does some bit of queue'd work and the submitted work will wait for the
> > > sync_file to be triggered before executing.  A sync_file also supports
> > > poll() if  you want to wait on it manually.
> > > 
> > > Unfortunately, sync_file is not broadly used and not all kernel GPU
> > > drivers support it.  Here's a very quick overview of my understanding
> > > of the status of various components (I don't know the status of
> > > anything in the media world):
> > > 
> > >  - Vulkan: Explicit synchronization all the way but we have to go
> > > implicit as soon as we interact with a window-system.  Vulkan has APIs
> > > to import/export sync_files to/from it's VkSemaphore and VkFence
> > > synchronization primitives.
> > >  - OpenGL: Implicit all the way.  There are some EGL extensions to
> > > enable some forms of explicit sync via sync_file but OpenGL itself is
> > > still implicit.
> > >  - Wayland: Currently depends on implicit sync in the kernel (accessed
> > > via EGL/OpenGL).  There is an unstable extension to allow passing
> > > sync_files around but it's questionable how useful it is right now
> > > (more on that later).
> > >  - X11: With present, it has these "explicit" fence objects but
> > > they're always a shmfence which lets the X server and client do a
> > > userspace CPU-side hand-off without going over the socket (and
> > > round-tripping through the kernel).  However, the only thing that
> > > fence does is order the OpenGL API calls in the client and server and
> > > the real synchronization is still implicit.
> > >  - linux/i915/gem: Fully supports using sync_file or syncobj for explicit
> > > sync.
> > >  - linux/amdgpu: Supports sync_file and syncobj but it still
> > > implicitly syncs sometimes due to it's internal memory residency
> > > handling which can lead to over-synchronization.
> > >  - KMS: Implicit sync all the way.  There are no KMS APIs which take
> > > explicit sync primitives.
> > 
> > Correction:  Apparently, I missed some things.  If you use atomic, KMS
> > does have explicit in- and out-fences.  Non-atomic users (e.g. X11)
> > are still in trouble but most Wayland compositors use atomic these
> > days
> > 
> > >  - v4l: ???
> > >  - gstreamer: ???
> > >  - Media APIs such as vaapi etc.:  ???
> 
> GStreamer is a consumer of V4L2, VAAPI and other stuff. Using asynchronous
> buffer synchronisation is something we already do with GL (even if limited).
> We place a GLSync object in the pipeline and attach it to the related
> GstBuffer. We wait on these GLSync objects as late as possible (or supersede
> the sync if we queue more work into the same GL context). That requires a
> special mode of operation, of course. We don't usually like making lazy
> blocking calls implicit, as it tends to cause random issues. If we need to
> wait, we think it's better to wait in the module that is responsible, so in
> general we try to negotiate and fall back locally (it's plugin-based, so
> this can get really messy otherwise).
> 
> So basically this problem needs to be solved in V4L2, VAAPI and the other
> lower-level APIs first. We need an API that provides us these fences (in or
> out), and then we can consider using them. For V4L2, there was an attempt,
> but it was a bit of a misfit. Your proposal could work; it would need to be
> tested I guess, but it does not solve some of the other issues that were
> discussed. Notably for camera capture, where the HW timestamp is captured at
> about the same time the frame is ready. But the timestamp is not part of the
> payload, so you need an entire API to asynchronously deliver that metadata.
> It's the biggest pain point I've found; such an API would be quite invasive
> or, if made really generic, might just never be adopted widely enough.

Another issue is that V4L2 doesn't offer any guarantee on job ordering.
When you queue multiple buffers for camera capture, for instance, you
don't know until the capture completes in which buffer the frame has
been captured. In the normal case buffers are processed in sequence, but
if an error occurs during capture, they can be recycled internally and
put to the back of the queue. Unless I'm mistaken, this problem also
exists with stateful codecs. And if you don't know in advance which
buffer you will receive from the device, the usefulness of fences
becomes very questionable :-)

> There are other elements that would implement fencing, notably kmssink, but
> no one has actually dared porting it to atomic KMS, so clearly there is very
> little community interest. glimagesink could clearly benefit. Right now, if
> we import a DMABuf and that DMABuf is used for rendering, an implicit fence
> is attached, which we are unaware of. Philipp Zabel is working on a patch so
> that V4L2 QBUF would wait, but waiting in QBUF is not allowed if O_NONBLOCK
> was set (which GStreamer uses), so the operation will just fail where it
> worked before (breaking userspace). If it was an explicit fence, we could
> handle that cleanly in GStreamer, as we do for new APIs.
> 
> > > ## Chicken and egg problems
> > > 
> > > Ok, this is where it starts getting depressing.  I made the claim
> > > above that Wayland has an explicit synchronization protocol that's of
> > > questionable usefulness.  I would claim that basically any bit of
> > > plumbing we do through window systems is currently of questionable
> > > usefulness.  Why?
> > > 
> > > From my perspective, as a Vulkan driver developer, I have to deal with
> > > the fact that Vulkan is an explicit sync API but Wayland and X11
> > > aren't.  Unfortunately, the Wayland extension solves zero problems for
> > > me because I can't really use it unless it's implemented in all of the
> > > compositors.  Until every Wayland compositor I care about my users
> > > being able to use (which is basically all of them) supports the
> > > extension, I have to continue carry around my pile of hacks to keep
> > > implicit sync and Vulkan working nicely together.
> > > 
> > > From the perspective of a Wayland compositor (I used to play in this
> > > space), they'd love to implement the new explicit sync extension but
> > > can't.  Sure, they could wire up the extension, but the moment they go
> > > to flip a client buffer to the screen directly, they discover that KMS
> > > doesn't support any explicit sync APIs.
> > 
> > As per the above correction, Wayland compositors aren't nearly as bad
> > off as I initially thought.  There may still be weird screen capture
> > cases but the normal cases of compositing and displaying via
> > KMS/atomic should be in reasonably good shape.
> > 
> > > So, yes, they can technically
> > > implement the extension assuming the EGL stack they're running on has
> > > the sync_file extensions but any client buffers which come in using
> > > the explicit sync Wayland extension have to be composited and can't be
> > > scanned out directly.  As a 3D driver developer, I absolutely don't
> > > want compositors doing that because my users will complain about
> > > performance issues due to the extra blit.
> > > 
> > > Ok, so let's say we get KMS wired up with implicit sync.  That solves
> > > all our problems, right?  It does, right up until someone decides that
> > > they wan to screen capture their Wayland session via some hardware
> > > media encoder that doesn't support explicit sync.  Now we have to
> > > plumb it all the way through the media stack, gstreamer, etc.  Great,
> > > so let's do that!  Oh, but gstreamer won't want to plumb it through
> > > until they're guaranteed that they can use explicit sync when
> > > displaying on X11 or Wayland.  Are you seeing the problem?
> > > 
> > > To make matters worse, since most things are doing implicit
> > > synchronization today, it's really easy to get your explicit
> > > synchronization wrong and never notice.  If you forget to pass a
> > > sync_file into one place (say you never notice KMS doesn't support
> > > them), it will probably work anyway thanks to all the implicit sync
> > > that's going on elsewhere.
> > > 
> > > So, clearly, we all need to go write piles of code that we can't
> > > actually properly test until everyone else has written their piece and
> > > then we use explicit sync if and only if all components support it.
> > > Really?  We're going to do multiple years of development and then just
> > > hope it works when we finally flip the switch?  That doesn't sound
> > > like a good plan to me.
> > > 
> > > 
> > > ## A proposal: Implicit and explicit sync together
> > > 
> > > How to solve all these chicken-and-egg problems is something I've been
> > > giving quite a bit of thought (and talking with many others about) in
> > > the last couple of years.  One motivation for this is that we have to
> > > deal with a mismatch in Vulkan.  Another motivation is that I'm
> > > becoming increasingly unhappy with the way that synchronization,
> > > memory residency, and command submission are inherently intertwined in
> > > i915 and would like to break things apart.  Towards that end, I have
> > > an actual proposal.
> > > 
> > > A couple weeks ago, I sent a series of patches to the dri-devel
> > > mailing list which adds a pair of new ioctls to dma-buf which allow
> > > userspace to manually import or export a sync_file from a dma-buf.
> > > The idea is that something like a Wayland compositor can switch to
> > > 100% explicit sync internally once the ioctl is available.  If it gets
> > > buffers in from a client that doesn't use the explicit sync extension,
> > > it can pull a sync_file from the dma-buf and use that exactly as it
> > > would a sync_file passed via the explicit sync extension.  When it
> > > goes to scan out a user buffer and discovers that KMS doesn't accept
> > > sync_files (or if it tries to use that pesky media encoder no one has
> > > converted), it can take it's sync_file for display and stuff it into
> > > the dma-buf before handing it to KMS.
> > > 
> > > Along with the kernel patches, I've also implemented support for this
> > > in the Vulkan WSI code used by ANV and RADV.  With those patches, the
> > > only requirement on the Vulkan drivers is that you be able to export
> > > any VkSemaphore as a sync_file and temporarily import a sync_file into
> > > any VkFence or VkSemaphore.  As long as that works, the core Vulkan
> > > driver only ever sees explicit synchronization via sync_file.  The WSI
> > > code uses these new ioctls to translate the implicit sync of X11 and
> > > Wayland to the explicit sync the Vulkan driver wants.
> > > 
> > > I'm hoping (and here's where I want a sanity check) that a simple API
> > > like this will allow us to finally start moving the Linux ecosystem
> > > over to explicit synchronization one piece at a time in a way that's
> > > actually correct.  (No Wayland explicit sync with compositors hoping
> > > KMS magically works even though it doesn't have a sync_file API.)
> > > Once some pieces in the ecosystem start moving, there will be
> > > motivation to start moving others and maybe we can actually build the
> > > momentum to get most everything converted.
> > > 
> > > For reference, you can find the kernel RFC patches and mesa MR here:
> > > 
> > > https://lists.freedesktop.org/archives/dri-devel/2020-March/258833.html
> > > 
> > > https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/4037
> > > 
> > > At this point, I welcome your thoughts, comments, objections, and
> > > maybe even help/review. :-)
> > > 
> > > --Jason Ekstrand
> 

-- 
Regards,

Laurent Pinchart
_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: Plumbing explicit synchronization through the Linux ecosystem
  2020-03-16 10:20       ` Laurent Pinchart
  (?)
@ 2020-03-16 12:55       ` Tomek Bury
  2020-03-16 13:01           ` Laurent Pinchart
  2020-03-16 14:19           ` Daniel Stone
  -1 siblings, 2 replies; 101+ messages in thread
From: Tomek Bury @ 2020-03-16 12:55 UTC (permalink / raw)
  To: Laurent Pinchart
  Cc: Daniel Vetter, xorg-devel, Maling list - DRI developers,
	wayland-devel @ lists . freedesktop . org,
	Discussion of the development of and with GStreamer,
	Jason Ekstrand, ML mesa-dev, Nicolas Dufresne, linux-media



Hi Jason,

I was wrestling with the sync problems in Wayland some time ago, but
only with regard to 3D drivers.

The guarantee given by the GL/GLES spec is limited to a single graphics
context. If the same buffer is accessed by two contexts, the outcome is
unspecified; cross-context and cross-process synchronisation is not
guaranteed. It happens to work on Mesa, because the read/write locking is
implemented in kernel space, but it didn't work on the Broadcom driver,
which has its read/write interlocks in user space.
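
Within a single process the portable workaround is an explicit EGL fence.
A minimal sketch, assuming the implementation exposes EGL_KHR_fence_sync
and EGL_KHR_wait_sync:

#include <EGL/egl.h>
#include <EGL/eglext.h>

/* Producer context: insert a fence right after the rendering commands. */
static EGLSyncKHR fence_after_render(EGLDisplay dpy)
{
	PFNEGLCREATESYNCKHRPROC create_sync =
		(PFNEGLCREATESYNCKHRPROC)eglGetProcAddress("eglCreateSyncKHR");

	return create_sync(dpy, EGL_SYNC_FENCE_KHR, NULL);
}

/* Consumer context: make its GPU queue wait before texturing. */
static void wait_before_texturing(EGLDisplay dpy, EGLSyncKHR sync)
{
	PFNEGLWAITSYNCKHRPROC wait_sync =
		(PFNEGLWAITSYNCKHRPROC)eglGetProcAddress("eglWaitSyncKHR");

	wait_sync(dpy, sync, 0);
}

None of that helps across processes, though, which is where the trouble
below starts.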

A Vulkan client makes it even worse because of conflicting requirements:
Vulkan's vkQueuePresentKHR() passes in a number of semaphores but disallows
waiting. The Wayland WSI requires wl_surface_commit() to be called from
vkQueuePresentKHR(), which does require a wait, unless a synchronisation
primitive representing the Vulkan semaphores is passed between the Vulkan
client and the compositor.
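
If the driver exposes VK_KHR_external_semaphore_fd with sync-fd support,
one way out is for the WSI to export the present-wait semaphore as a
sync_file and hand that to the compositor instead of blocking in
vkQueuePresentKHR(). Roughly, as a sketch with error handling omitted:

#include <vulkan/vulkan.h>

/* Export the payload of a semaphore passed to vkQueuePresentKHR() as a
 * sync_file fd that can be sent to the compositor (for example over a
 * Wayland explicit-sync protocol). */
static int export_present_wait_fence(VkDevice dev, VkSemaphore sem)
{
	const VkSemaphoreGetFdInfoKHR info = {
		.sType = VK_STRUCTURE_TYPE_SEMAPHORE_GET_FD_INFO_KHR,
		.semaphore = sem,
		.handleType = VK_EXTERNAL_SEMAPHORE_HANDLE_TYPE_SYNC_FD_BIT,
	};
	PFN_vkGetSemaphoreFdKHR get_fd =
		(PFN_vkGetSemaphoreFdKHR)
			vkGetDeviceProcAddr(dev, "vkGetSemaphoreFdKHR");
	int fd = -1;

	if (get_fd(dev, &info, &fd) != VK_SUCCESS)
		return -1;

	return fd;	/* signalled when the client's rendering completes */
}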

The most troublesome part was the Wayland buffer release mechanism, as it
only involves CPU signalling over Wayland IPC, without any 3D driver
involvement. The choices were: an explicit synchronisation extension, a
buffer copy in the compositor (i.e. the compositor textures from the copy,
so the client can re-write the original), or some implicit synchronisation
in kernel space (but that wasn't an option in the Broadcom driver).
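
That buffer-release case is what the unstable
zwp_linux_explicit_synchronization_v1 Wayland protocol is meant to cover.
A rough client-side sketch, assuming the compositor advertises the global
and the usual generated C bindings are in use:

#include <wayland-client.h>
#include "linux-explicit-synchronization-unstable-v1-client-protocol.h"

/* Attach an acquire fence for the next commit and ask for a fenced
 * release, so the client knows when it may safely reuse the buffer. */
static void commit_with_fences(struct wl_surface *surface,
			       struct zwp_linux_surface_synchronization_v1 *sync,
			       struct wl_buffer *buffer, int acquire_fence_fd,
			       const struct zwp_linux_buffer_release_v1_listener *listener,
			       void *data)
{
	struct zwp_linux_buffer_release_v1 *release;

	zwp_linux_surface_synchronization_v1_set_acquire_fence(sync,
								acquire_fence_fd);
	release = zwp_linux_surface_synchronization_v1_get_release(sync);
	zwp_linux_buffer_release_v1_add_listener(release, listener, data);

	wl_surface_attach(surface, buffer, 0, 0);
	wl_surface_commit(surface);
}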

With regard to V4L2, I believe it could easily work the same way as the 3D
drivers, i.e. pass a buffer+fence pair to the next stage. Encode always
succeeds, but for capture or decode the main problem is the uncertain
outcome, I believe? If we're fine with rendering or displaying an
occasional broken frame, then a buffer+fence pair would work too. The
broken frame will go into the pipeline, but the application can drain the
pipeline and start over once the capture works again.

To answer some points raised by Laurent (although I'm unfamiliar with the
camera drivers):

> you don't know until the capture completes in which buffer the frame
> has been captured

Surely you do; you just don't know in advance whether the capture will
be successful.

> but if an error occurs during capture, they can be recycled internally
> and put to the back of the queue.

That would have to change in order to use explicit synchronisation. Every
started capture becomes immediately available as a buffer+fence pair. The
fence is signalled once the capture is finished (successfully or
otherwise). The buffer must not be reused until it's released, possibly
with another fence; in that case the buffer must not be reused until the
release fence is signalled.
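
Note that the sync_file UAPI can already carry that "successfully or
otherwise" outcome: once the fence has signalled, SYNC_IOC_FILE_INFO
reports a negative status on error, so a consumer could drop broken
frames without needing a separate side channel. A small sketch:

#include <sys/ioctl.h>
#include <linux/sync_file.h>

/* After the capture fence has signalled, check whether the job completed
 * successfully (status 1) or with an error (negative status). */
static int capture_succeeded(int fence_fd)
{
	struct sync_file_info info = { 0 };

	if (ioctl(fence_fd, SYNC_IOC_FILE_INFO, &info) < 0)
		return -1;

	return info.status == 1;
}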

Cheers,
Tomek

On Mon, 16 Mar 2020 at 10:20, Laurent Pinchart <
laurent.pinchart@ideasonboard.com> wrote:

> On Wed, Mar 11, 2020 at 04:18:55PM -0400, Nicolas Dufresne wrote:
> > (I know I'm going to be spammed by so many mailing list ...)
> >
> > Le mercredi 11 mars 2020 à 14:21 -0500, Jason Ekstrand a écrit :
> > > On Wed, Mar 11, 2020 at 12:31 PM Jason Ekstrand <jason@jlekstrand.net>
> wrote:
> > > > All,
> > > >
> > > > Sorry for casting such a broad net with this one. I'm sure most
> people
> > > > who reply will get at least one mailing list rejection.  However,
> this
> > > > is an issue that affects a LOT of components and that's why it's
> > > > thorny to begin with.  Please pardon the length of this e-mail as
> > > > well; I promise there's a concrete point/proposal at the end.
> > > >
> > > >
> > > > Explicit synchronization is the future of graphics and media.  At
> > > > least, that seems to be the consensus among all the graphics people
> > > > I've talked to.  I had a chat with one of the lead Android graphics
> > > > engineers recently who told me that doing explicit sync from the
> start
> > > > was one of the best engineering decisions Android ever made.  It's
> > > > also the direction being taken by more modern APIs such as Vulkan.
> > > >
> > > >
> > > > ## What are implicit and explicit synchronization?
> > > >
> > > > For those that aren't familiar with this space, GPUs, media encoders,
> > > > etc. are massively parallel and synchronization of some form is
> > > > required to ensure that everything happens in the right order and
> > > > avoid data races.  Implicit synchronization is when bits of work (3D,
> > > > compute, video encode, etc.) are implicitly based on the absolute
> > > > CPU-time order in which API calls occur.  Explicit synchronization is
> > > > when the client (whatever that means in any given context) provides
> > > > the dependency graph explicitly via some sort of synchronization
> > > > primitives.  If you're still confused, consider the following
> > > > examples:
> > > >
> > > > With OpenGL and EGL, almost everything is implicit sync.  Say you
> have
> > > > two OpenGL contexts sharing an image where one writes to it and the
> > > > other textures from it.  The way the OpenGL spec works, the client
> has
> > > > to make the API calls to render to the image before (in CPU time) it
> > > > makes the API calls which texture from the image.  As long as it does
> > > > this (and maybe inserts a glFlush?), the driver will ensure that the
> > > > rendering completes before the texturing happens and you get correct
> > > > contents.
> > > >
> > > > Implicit synchronization can also happen across processes.  Wayland,
> > > > for instance, is currently built on implicit sync where the client
> > > > does their rendering and then does a hand-off (via
> wl_surface::commit)
> > > > to tell the compositor it's done at which point the compositor can
> now
> > > > texture from the surface.  The hand-off ensures that the client's
> > > > OpenGL API calls happen before the server's OpenGL API calls.
> > > >
> > > > A good example of explicit synchronization is the Vulkan API.  There,
> > > > a client (or multiple clients) can simultaneously build command
> > > > buffers in different threads where one of those command buffers
> > > > renders to an image and the other textures from it and then submit
> > > > both of them at the same time with instructions to the driver for
> > > > which order to execute them in.  The execution order is described via
> > > > the VkSemaphore primitive.  With the new VK_KHR_timeline_semaphore
> > > > extension, you can even submit the work which does the texturing
> > > > BEFORE the work which does the rendering and the driver will sort it
> > > > out.
> > > >
> > > > The #1 problem with implicit synchronization (which explicit solves)
> > > > is that it leads to a lot of over-synchronization both in client
> space
> > > > and in driver/device space.  The client has to synchronize a lot more
> > > > because it has to ensure that the API calls happen in a particular
> > > > order.  The driver/device have to synchronize a lot more because they
> > > > never know what is going to end up being a synchronization point as
> an
> > > > API call on another thread/process may occur at any time.  As we move
> > > > to more and more multi-threaded programming this synchronization (on
> > > > the client-side especially) becomes more and more painful.
> > > >
> > > >
> > > > ## Current status in Linux
> > > >
> > > > Implicit synchronization in Linux works via a the kernel's internal
> > > > dma_buf and dma_fence data structures.  A dma_fence is a tiny object
> > > > which represents the "done" status for some bit of work.  Typically,
> > > > dma_fences are created as a by-product of someone submitting some bit
> > > > of work (say, 3D rendering) to the kernel.  The dma_buf object has a
> > > > set of dma_fences on it representing shared (read) and exclusive
> > > > (write) access to the object.  When work is submitted which, for
> > > > instance renders to the dma_buf, it's queued waiting on all the
> fences
> > > > on the dma_buf and and a dma_fence is created representing the end of
> > > > said rendering work and it's installed as the dma_buf's exclusive
> > > > fence.  This way, the kernel can manage all its internal queues (3D
> > > > rendering, display, video encode, etc.) and know which things to
> > > > submit in what order.
> > > >
> > > > For the last few years, we've had sync_file in the kernel and it's
> > > > plumbed into some drivers.  A sync_file is just a wrapper around a
> > > > single dma_fence.  A sync_file is typically created as a by-product
> of
> > > > submitting work (3D, compute, etc.) to the kernel and is signaled
> when
> > > > that work completes.  When a sync_file is created, it is guaranteed
> by
> > > > the kernel that it will become signaled in finite time and, once it's
> > > > signaled, it remains signaled for the rest of time.  A sync_file is
> > > > represented in UAPIs as a file descriptor and can be used with normal
> > > > file APIs such as dup().  It can be passed into another UAPI which
> > > > does some bit of queue'd work and the submitted work will wait for
> the
> > > > sync_file to be triggered before executing.  A sync_file also
> supports
> > > > poll() if  you want to wait on it manually.
> > > >
> > > > Unfortunately, sync_file is not broadly used and not all kernel GPU
> > > > drivers support it.  Here's a very quick overview of my understanding
> > > > of the status of various components (I don't know the status of
> > > > anything in the media world):
> > > >
> > > >  - Vulkan: Explicit synchronization all the way but we have to go
> > > > implicit as soon as we interact with a window-system.  Vulkan has
> APIs
> > > > to import/export sync_files to/from it's VkSemaphore and VkFence
> > > > synchronization primitives.
> > > >  - OpenGL: Implicit all the way.  There are some EGL extensions to
> > > > enable some forms of explicit sync via sync_file but OpenGL itself is
> > > > still implicit.
> > > >  - Wayland: Currently depends on implicit sync in the kernel
> (accessed
> > > > via EGL/OpenGL).  There is an unstable extension to allow passing
> > > > sync_files around but it's questionable how useful it is right now
> > > > (more on that later).
> > > >  - X11: With present, it has these "explicit" fence objects but
> > > > they're always a shmfence which lets the X server and client do a
> > > > userspace CPU-side hand-off without going over the socket (and
> > > > round-tripping through the kernel).  However, the only thing that
> > > > fence does is order the OpenGL API calls in the client and server and
> > > > the real synchronization is still implicit.
> > > >  - linux/i915/gem: Fully supports using sync_file or syncobj for
> explicit
> > > > sync.
> > > >  - linux/amdgpu: Supports sync_file and syncobj but it still
> > > > implicitly syncs sometimes due to it's internal memory residency
> > > > handling which can lead to over-synchronization.
> > > >  - KMS: Implicit sync all the way.  There are no KMS APIs which take
> > > > explicit sync primitives.
> > >
> > > Correction:  Apparently, I missed some things.  If you use atomic, KMS
> > > does have explicit in- and out-fences.  Non-atomic users (e.g. X11)
> > > are still in trouble but most Wayland compositors use atomic these
> > > days
> > >
> > > >  - v4l: ???
> > > >  - gstreamer: ???
> > > >  - Media APIs such as vaapi etc.:  ???
> >
> > GStreamer is consumer for V4L2, VAAPI and other stuff. Using
> asynchronous buffer
> > synchronisation is something we do already with GL (even if limited). We
> place
> > GLSync object in the pipeline and attach that on related GstBuffer. We
> wait on
> > these GLSync as late as possible (or superseed the sync if we queue more
> work
> > into the same GL context). That requires a special mode of operation of
> course.
> > We don't usually like making lazy blocking call implicit, as it tends to
> cause
> > random issues. If we need to wait, we think it's better to wait int he
> module
> > that is responsible, so in general, we try to negotiate and fallback
> locally
> > (it's plugin base, so this can be really messy otherwise).
> >
> > So basically this problem needs to be solved in V4L2, VAAPI and other
> lower
> > level APIs first. We need API that provides us these fence (in or out),
> and then
> > we can consider using them. For V4L2, there was an attempt, but it was a
> bit of
> > a miss-fit. Your proposal could work, need to be tested I guess, but it
> does not
> > solve some of other issues that was discussed. Notably for camera
> capture, were
> > the HW timestamp is capture about at the same time the frame is ready.
> But the
> > timestamp is not part of the paylaod, so you need an entire API
> asynchronously
> > deliver that metadata. It's the biggest pain point I've found, such an
> API would
> > be quite invasive or if made really generic, might just never be adopted
> widely
> > enough.
>
> Another issue is that V4L2 doesn't offer any guarantee on job ordering.
> When you queue multiple buffers for camera capture for instance, you
> don't know until capture complete in which buffer the frame has been
> captured. In the normal case buffers are processed in sequence, but if
> an error occurs during capture, they can be recycled internally and put
> to the back of the queue. Unless I'm mistaken, this problem also exists
> with stateful codecs. And if you don't know in advance which buffer you
> will receive from the device, the usefulness of fences becomes very
> questionable :-)
>
> > There is other elements that would implement fencing, notably kmssink,
> but no
> > one actually dared porting it to atomic KMS, so clearly there is very
> little
> > comunity interest. glimagsink could clearly benifit. Right now if we
> import a
> > DMABuf, and that this DMAbuf is used for render, a implicit fence is
> attached,
> > which we are unaware. Philippe Zabbel is working on a patch, so V4L2
> QBUF would
> > wait, but waiting in QBUF is not allowed if O_NONBLOCK was set (which
> GStreamer
> > uses), so then the operation will just fail where it worked before
> (breaking
> > userspace). If it was an explcit fence, we could handle that in GStreamer
> > cleanly as we do for new APIs.
> >
> > > > ## Chicken and egg problems
> > > >
> > > > Ok, this is where it starts getting depressing.  I made the claim
> > > > above that Wayland has an explicit synchronization protocol that's of
> > > > questionable usefulness.  I would claim that basically any bit of
> > > > plumbing we do through window systems is currently of questionable
> > > > usefulness.  Why?
> > > >
> > > > From my perspective, as a Vulkan driver developer, I have to deal
> with
> > > > the fact that Vulkan is an explicit sync API but Wayland and X11
> > > > aren't.  Unfortunately, the Wayland extension solves zero problems
> for
> > > > me because I can't really use it unless it's implemented in all of
> the
> > > > compositors.  Until every Wayland compositor I care about my users
> > > > being able to use (which is basically all of them) supports the
> > > > extension, I have to continue carry around my pile of hacks to keep
> > > > implicit sync and Vulkan working nicely together.
> > > >
> > > > From the perspective of a Wayland compositor (I used to play in this
> > > > space), they'd love to implement the new explicit sync extension but
> > > > can't.  Sure, they could wire up the extension, but the moment they
> go
> > > > to flip a client buffer to the screen directly, they discover that
> KMS
> > > > doesn't support any explicit sync APIs.
> > >
> > > As per the above correction, Wayland compositors aren't nearly as bad
> > > off as I initially thought.  There may still be weird screen capture
> > > cases but the normal cases of compositing and displaying via
> > > KMS/atomic should be in reasonably good shape.
> > >
> > > > So, yes, they can technically
> > > > implement the extension assuming the EGL stack they're running on has
> > > > the sync_file extensions but any client buffers which come in using
> > > > the explicit sync Wayland extension have to be composited and can't
> be
> > > > scanned out directly.  As a 3D driver developer, I absolutely don't
> > > > want compositors doing that because my users will complain about
> > > > performance issues due to the extra blit.
> > > >
> > > > Ok, so let's say we get KMS wired up with implicit sync.  That solves
> > > > all our problems, right?  It does, right up until someone decides
> that
> > > > they wan to screen capture their Wayland session via some hardware
> > > > media encoder that doesn't support explicit sync.  Now we have to
> > > > plumb it all the way through the media stack, gstreamer, etc.  Great,
> > > > so let's do that!  Oh, but gstreamer won't want to plumb it through
> > > > until they're guaranteed that they can use explicit sync when
> > > > displaying on X11 or Wayland.  Are you seeing the problem?
> > > >
> > > > To make matters worse, since most things are doing implicit
> > > > synchronization today, it's really easy to get your explicit
> > > > synchronization wrong and never notice.  If you forget to pass a
> > > > sync_file into one place (say you never notice KMS doesn't support
> > > > them), it will probably work anyway thanks to all the implicit sync
> > > > that's going on elsewhere.
> > > >
> > > > So, clearly, we all need to go write piles of code that we can't
> > > > actually properly test until everyone else has written their piece
> and
> > > > then we use explicit sync if and only if all components support it.
> > > > Really?  We're going to do multiple years of development and then
> just
> > > > hope it works when we finally flip the switch?  That doesn't sound
> > > > like a good plan to me.
> > > >
> > > >
> > > > ## A proposal: Implicit and explicit sync together
> > > >
> > > > How to solve all these chicken-and-egg problems is something I've
> been
> > > > giving quite a bit of thought (and talking with many others about) in
> > > > the last couple of years.  One motivation for this is that we have to
> > > > deal with a mismatch in Vulkan.  Another motivation is that I'm
> > > > becoming increasingly unhappy with the way that synchronization,
> > > > memory residency, and command submission are inherently intertwined
> in
> > > > i915 and would like to break things apart.  Towards that end, I have
> > > > an actual proposal.
> > > >
> > > > A couple weeks ago, I sent a series of patches to the dri-devel
> > > > mailing list which adds a pair of new ioctls to dma-buf which allow
> > > > userspace to manually import or export a sync_file from a dma-buf.
> > > > The idea is that something like a Wayland compositor can switch to
> > > > 100% explicit sync internally once the ioctl is available.  If it
> gets
> > > > buffers in from a client that doesn't use the explicit sync
> extension,
> > > > it can pull a sync_file from the dma-buf and use that exactly as it
> > > > would a sync_file passed via the explicit sync extension.  When it
> > > > goes to scan out a user buffer and discovers that KMS doesn't accept
> > > > sync_files (or if it tries to use that pesky media encoder no one has
> > > > converted), it can take it's sync_file for display and stuff it into
> > > > the dma-buf before handing it to KMS.
> > > >
> > > > Along with the kernel patches, I've also implemented support for this
> > > > in the Vulkan WSI code used by ANV and RADV.  With those patches, the
> > > > only requirement on the Vulkan drivers is that you be able to export
> > > > any VkSemaphore as a sync_file and temporarily import a sync_file
> into
> > > > any VkFence or VkSemaphore.  As long as that works, the core Vulkan
> > > > driver only ever sees explicit synchronization via sync_file.  The
> WSI
> > > > code uses these new ioctls to translate the implicit sync of X11 and
> > > > Wayland to the explicit sync the Vulkan driver wants.
> > > >
> > > > I'm hoping (and here's where I want a sanity check) that a simple API
> > > > like this will allow us to finally start moving the Linux ecosystem
> > > > over to explicit synchronization one piece at a time in a way that's
> > > > actually correct.  (No Wayland explicit sync with compositors hoping
> > > > KMS magically works even though it doesn't have a sync_file API.)
> > > > Once some pieces in the ecosystem start moving, there will be
> > > > motivation to start moving others and maybe we can actually build the
> > > > momentum to get most everything converted.
> > > >
> > > > For reference, you can find the kernel RFC patches and mesa MR here:
> > > >
> > > >
> https://lists.freedesktop.org/archives/dri-devel/2020-March/258833.html
> > > >
> > > > https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/4037
> > > >
> > > > At this point, I welcome your thoughts, comments, objections, and
> > > > maybe even help/review. :-)
> > > >
> > > > --Jason Ekstrand
> >
>
> --
> Regards,
>
> Laurent Pinchart
> _______________________________________________
> wayland-devel mailing list
> wayland-devel@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/wayland-devel
>


_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: Plumbing explicit synchronization through the Linux ecosystem
  2020-03-16 12:55       ` Tomek Bury
@ 2020-03-16 13:01           ` Laurent Pinchart
  2020-03-16 14:19           ` Daniel Stone
  1 sibling, 0 replies; 101+ messages in thread
From: Laurent Pinchart @ 2020-03-16 13:01 UTC (permalink / raw)
  To: Tomek Bury
  Cc: Nicolas Dufresne, Daniel Vetter, xorg-devel,
	Maling list - DRI developers,
	wayland-devel @ lists . freedesktop . org,
	Discussion of the development of and with GStreamer,
	Jason Ekstrand, Bas Nieuwenhuizen, ML mesa-dev, Daniel Stone,
	Dave Airlie, linux-media

Hi Tomek,

On Mon, Mar 16, 2020 at 12:55:27PM +0000, Tomek Bury wrote:
> Hi Jason,
> 
> I've been wrestling with the sync problems in Wayland some time ago, but only
> with regards to 3D drivers.
> 
> The guarantee given by the GL/GLES spec is limited to a single graphics
> context. If the same buffer is accessed by 2 contexts the outcome is
> unspecified. The cross-context and cross-process synchronisation is not
> guaranteed. It happens to work on Mesa, because the read/write locking is
> implemented in the kernel space, but it didn't work on Broadcom driver, which
> has read-write interlocks in user space.
> 
>  A Vulkan client makes it even worse because of conflicting requirements:
> Vulkan's vkQueuePresentKHR() passes in a number of semaphores but disallows
> waiting. Wayland WSI requires wl_surface_commit() to be called from
> vkQueuePresentKHR() which does require a wait, unless a synchronisation
> primitive representing Vulkan semaphores is passed between Vulkan client and
> the compositor.
> 
> The most troublesome part was Wayland buffer release mechanism, as it only
> involves a CPU signalling over Wayland IPC, without any 3D driver involvement.
> The choices were: explicit synchronisation extension or a buffer copy in the
> compositor (i.e. compositor textures from the copy, so the client can re-write
> the original), or some implicit synchronisation in kernel space (but that
> wasn't an option in Broadcom driver).
> 
> With regards to V4L2, I believe it could easily work the same way as 3D
> drivers, i.e. pass a buffer+fence pair to the next stage. The encode always
> succeeds, but for capture or decode, the main problem is the uncertain outcome,
> I believe? If we're fine with rendering or displaying an occasional broken
> frame, then buffer+fence pair would work too. The broken frame will go into the
> pipeline, but application can drain the pipeline and start over once the
> capture works again.
> 
> To answer some points raised by Laurent (although I'm unfamiliar with the
> camera drivers):
> 
> > you don't know until capture complete in which buffer the frame has
> > been captured
>
> Surely you do, you only don't know in advance if the capture will be successful

You do in kernelspace, but not in userspace at the moment, due to buffer
recycling.

> > but if an error occurs during capture, they can be recycled internally and
> > put to the back of the queue.
>
> That would have to change in order to use explicit synchronisation. Every
> started capture becomes immediately available as a buffer+fence pair. Fence is
> signalled once the capture is finished (successfully or otherwise). The buffer
> must not be reused until it's released, possibly with another fence - in that
> case the buffer must not be reused until the release fence is signalled.

We could certainly change this at least in some cases, but it would
break existing userspace that doesn't expect incorrect frames.

However, I'm not sure we could change this behaviour in every case;
there may be hardware that can't provide a guarantee on the order in
which buffers will be used. I'm aware this wouldn't be compatible with
explicit synchronization, and that's my point: camera hardware may not
always support explicit synchronization. As long as we can fall back to
not using fences, we should be fine.

-- 
Regards,

Laurent Pinchart

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: Plumbing explicit synchronization through the Linux ecosystem
  2020-03-16 13:01           ` Laurent Pinchart
@ 2020-03-16 13:34             ` Tomek Bury
  -1 siblings, 0 replies; 101+ messages in thread
From: Tomek Bury @ 2020-03-16 13:34 UTC (permalink / raw)
  To: Laurent Pinchart
  Cc: Nicolas Dufresne, Daniel Vetter, xorg-devel,
	Maling list - DRI developers,
	wayland-devel @ lists . freedesktop . org,
	Discussion of the development of and with GStreamer,
	Jason Ekstrand, Bas Nieuwenhuizen, ML mesa-dev, Daniel Stone,
	Dave Airlie, linux-media

>  As long as we can fall back to not using fences then we should be fine.
Buffers written by the camera are trivial because you control what
happens: just don't attach a fence, so that the capture can be used
immediately. For recycled buffers there's an extra bit of work to do,
because it won't be up to the camera driver to decide whether the
buffer comes back with or without a fence.

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: Plumbing explicit synchronization through the Linux ecosystem
  2020-03-16 12:55       ` Tomek Bury
@ 2020-03-16 14:19           ` Daniel Stone
  2020-03-16 14:19           ` Daniel Stone
  1 sibling, 0 replies; 101+ messages in thread
From: Daniel Stone @ 2020-03-16 14:19 UTC (permalink / raw)
  To: Tomek Bury
  Cc: Laurent Pinchart, Nicolas Dufresne, Daniel Vetter, xorg-devel,
	Maling list - DRI developers,
	wayland-devel @ lists . freedesktop . org,
	Discussion of the development of and with GStreamer,
	Jason Ekstrand, Bas Nieuwenhuizen, ML mesa-dev, Dave Airlie,
	open list:DMA BUFFER SHARING FRAMEWORK

Hi Tomek,

On Mon, 16 Mar 2020 at 12:55, Tomek Bury <tomek.bury@gmail.com> wrote:
> I've been wrestling with the sync problems in Wayland some time ago, but only with regards to 3D drivers.
>
> The guarantee given by the GL/GLES spec is limited to a single graphics context. If the same buffer is accessed by 2 contexts the outcome is unspecified. The cross-context and cross-process synchronisation is not guaranteed. It happens to work on Mesa, because the read/write locking is implemented in the kernel space, but it didn't work on Broadcom driver, which has read-write interlocks in user space.

GL and GLES are not relevant. What is relevant is EGL, which defines
interfaces to make things work on the native platform. EGL doesn't
define any kind of synchronisation model for the Wayland, X11, or
GBM/KMS platforms - but it's one of the things which has to work. It
doesn't say that the implementation must make sure that the requested
format is displayable, but you sort of take it for granted that if you
ask EGL to display something it will do so.

Synchronisation is one of those mechanisms which is left to the
platform to implement under the hood. In the absence of platform
support for explicit synchronisation, the synchronisation must be
implicit.

>  A Vulkan client makes it even worse because of conflicting requirements: Vulkan's vkQueuePresentKHR() passes in a number of semaphores but disallows waiting. Wayland WSI requires wl_surface_commit() to be called from vkQueuePresentKHR(), which does require a wait, unless a synchronisation primitive representing the Vulkan semaphores is passed between the Vulkan client and the compositor.

If you are using EGL_WL_bind_wayland_display, then one of the things
it is explicitly allowed/expected to do is to create a Wayland
protocol interface between client and compositor, which can be used to
pass buffer handles and metadata in a platform-specific way. Adding
synchronisation is also possible.
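
For example, where EGL_ANDROID_native_fence_sync is available, the driver
can turn its pending GL work into an explicit sync_file fd and ship that
over such a private protocol - a rough sketch (the extension entry points
are assumed to have been fetched with eglGetProcAddress()):

    #include <EGL/egl.h>
    #include <EGL/eglext.h>
    #include <GLES2/gl2.h>

    /* Rough sketch: export the GL work flushed so far as a sync_file fd.
     * Assumes EGL_ANDROID_native_fence_sync and that the function pointers
     * below were looked up with eglGetProcAddress(). */
    static int export_gl_fence_fd(EGLDisplay dpy,
                                  PFNEGLCREATESYNCKHRPROC create_sync,
                                  PFNEGLDUPNATIVEFENCEFDANDROIDPROC dup_fence_fd,
                                  PFNEGLDESTROYSYNCKHRPROC destroy_sync)
    {
        EGLSyncKHR sync = create_sync(dpy, EGL_SYNC_NATIVE_FENCE_ANDROID, NULL);
        glFlush();                          /* the native fence fd only exists after a flush */
        int fd = dup_fence_fd(dpy, sync);   /* a sync_file fd, or -1 on failure */
        destroy_sync(dpy, sync);
        return fd;                          /* send this over the private protocol */
    }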

> The most troublesome part was Wayland buffer release mechanism, as it only involves a CPU signalling over Wayland IPC, without any 3D driver involvement. The choices were: explicit synchronisation extension or a buffer copy in the compositor (i.e. compositor textures from the copy, so the client can re-write the original), or some implicit synchronisation in kernel space (but that wasn't an option in Broadcom driver).

You can add your own explicit synchronisation extension.

In every cross-process and cross-subsystem usecase, synchronisation is
obviously required. The two options for this are to implement kernel
support for implicit synchronisation (as everyone else has done), or
implement generic support for explicit synchronisation (as we have
been working on with implementations inside Weston and Exosphere at
least), or implement private support for explicit synchronisation, or
do nothing and then be surprised at the lack of synchronisation.

Cheers,
Daniel

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: Plumbing explicit synchronization through the Linux ecosystem
  2020-03-16 10:20       ` Laurent Pinchart
@ 2020-03-16 15:06         ` Jason Ekstrand
  -1 siblings, 0 replies; 101+ messages in thread
From: Jason Ekstrand @ 2020-03-16 15:06 UTC (permalink / raw)
  To: Laurent Pinchart
  Cc: Nicolas Dufresne, ML mesa-dev,
	Discussion of the development of and with GStreamer,
	wayland-devel @ lists . freedesktop . org, xorg-devel,
	Maling list - DRI developers, linux-media, Dave Airlie,
	Daniel Vetter, Bas Nieuwenhuizen, Daniel Stone

On Mon, Mar 16, 2020 at 5:20 AM Laurent Pinchart
<laurent.pinchart@ideasonboard.com> wrote:
>
> On Wed, Mar 11, 2020 at 04:18:55PM -0400, Nicolas Dufresne wrote:
> > (I know I'm going to be spammed by so many mailing list ...)
> >
> > Le mercredi 11 mars 2020 à 14:21 -0500, Jason Ekstrand a écrit :
> > > On Wed, Mar 11, 2020 at 12:31 PM Jason Ekstrand <jason@jlekstrand.net> wrote:
> > > > All,
> > > >
> > > > Sorry for casting such a broad net with this one. I'm sure most people
> > > > who reply will get at least one mailing list rejection.  However, this
> > > > is an issue that affects a LOT of components and that's why it's
> > > > thorny to begin with.  Please pardon the length of this e-mail as
> > > > well; I promise there's a concrete point/proposal at the end.
> > > >
> > > >
> > > > Explicit synchronization is the future of graphics and media.  At
> > > > least, that seems to be the consensus among all the graphics people
> > > > I've talked to.  I had a chat with one of the lead Android graphics
> > > > engineers recently who told me that doing explicit sync from the start
> > > > was one of the best engineering decisions Android ever made.  It's
> > > > also the direction being taken by more modern APIs such as Vulkan.
> > > >
> > > >
> > > > ## What are implicit and explicit synchronization?
> > > >
> > > > For those that aren't familiar with this space, GPUs, media encoders,
> > > > etc. are massively parallel and synchronization of some form is
> > > > required to ensure that everything happens in the right order and
> > > > avoid data races.  Implicit synchronization is when bits of work (3D,
> > > > compute, video encode, etc.) are implicitly based on the absolute
> > > > CPU-time order in which API calls occur.  Explicit synchronization is
> > > > when the client (whatever that means in any given context) provides
> > > > the dependency graph explicitly via some sort of synchronization
> > > > primitives.  If you're still confused, consider the following
> > > > examples:
> > > >
> > > > With OpenGL and EGL, almost everything is implicit sync.  Say you have
> > > > two OpenGL contexts sharing an image where one writes to it and the
> > > > other textures from it.  The way the OpenGL spec works, the client has
> > > > to make the API calls to render to the image before (in CPU time) it
> > > > makes the API calls which texture from the image.  As long as it does
> > > > this (and maybe inserts a glFlush?), the driver will ensure that the
> > > > rendering completes before the texturing happens and you get correct
> > > > contents.
> > > >
> > > > Implicit synchronization can also happen across processes.  Wayland,
> > > > for instance, is currently built on implicit sync where the client
> > > > does their rendering and then does a hand-off (via wl_surface::commit)
> > > > to tell the compositor it's done at which point the compositor can now
> > > > texture from the surface.  The hand-off ensures that the client's
> > > > OpenGL API calls happen before the server's OpenGL API calls.
> > > >
> > > > A good example of explicit synchronization is the Vulkan API.  There,
> > > > a client (or multiple clients) can simultaneously build command
> > > > buffers in different threads where one of those command buffers
> > > > renders to an image and the other textures from it and then submit
> > > > both of them at the same time with instructions to the driver for
> > > > which order to execute them in.  The execution order is described via
> > > > the VkSemaphore primitive.  With the new VK_KHR_timeline_semaphore
> > > > extension, you can even submit the work which does the texturing
> > > > BEFORE the work which does the rendering and the driver will sort it
> > > > out.
> > > >
> > > > The #1 problem with implicit synchronization (which explicit solves)
> > > > is that it leads to a lot of over-synchronization both in client space
> > > > and in driver/device space.  The client has to synchronize a lot more
> > > > because it has to ensure that the API calls happen in a particular
> > > > order.  The driver/device have to synchronize a lot more because they
> > > > never know what is going to end up being a synchronization point as an
> > > > API call on another thread/process may occur at any time.  As we move
> > > > to more and more multi-threaded programming this synchronization (on
> > > > the client-side especially) becomes more and more painful.
> > > >
> > > >
> > > > ## Current status in Linux
> > > >
> > > > Implicit synchronization in Linux works via the kernel's internal
> > > > dma_buf and dma_fence data structures.  A dma_fence is a tiny object
> > > > which represents the "done" status for some bit of work.  Typically,
> > > > dma_fences are created as a by-product of someone submitting some bit
> > > > of work (say, 3D rendering) to the kernel.  The dma_buf object has a
> > > > set of dma_fences on it representing shared (read) and exclusive
> > > > (write) access to the object.  When work is submitted which, for
> > > > instance renders to the dma_buf, it's queued waiting on all the fences
> > > > on the dma_buf and and a dma_fence is created representing the end of
> > > > said rendering work and it's installed as the dma_buf's exclusive
> > > > fence.  This way, the kernel can manage all its internal queues (3D
> > > > rendering, display, video encode, etc.) and know which things to
> > > > submit in what order.
> > > >
> > > > For the last few years, we've had sync_file in the kernel and it's
> > > > plumbed into some drivers.  A sync_file is just a wrapper around a
> > > > single dma_fence.  A sync_file is typically created as a by-product of
> > > > submitting work (3D, compute, etc.) to the kernel and is signaled when
> > > > that work completes.  When a sync_file is created, it is guaranteed by
> > > > the kernel that it will become signaled in finite time and, once it's
> > > > signaled, it remains signaled for the rest of time.  A sync_file is
> > > > represented in UAPIs as a file descriptor and can be used with normal
> > > > file APIs such as dup().  It can be passed into another UAPI which
> > > > does some bit of queue'd work and the submitted work will wait for the
> > > > sync_file to be triggered before executing.  A sync_file also supports
> > > > poll() if  you want to wait on it manually.
> > > >
> > > > Unfortunately, sync_file is not broadly used and not all kernel GPU
> > > > drivers support it.  Here's a very quick overview of my understanding
> > > > of the status of various components (I don't know the status of
> > > > anything in the media world):
> > > >
> > > >  - Vulkan: Explicit synchronization all the way but we have to go
> > > > implicit as soon as we interact with a window-system.  Vulkan has APIs
> > > > to import/export sync_files to/from its VkSemaphore and VkFence
> > > > synchronization primitives.
> > > >  - OpenGL: Implicit all the way.  There are some EGL extensions to
> > > > enable some forms of explicit sync via sync_file but OpenGL itself is
> > > > still implicit.
> > > >  - Wayland: Currently depends on implicit sync in the kernel (accessed
> > > > via EGL/OpenGL).  There is an unstable extension to allow passing
> > > > sync_files around but it's questionable how useful it is right now
> > > > (more on that later).
> > > >  - X11: With present, it has these "explicit" fence objects but
> > > > they're always a shmfence which lets the X server and client do a
> > > > userspace CPU-side hand-off without going over the socket (and
> > > > round-tripping through the kernel).  However, the only thing that
> > > > fence does is order the OpenGL API calls in the client and server and
> > > > the real synchronization is still implicit.
> > > >  - linux/i915/gem: Fully supports using sync_file or syncobj for explicit
> > > > sync.
> > > >  - linux/amdgpu: Supports sync_file and syncobj but it still
> > > > implicitly syncs sometimes due to its internal memory residency
> > > > handling which can lead to over-synchronization.
> > > >  - KMS: Implicit sync all the way.  There are no KMS APIs which take
> > > > explicit sync primitives.
> > >
> > > Correction:  Apparently, I missed some things.  If you use atomic, KMS
> > > does have explicit in- and out-fences.  Non-atomic users (e.g. X11)
> > > are still in trouble but most Wayland compositors use atomic these
> > > days
> > >
> > > >  - v4l: ???
> > > >  - gstreamer: ???
> > > >  - Media APIs such as vaapi etc.:  ???
> >
> > GStreamer is a consumer of V4L2, VAAPI and other stuff. Using asynchronous buffer
> > synchronisation is something we already do with GL (even if limited). We place a
> > GLSync object in the pipeline and attach it to the related GstBuffer. We wait on
> > these GLSync objects as late as possible (or supersede the sync if we queue more
> > work into the same GL context). That requires a special mode of operation of course.
> > We don't usually like making a lazy blocking call implicit, as it tends to cause
> > random issues. If we need to wait, we think it's better to wait in the module
> > that is responsible, so in general we try to negotiate and fall back locally
> > (it's plugin-based, so this can be really messy otherwise).
> >
> > So basically this problem needs to be solved in V4L2, VAAPI and other lower-level
> > APIs first. We need an API that provides us these fences (in or out), and then
> > we can consider using them. For V4L2, there was an attempt, but it was a bit of
> > a misfit. Your proposal could work - it needs to be tested I guess - but it does not
> > solve some of the other issues that were discussed. Notably for camera capture, where
> > the HW timestamp is captured at about the same time the frame is ready. But the
> > timestamp is not part of the payload, so you need an entire API to asynchronously
> > deliver that metadata. It's the biggest pain point I've found; such an API would
> > be quite invasive or, if made really generic, might just never be adopted widely
> > enough.
>
> Another issue is that V4L2 doesn't offer any guarantee on job ordering.
> When you queue multiple buffers for camera capture, for instance, you
> don't know until the capture completes in which buffer the frame has
> been captured.

Is this a Kernel UAPI issue?  Surely the kernel driver knows at the
start of frame capture which buffer it's getting written into.  I
would think that the kernel APIs could be adjusted (if we find good
reason to do so!) such that they return earlier and return a (buffer,
fence) pair.  Am I missing something fundamental about video here?

I must admit that V4L is a bit of an odd case since the kernel driver
is the producer and not the consumer.
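
Purely as a hypothetical strawman (this is not existing V4L2 UAPI, just
the shape of what I'm describing), it could be as little as:

    /* Hypothetical strawman, not existing V4L2 UAPI: a dequeue-like call that
     * returns the buffer identity up front, plus a sync_file fd that signals
     * once the capture into that buffer has actually completed. */
    struct capture_slot {
        unsigned int index;      /* which buffer the frame will land in */
        int          fence_fd;   /* sync_file fd, signalled on capture completion */
    };

    int acquire_next_capture(int video_fd, struct capture_slot *slot);  /* hypothetical */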

> In the normal case buffers are processed in sequence, but if
> an error occurs during capture, they can be recycled internally and put
> to the back of the queue.

Are those errors something that can happen at any time in the middle
of a frame capture?  If so, that does make things stickier.

> Unless I'm mistaken, this problem also exists
> with stateful codecs. And if you don't know in advance which buffer you
> will receive from the device, the usefulness of fences becomes very
> questionable :-)

Yeah, if you really are in a situation where there's no way to know
until the full frame capture has been completed which buffer is next,
then fences are useless.  You aren't in an implicit synchronization
setting either; you're in a "full flush" setting.  It's arguably worse
for performance but perhaps unavoidable?

Trying to understand. :-)

--Jason


> > There are other elements that would implement fencing, notably kmssink, but no
> > one has actually dared porting it to atomic KMS, so clearly there is very little
> > community interest. glimagesink could clearly benefit. Right now, if we import a
> > DMABuf and that DMABuf is used for rendering, an implicit fence is attached
> > which we are unaware of. Philipp Zabel is working on a patch so that V4L2 QBUF would
> > wait, but waiting in QBUF is not allowed if O_NONBLOCK was set (which GStreamer
> > uses), so the operation will just fail where it worked before (breaking
> > userspace). If it was an explicit fence, we could handle that in GStreamer
> > cleanly, as we do for new APIs.
> >
> > > > ## Chicken and egg problems
> > > >
> > > > Ok, this is where it starts getting depressing.  I made the claim
> > > > above that Wayland has an explicit synchronization protocol that's of
> > > > questionable usefulness.  I would claim that basically any bit of
> > > > plumbing we do through window systems is currently of questionable
> > > > usefulness.  Why?
> > > >
> > > > From my perspective, as a Vulkan driver developer, I have to deal with
> > > > the fact that Vulkan is an explicit sync API but Wayland and X11
> > > > aren't.  Unfortunately, the Wayland extension solves zero problems for
> > > > me because I can't really use it unless it's implemented in all of the
> > > > compositors.  Until every Wayland compositor I care about my users
> > > > being able to use (which is basically all of them) supports the
> > > > extension, I have to continue carrying around my pile of hacks to keep
> > > > implicit sync and Vulkan working nicely together.
> > > >
> > > > From the perspective of a Wayland compositor (I used to play in this
> > > > space), they'd love to implement the new explicit sync extension but
> > > > can't.  Sure, they could wire up the extension, but the moment they go
> > > > to flip a client buffer to the screen directly, they discover that KMS
> > > > doesn't support any explicit sync APIs.
> > >
> > > As per the above correction, Wayland compositors aren't nearly as bad
> > > off as I initially thought.  There may still be weird screen capture
> > > cases but the normal cases of compositing and displaying via
> > > KMS/atomic should be in reasonably good shape.
> > >
> > > > So, yes, they can technically
> > > > implement the extension assuming the EGL stack they're running on has
> > > > the sync_file extensions but any client buffers which come in using
> > > > the explicit sync Wayland extension have to be composited and can't be
> > > > scanned out directly.  As a 3D driver developer, I absolutely don't
> > > > want compositors doing that because my users will complain about
> > > > performance issues due to the extra blit.
> > > >
> > > > Ok, so let's say we get KMS wired up with explicit sync.  That solves
> > > > all our problems, right?  It does, right up until someone decides that
> > > > they want to screen capture their Wayland session via some hardware
> > > > media encoder that doesn't support explicit sync.  Now we have to
> > > > plumb it all the way through the media stack, gstreamer, etc.  Great,
> > > > so let's do that!  Oh, but gstreamer won't want to plumb it through
> > > > until they're guaranteed that they can use explicit sync when
> > > > displaying on X11 or Wayland.  Are you seeing the problem?
> > > >
> > > > To make matters worse, since most things are doing implicit
> > > > synchronization today, it's really easy to get your explicit
> > > > synchronization wrong and never notice.  If you forget to pass a
> > > > sync_file into one place (say you never notice KMS doesn't support
> > > > them), it will probably work anyway thanks to all the implicit sync
> > > > that's going on elsewhere.
> > > >
> > > > So, clearly, we all need to go write piles of code that we can't
> > > > actually properly test until everyone else has written their piece and
> > > > then we use explicit sync if and only if all components support it.
> > > > Really?  We're going to do multiple years of development and then just
> > > > hope it works when we finally flip the switch?  That doesn't sound
> > > > like a good plan to me.
> > > >
> > > >
> > > > ## A proposal: Implicit and explicit sync together
> > > >
> > > > How to solve all these chicken-and-egg problems is something I've been
> > > > giving quite a bit of thought (and talking with many others about) in
> > > > the last couple of years.  One motivation for this is that we have to
> > > > deal with a mismatch in Vulkan.  Another motivation is that I'm
> > > > becoming increasingly unhappy with the way that synchronization,
> > > > memory residency, and command submission are inherently intertwined in
> > > > i915 and would like to break things apart.  Towards that end, I have
> > > > an actual proposal.
> > > >
> > > > A couple weeks ago, I sent a series of patches to the dri-devel
> > > > mailing list which adds a pair of new ioctls to dma-buf which allow
> > > > userspace to manually import or export a sync_file from a dma-buf.
> > > > The idea is that something like a Wayland compositor can switch to
> > > > 100% explicit sync internally once the ioctl is available.  If it gets
> > > > buffers in from a client that doesn't use the explicit sync extension,
> > > > it can pull a sync_file from the dma-buf and use that exactly as it
> > > > would a sync_file passed via the explicit sync extension.  When it
> > > > goes to scan out a user buffer and discovers that KMS doesn't accept
> > > > sync_files (or if it tries to use that pesky media encoder no one has
> > > > converted), it can take its sync_file for display and stuff it into
> > > > the dma-buf before handing it to KMS.
> > > >
> > > > Along with the kernel patches, I've also implemented support for this
> > > > in the Vulkan WSI code used by ANV and RADV.  With those patches, the
> > > > only requirement on the Vulkan drivers is that you be able to export
> > > > any VkSemaphore as a sync_file and temporarily import a sync_file into
> > > > any VkFence or VkSemaphore.  As long as that works, the core Vulkan
> > > > driver only ever sees explicit synchronization via sync_file.  The WSI
> > > > code uses these new ioctls to translate the implicit sync of X11 and
> > > > Wayland to the explicit sync the Vulkan driver wants.
> > > >
> > > > I'm hoping (and here's where I want a sanity check) that a simple API
> > > > like this will allow us to finally start moving the Linux ecosystem
> > > > over to explicit synchronization one piece at a time in a way that's
> > > > actually correct.  (No Wayland explicit sync with compositors hoping
> > > > KMS magically works even though it doesn't have a sync_file API.)
> > > > Once some pieces in the ecosystem start moving, there will be
> > > > motivation to start moving others and maybe we can actually build the
> > > > momentum to get most everything converted.
> > > >
> > > > For reference, you can find the kernel RFC patches and mesa MR here:
> > > >
> > > > https://lists.freedesktop.org/archives/dri-devel/2020-March/258833.html
> > > >
> > > > https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/4037
> > > >
> > > > At this point, I welcome your thoughts, comments, objections, and
> > > > maybe even help/review. :-)
> > > >
> > > > --Jason Ekstrand
> >
>
> --
> Regards,
>
> Laurent Pinchart

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: Plumbing explicit synchronization through the Linux ecosystem
  2020-03-16 14:19           ` Daniel Stone
@ 2020-03-16 15:33             ` Tomek Bury
  -1 siblings, 0 replies; 101+ messages in thread
From: Tomek Bury @ 2020-03-16 15:33 UTC (permalink / raw)
  To: Daniel Stone
  Cc: Laurent Pinchart, Nicolas Dufresne, Daniel Vetter, xorg-devel,
	Maling list - DRI developers,
	wayland-devel @ lists . freedesktop . org,
	Discussion of the development of and with GStreamer,
	Jason Ekstrand, Bas Nieuwenhuizen, ML mesa-dev, Dave Airlie,
	open list:DMA BUFFER SHARING FRAMEWORK

> GL and GLES are not relevant. What is relevant is EGL, which defines
> interfaces to make things work on the native platform.
Yes and no. This is what the EGL spec says about sharing a texture between contexts:

"OpenGL and OpenGL ES makes no attempt to synchronize access to
texture objects. If a texture object is bound to more than one
context, then it is up to the programmer to ensure that the contents
of the object are not being changed via one context while another
context is using the texture object for rendering. The results of
changing a texture object while another context is using it are
undefined."

There are similar statements with regard to the lack of
synchronisation guarantees for EGL images or between GL and native
rendering, etc. But the main thing here is that EGL and Vulkan differ
significantly. eglSwapBuffers() is expected to post an unspecified
"back buffer" to the display system using some internal driver magic.
The EGL driver is then expected to obtain another back buffer at some
unspecified point in the future. Vulkan, on the other hand, is very
specific and explicit. vkQueuePresentKHR() is expected to post a
specific VkImage with an explicit set of semaphores. Another
image is obtained through vkAcquireNextImageKHR(), and it's the
application's decision whether it wants a fence, a semaphore, both or
none with the acquired buffer. Implicit synchronisation doesn't
mix well with Vulkan drivers and requires a lot of extra plumbing in
the WSI code.
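
To make the contrast concrete, the explicit hand-over on the Vulkan side
looks roughly like this (standard VK_KHR_swapchain usage; the device,
swapchain, queue and semaphores are assumed to exist already, and error
handling is omitted):

    /* Rough sketch of the explicit Vulkan hand-over (VK_KHR_swapchain).
     * acquire_sem is signalled when the acquired image is safe to render to;
     * render_done_sem is signalled by the client's own submission and is
     * what the present waits on. */
    uint32_t image_index;
    vkAcquireNextImageKHR(device, swapchain, UINT64_MAX,
                          acquire_sem, VK_NULL_HANDLE, &image_index);

    /* ... submit rendering that waits on acquire_sem and signals render_done_sem ... */

    VkPresentInfoKHR present = {
        .sType              = VK_STRUCTURE_TYPE_PRESENT_INFO_KHR,
        .waitSemaphoreCount = 1,
        .pWaitSemaphores    = &render_done_sem,
        .swapchainCount     = 1,
        .pSwapchains        = &swapchain,
        .pImageIndices      = &image_index,
    };
    vkQueuePresentKHR(queue, &present);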

> If you are using EGL_WL_bind_wayland_display, then one of the things
> it is explicitly allowed/expected to do is to create a Wayland
> protocol interface between client and compositor, which can be used to
> pass buffer handles and metadata in a platform-specific way. Adding
> synchronisation is also possible.
Only one-way synchronisation is possible with this mechanism. There's
a standard protocol for recycling buffers - wl_buffer_release() - so
buffer hand-over from the compositor to the client remains
unsynchronised - see below.

> > The most troublesome part was Wayland buffer release mechanism, as it only involves a CPU signalling over Wayland IPC, without any 3D driver involvement. The choices were: explicit synchronisation extension or a buffer copy in the compositor (i.e. compositor textures from the copy, so the client can re-write the original), or some implicit synchronisation in kernel space (but that wasn't an option in Broadcom driver).
>
> You can add your own explicit synchronisation extension.
I could, but that requires implementing it in the driver and in a
number of compositors, therefore the standard
zwp_linux_explicit_synchronization_v1 extension is a much better choice
here than a custom one.
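
A rough sketch of the client side of that extension (the identifiers below
follow the wayland-scanner output for the unstable v1 XML, so treat the
exact names as an assumption):

    /* Rough sketch of per-commit explicit sync with
     * zwp_linux_explicit_synchronization_unstable_v1; the exact generated
     * names are an assumption. */
    struct zwp_linux_surface_synchronization_v1 *surf_sync =
        zwp_linux_explicit_synchronization_v1_get_synchronization(explicit_sync,
                                                                   surface);

    /* acquire: the compositor must not read the attached buffer before this fence */
    zwp_linux_surface_synchronization_v1_set_acquire_fence(surf_sync, acquire_fence_fd);

    /* release: get a fenced (or immediate) release event instead of plain
     * wl_buffer.release, which closes the one-way gap mentioned above */
    struct zwp_linux_buffer_release_v1 *release =
        zwp_linux_surface_synchronization_v1_get_release(surf_sync);

    wl_surface_attach(surface, buffer, 0, 0);
    wl_surface_commit(surface);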

> In every cross-process and cross-subsystem usecase, synchronisation is
> obviously required. The two options for this are to implement kernel
> support for implicit synchronisation (as everyone else has done),
That would require major changes in the driver architecture, or a second
mechanism doing the same thing but in kernel space - both are
non-starters.

> or implement generic support for explicit synchronisation (as we have
> been working on with implementations inside Weston and Exosphere at
> least),
zwp_linux_explicit_synchronization_v1 is a good step forward. I'm
using this extension as the main synchronisation mechanism in the EGL and
Vulkan drivers whenever available. I remember that Gustavo Padovan was
working on explicit sync support in the display system some time ago.
I hope it has been merged into the kernel by now, but I don't know to
what extent it's actually being used.
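
For reference, the atomic KMS side of this looks roughly as follows -
planes expose an "IN_FENCE_FD" property and CRTCs an "OUT_FENCE_PTR"
property (a libdrm-level sketch; the property IDs are assumed to have been
resolved beforehand):

    /* Rough sketch of explicit fences with atomic KMS via libdrm.  The
     * *_prop_id values are assumed to have been looked up already with
     * drmModeObjectGetProperties()/drmModeGetProperty(). */
    int out_fence_fd = -1;
    drmModeAtomicReq *req = drmModeAtomicAlloc();

    /* in-fence: don't scan out the new framebuffer before this sync_file signals */
    drmModeAtomicAddProperty(req, plane_id, in_fence_fd_prop_id, acquire_fence_fd);

    /* out-fence: the kernel writes back a sync_file fd that signals on flip completion */
    drmModeAtomicAddProperty(req, crtc_id, out_fence_ptr_prop_id,
                             (uint64_t)(uintptr_t)&out_fence_fd);

    drmModeAtomicCommit(drm_fd, req, DRM_MODE_ATOMIC_NONBLOCK, NULL);
    drmModeAtomicFree(req);
    /* out_fence_fd is now a sync_file fd for this commit's completion */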

> or implement private support for explicit synchronisation,
If everything else fails, that would be the last-resort scenario, but
it is far from ideal and very costly in terms of implementation and
maintenance, as it would require maintaining custom patches for various
3rd-party components or littering them with multiple custom explicit
synchronisation schemes.

> or do nothing and then be surprised at the lack of synchronisation.
Thank you, but no, thank you :)

Cheers,
Tomek

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: Plumbing explicit synchronization through the Linux ecosystem
  2020-03-16 15:33             ` Tomek Bury
@ 2020-03-16 16:03               ` Tomek Bury
  -1 siblings, 0 replies; 101+ messages in thread
From: Tomek Bury @ 2020-03-16 16:03 UTC (permalink / raw)
  To: Daniel Stone
  Cc: Laurent Pinchart, Nicolas Dufresne, Daniel Vetter, xorg-devel,
	Maling list - DRI developers,
	wayland-devel @ lists . freedesktop . org,
	Discussion of the development of and with GStreamer,
	Jason Ekstrand, Bas Nieuwenhuizen, ML mesa-dev, Dave Airlie,
	open list:DMA BUFFER SHARING FRAMEWORK

> vkAcquireNextImageKHR() [...] it's the application's decision whether it wants a fence, a semaphore, both or none
Correction: "or none" is not allowed

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: Plumbing explicit synchronization through the Linux ecosystem
  2020-03-16 15:33             ` Tomek Bury
@ 2020-03-16 16:04               ` Jason Ekstrand
  -1 siblings, 0 replies; 101+ messages in thread
From: Jason Ekstrand @ 2020-03-16 16:04 UTC (permalink / raw)
  To: Tomek Bury
  Cc: Daniel Stone, Laurent Pinchart, Nicolas Dufresne, Daniel Vetter,
	xorg-devel, Maling list - DRI developers,
	wayland-devel @ lists . freedesktop . org,
	Discussion of the development of and with GStreamer,
	Bas Nieuwenhuizen, ML mesa-dev, Dave Airlie,
	open list:DMA BUFFER SHARING FRAMEWORK

On Mon, Mar 16, 2020 at 10:33 AM Tomek Bury <tomek.bury@gmail.com> wrote:
>
> > GL and GLES are not relevant. What is relevant is EGL, which defines
> > interfaces to make things work on the native platform.
> Yes and no. This is what EGL spec says about sharing a texture between contexts:
>
> "OpenGL and OpenGL ES makes no attempt to synchronize access to
> texture objects. If a texture object is bound to more than one
> context, then it is up to the programmer to ensure that the contents
> of the object are not being changed via one context while another
> context is using the texture object for rendering. The results of
> changing a texture object while another context is using it are
> undefined."
>
> There are similar statements with regards to the lack of
> synchronisation guarantees for EGL images or between GL and native
> rendering, etc. But the main thing here is that EGL and Vulkan differ
> significantly. The eglSwapBuffers() is expected to post an unspecified
> "back buffer" to the display system using some internal driver magic.
> The EGL driver is then expected to obtain another back buffer at some
> unspecified point in the future. Vulkan, on the other hand, is very
> specific and explicit. The vkQueuePresentKHR() is expected to post a
> specific VkImage with an explicit set of semaphores. Another
> image is obtained through vkAcquireNextImageKHR(), and it's the
> application's decision whether it wants a fence, a semaphore, both or
> none with the acquired buffer. The implicit synchronisation doesn't
> mix well with Vulkan drivers and requires a lot of extra plumbing in
> the WSI code.

Yes, and that (the Vulkan issues in particular) is what I'm trying to
fix. :-)  (Among other things...)  Assuming the kernel patch I linked
to lands, your usermode driver could stuff fences in the dma-buf without
having that be part of your kernel driver.  This assumes, of course,
that your kernel driver supports sync_file.
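
To make that concrete, here is a minimal sketch of what the usermode
side could look like, assuming the export/import ioctls land roughly as
proposed (the names below follow <linux/dma-buf.h>; adjust them to
whatever the final uAPI ends up being):

    #include <sys/ioctl.h>
    #include <linux/dma-buf.h>

    /* Pull the dma-buf's current implicit fences out as a sync_file,
     * e.g. a compositor receiving a buffer from a client that doesn't
     * speak any explicit-sync protocol. */
    static int export_implicit_fence(int dmabuf_fd)
    {
            struct dma_buf_export_sync_file args = {
                    .flags = DMA_BUF_SYNC_READ,   /* we want to wait for writers */
                    .fd = -1,
            };

            if (ioctl(dmabuf_fd, DMA_BUF_IOCTL_EXPORT_SYNC_FILE, &args) < 0)
                    return -1;
            return args.fd;   /* sync_file fd, usable as an explicit fence */
    }

    /* The reverse: stuff an explicit "rendering done" fence back into
     * the dma-buf so that implicit-sync consumers (legacy KMS, an
     * unconverted media encoder, ...) will wait for it. */
    static int import_explicit_fence(int dmabuf_fd, int sync_file_fd)
    {
            struct dma_buf_import_sync_file args = {
                    .flags = DMA_BUF_SYNC_WRITE,  /* treat it as a write fence */
                    .fd = sync_file_fd,
            };

            return ioctl(dmabuf_fd, DMA_BUF_IOCTL_IMPORT_SYNC_FILE, &args);
    }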

> > If you are using EGL_WL_bind_wayland_display, then one of the things
> > it is explicitly allowed/expected to do is to create a Wayland
> > protocol interface between client and compositor, which can be used to
> > pass buffer handles and metadata in a platform-specific way. Adding
> > synchronisation is also possible.
> Only one-way synchronisation is possible with this mechanism. There's
> a standard protocol for recycling buffers - wl_buffer_release() so
> buffer hand-over from the compositor to client remains unsynchronised
>
> - see below.
>
> > > The most troublesome part was Wayland buffer release mechanism, as it only involves a CPU signalling over Wayland IPC, without any 3D driver involvement. The choices were: explicit synchronisation extension or a buffer copy in the compositor (i.e. compositor textures from the copy, so the client can re-write the original), or some implicit synchronisation in kernel space (but that wasn't an option in Broadcom driver).
> >
> > You can add your own explicit synchronisation extension.
> I could, but that requires implementing it in the driver and in a
> number of compositors, therefore a standard extension such as
> zwp_linux_explicit_synchronization_v1 is a much better choice here than
> a custom one.

I think you may be missing what Daniel is saying.  Wayland allows you
to do basically anything you want within your client- and server-side
EGL implementations.  That could include the server-side EGL sending
an event with a fence every single time a flush operation happens in
the server-side GL/GLES implementation (that could be glFlush, glFinish,
eglSwapBuffers, or other things).  Since Wayland protocol events are
ordered, the client-side EGL implementation would get the most recent
flush event before it got the wl_buffer::release.  I fully agree that
it's rather cumbersome though.
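
As a rough sketch of that approach (assuming the stack supports
EGL_ANDROID_native_fence_sync; the delivery over the driver's private
Wayland interface is hand-waved in the final comment):

    #include <EGL/egl.h>
    #include <EGL/eglext.h>
    #include <GLES2/gl2.h>

    /* Resolved once via eglGetProcAddress(). */
    static PFNEGLCREATESYNCKHRPROC create_sync;
    static PFNEGLDESTROYSYNCKHRPROC destroy_sync;
    static PFNEGLDUPNATIVEFENCEFDANDROIDPROC dup_fence_fd;

    /* Called by the server-side EGL after each batch of work that
     * touched the client's buffer. */
    static int export_flush_fence(EGLDisplay dpy)
    {
            EGLSyncKHR sync;
            int fence_fd;

            /* Insert a fence that signals when everything submitted so
             * far in this context has completed. */
            sync = create_sync(dpy, EGL_SYNC_NATIVE_FENCE_ANDROID, NULL);
            if (sync == EGL_NO_SYNC_KHR)
                    return -1;

            /* The native fence fd only materialises once the fence
             * command has been flushed to the GPU. */
            glFlush();

            fence_fd = dup_fence_fd(dpy, sync);
            destroy_sync(dpy, sync);

            /* A driver-private wl_* event would then carry this fd to
             * the client-side EGL, which waits on (or merges) it before
             * reusing the buffer after wl_buffer::release. */
            return fence_fd;
    }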

> > In every cross-process and cross-subsystem usecase, synchronisation is
> > obviously required. The two options for this are to implement kernel
> > support for implicit synchronisation (as everyone else has done),
> That would require major changes in the driver architecture or a second
> mechanism doing the same thing but in kernel space - both are
> non-starters.
>
> > or implement generic support for explicit synchronisation (as we have
> > been working on with implementations inside Weston and Exosphere at
> > least),
> The zwp_linux_explicit_synchronization_v1 is a good step forward. I'm
> using this extension as a main synchronisation mechanism in EGL and
> Vulkan driver whenever available. I remember that Gustavo Padovan was
> working on explicit sync support in the display system some time ago.
> I hope it has been merged into the kernel by now, but I don't know to what
> extent it's actually being used.

It is supported by KMS/atomic.  Legacy KMS, however, does not support it.
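
For reference, with atomic the in-fence is the per-plane IN_FENCE_FD
property and an out-fence can be requested per CRTC via OUT_FENCE_PTR.
Very roughly (the property IDs would really be discovered with
drmModeObjectGetProperties(), and the rest of the plane/CRTC state is
assumed to be set up already):

    #include <stdint.h>
    #include <xf86drm.h>
    #include <xf86drmMode.h>

    /* Flip fb_id onto plane_id, telling the kernel to wait for
     * in_fence_fd (assumed to be a valid sync_file fd) first, and ask
     * for a fence that signals when the new frame is being scanned out. */
    static int atomic_flip_with_fences(int drm_fd, uint32_t plane_id,
                                       uint32_t crtc_id, uint32_t fb_id,
                                       uint32_t prop_fb_id,
                                       uint32_t prop_in_fence_fd,
                                       uint32_t prop_out_fence_ptr,
                                       int in_fence_fd, int *out_fence_fd)
    {
            drmModeAtomicReq *req = drmModeAtomicAlloc();
            int64_t out_fence = -1;
            int ret;

            drmModeAtomicAddProperty(req, plane_id, prop_fb_id, fb_id);
            drmModeAtomicAddProperty(req, plane_id, prop_in_fence_fd,
                                     in_fence_fd);
            drmModeAtomicAddProperty(req, crtc_id, prop_out_fence_ptr,
                                     (uint64_t)(uintptr_t)&out_fence);

            ret = drmModeAtomicCommit(drm_fd, req,
                                      DRM_MODE_ATOMIC_NONBLOCK, NULL);
            drmModeAtomicFree(req);

            *out_fence_fd = (int)out_fence;  /* a sync_file fd, or -1 */
            return ret;
    }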

> > or implement private support for explicit synchronisation,
> If everything else fails, that would be the last resort scenario, but
> far from ideal and very costly in terms of implementation and
> maintenance as it would require maintaining custom patches for various
> 3rd party components or littering them with multiple custom explicit
> synchronisation schemes.

If you want to see explicit synchronization everywhere, I would very
much like to see more developers driving its adoption.  I implemented
support in the Intel Vulkan driver last week:

https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/4169

Hopefully, that will provide some motivation for other compositors
(kwin, gnome-shell, etc.) because they now have a real user of it in
an upstream driver for a major desktop platform and not just a few
Weston examples.  However, someone is going to have to drive the
actual development in those compositors.  I'd be very happy if more
people got involved. :-)

--Jason

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: Plumbing explicit synchronization through the Linux ecosystem
  2020-03-16 15:33             ` Tomek Bury
@ 2020-03-16 16:04               ` Daniel Stone
  -1 siblings, 0 replies; 101+ messages in thread
From: Daniel Stone @ 2020-03-16 16:04 UTC (permalink / raw)
  To: Tomek Bury
  Cc: Laurent Pinchart, Nicolas Dufresne, Daniel Vetter, xorg-devel,
	Maling list - DRI developers,
	wayland-devel @ lists . freedesktop . org,
	Discussion of the development of and with GStreamer,
	Jason Ekstrand, Bas Nieuwenhuizen, ML mesa-dev, Dave Airlie,
	open list:DMA BUFFER SHARING FRAMEWORK

Hi,

On Mon, 16 Mar 2020 at 15:33, Tomek Bury <tomek.bury@gmail.com> wrote:
> > GL and GLES are not relevant. What is relevant is EGL, which defines
> > interfaces to make things work on the native platform.
> Yes and no. This is what EGL spec says about sharing a texture between contexts:

Contexts are different though ...

> There are similar statements with regards to the lack of
> synchronisation guarantees for EGL images or between GL and native
> rendering, etc.

This also isn't about native rendering.

> But the main thing here is that EGL and Vulkan differ
> significantly.

Sure, I totally agree.

> The eglSwapBuffers() is expected to post an unspecified
> "back buffer" to the display system using some internal driver magic.
> The EGL driver is then expected to obtain another back buffer at some
> unspecified point in the future.

Yes, this is rather the point: EGL doesn't specify platform-related
'black magic' to make things just work, because that's part of the
platform implementation details. And, as things stand, on Linux one of
those things is implicit synchronisation, unless the desired end state
of your driver is no synchronisation.

This thread is a discussion about changing that.

> > If you are using EGL_WL_bind_wayland_display, then one of the things
> > it is explicitly allowed/expected to do is to create a Wayland
> > protocol interface between client and compositor, which can be used to
> > pass buffer handles and metadata in a platform-specific way. Adding
> > synchronisation is also possible.
> Only one-way synchronisation is possible with this mechanism. There's
> a standard protocol for recycling buffers - wl_buffer_release() so
> buffer hand-over from the compositor to client remains unsynchronised
> - see below.

That's not true; you can post back a sync token every time the client
buffer is used by the compositor.

> > > The most troublesome part was Wayland buffer release mechanism, as it only involves a CPU signalling over Wayland IPC, without any 3D driver involvement. The choices were: explicit synchronisation extension or a buffer copy in the compositor (i.e. compositor textures from the copy, so the client can re-write the original), or some implicit synchronisation in kernel space (but that wasn't an option in Broadcom driver).
> >
> > You can add your own explicit synchronisation extension.
> I could, but that requires implementing it in the driver and in a
> number of compositors, therefore a standard extension such as
> zwp_linux_explicit_synchronization_v1 is a much better choice here than
> a custom one.

EGL_WL_bind_wayland_display is explicitly designed to allow each
driver to implement its own private extensions without modifying
compositors. For instance, Mesa adds the `wl_drm` extension, which is
used for bidirectional communication between the EGL implementations
in the client and compositor address spaces, without modifying either.

> > In every cross-process and cross-subsystem usecase, synchronisation is
> > obviously required. The two options for this are to implement kernel
> > support for implicit synchronisation (as everyone else has done),
> That would require major changes in the driver architecture or a second
> mechanism doing the same thing but in kernel space - both are
> non-starters.

OK. As it stands, everyone else has the kernel mechanism (e.g. via
dmabuf resv), so in this case if you are reinventing the underlying
platform in a proprietary stack, you get to solve the same problems
yourselves.

Cheers,
Daniel

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: Plumbing explicit synchronization through the Linux ecosystem
  2020-03-16 16:04               ` Daniel Stone
@ 2020-03-16 17:11                 ` Tomek Bury
  -1 siblings, 0 replies; 101+ messages in thread
From: Tomek Bury @ 2020-03-16 17:11 UTC (permalink / raw)
  To: Daniel Stone
  Cc: Laurent Pinchart, Nicolas Dufresne, Daniel Vetter, xorg-devel,
	Maling list - DRI developers,
	wayland-devel @ lists . freedesktop . org,
	Discussion of the development of and with GStreamer,
	Jason Ekstrand, Bas Nieuwenhuizen, ML mesa-dev, Dave Airlie,
	open list:DMA BUFFER SHARING FRAMEWORK

> That's not true; you can post back a sync token every time the client
> buffer is used by the compositor.
Technically, yes, but it's very cumbersome and invasive to the point
where it becomes impractical. Explicit sync is a much cleaner solution.

> For instance, Mesa adds the `wl_drm` extension, which is
> used for bidirectional communication between the EGL implementations
> in the client and compositor address spaces, without modifying either.
The Broadcom driver adds a "wl_nexus" extension which serves a similar
purpose for both EGL and Vulkan WSI.

> OK. As it stands, everyone else has the kernel mechanism (e.g. via
> dmabuf resv), so in this case if you are reinventing the underlying
> platform in a proprietary stack, you get to solve the same problems
> yourselves.
That's an important point. In the explicit synchronisation scenario,
the sync token is passed along with the buffer. It becomes irrelevant
where the token originated, as long as it's a commonly used type of
token, i.e. a dma_fence in kernel space or a sync_file fd in user
space. That allows for greater flexibility and works with and without
dma reservation objects.
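
And on the user-space side such a token is just a file descriptor, so
the generic sync_file uAPI is enough to combine and wait on tokens
regardless of which driver produced them, e.g.:

    #include <poll.h>
    #include <string.h>
    #include <sys/ioctl.h>
    #include <linux/sync_file.h>

    /* Merge two sync_file fds into one that signals only when both
     * have signalled (useful when a buffer has multiple producers). */
    static int sync_merge(int fd1, int fd2)
    {
            struct sync_merge_data data;

            memset(&data, 0, sizeof(data));
            strcpy(data.name, "merged");
            data.fd2 = fd2;

            if (ioctl(fd1, SYNC_IOC_MERGE, &data) < 0)
                    return -1;
            return data.fence;   /* the new, merged sync_file fd */
    }

    /* Block until the fence signals, for consumers that can't queue
     * the dependency anywhere smarter. */
    static int sync_wait(int fd, int timeout_ms)
    {
            struct pollfd pfd = { .fd = fd, .events = POLLIN };

            return poll(&pfd, 1, timeout_ms);
    }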

Cheers,
Tomek

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [Mesa-dev] Plumbing explicit synchronization through the Linux ecosystem
  2020-03-16  9:57         ` Michel Dänzer
  (?)
@ 2020-03-16 18:33         ` Marek Olšák
  2020-03-17 10:01             ` Michel Dänzer
  -1 siblings, 1 reply; 101+ messages in thread
From: Marek Olšák @ 2020-03-16 18:33 UTC (permalink / raw)
  To: Michel Dänzer
  Cc: Daniel Vetter, xorg-devel, Maling list - DRI developers,
	wayland-devel @ lists . freedesktop . org,
	Discussion of the development of and with GStreamer,
	Jason Ekstrand, ML mesa-dev, linux-media


On Mon, Mar 16, 2020 at 5:57 AM Michel Dänzer <michel@daenzer.net> wrote:

> On 2020-03-16 4:50 a.m., Marek Olšák wrote:
> > The synchronization works because the Mesa driver waits for idle (drains
> > the GFX pipeline) at the end of command buffers and there is only 1
> > graphics queue, so everything is ordered.
> >
> > The GFX pipeline runs asynchronously to the command buffer, meaning the
> > command buffer only starts draws and doesn't wait for completion. If the
> > Mesa driver didn't wait at the end of the command buffer, the command
> > buffer would finish and a different process could start execution of its
> > own command buffer while shaders of the previous process are still
> > running.
> >
> > If the Mesa driver submits a command buffer internally (because it's
> > full),
> > it doesn't wait, so the GFX pipeline doesn't notice that a command buffer
> > ended and a new one started.
> >
> > The waiting at the end of command buffers happens only when the flush is
> > external (Swap buffers, glFlush).
> >
> > It's a performance problem, because the GFX queue is blocked until the
> > GFX pipeline is drained at the end of every frame at least.
> >
> > So explicit fences for SwapBuffers would help.
>
> Not sure what difference it would make, since the same thing needs to be
> done for explicit fences as well, doesn't it?
>

No. Explicit fences don't require userspace to wait for idle in the command
buffer. Fences are signalled when the last draw is complete and caches are
flushed. Before that happens, any command buffer that is not dependent on
the fence can start execution. There is never a need for the GPU to be idle
if there is enough independent work to do.
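
To illustrate, on the Vulkan side the producer can simply export the
fence for its last submission instead of draining anything; a minimal
sketch, assuming VK_KHR_external_semaphore_fd with sync_fd support, a
semaphore created as exportable, and error handling omitted:

    #include <vulkan/vulkan.h>

    /* Submit a command buffer and return a sync_file fd that signals
     * when that work - and only that work - has completed. */
    static int submit_and_export_fence(VkDevice dev, VkQueue queue,
                                       VkCommandBuffer cmd, VkSemaphore sem,
                                       PFN_vkGetSemaphoreFdKHR get_semaphore_fd)
    {
            int fd = -1;

            const VkSubmitInfo submit = {
                    .sType = VK_STRUCTURE_TYPE_SUBMIT_INFO,
                    .commandBufferCount = 1,
                    .pCommandBuffers = &cmd,
                    .signalSemaphoreCount = 1,
                    .pSignalSemaphores = &sem,
            };
            vkQueueSubmit(queue, 1, &submit, VK_NULL_HANDLE);

            const VkSemaphoreGetFdInfoKHR get_fd_info = {
                    .sType = VK_STRUCTURE_TYPE_SEMAPHORE_GET_FD_INFO_KHR,
                    .semaphore = sem,
                    .handleType = VK_EXTERNAL_SEMAPHORE_HANDLE_TYPE_SYNC_FD_BIT,
            };
            get_semaphore_fd(dev, &get_fd_info, &fd);

            /* 'fd' is the explicit "rendering done" fence for this frame;
             * hand it to the compositor/encoder, no GPU drain required. */
            return fd;
    }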

Marek


^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: Plumbing explicit synchronization through the Linux ecosystem
  2020-03-16 15:06         ` Jason Ekstrand
@ 2020-03-16 21:15           ` Laurent Pinchart
  -1 siblings, 0 replies; 101+ messages in thread
From: Laurent Pinchart @ 2020-03-16 21:15 UTC (permalink / raw)
  To: Jason Ekstrand
  Cc: Nicolas Dufresne, ML mesa-dev,
	Discussion of the development of and with GStreamer,
	wayland-devel @ lists . freedesktop . org, xorg-devel,
	Maling list - DRI developers, linux-media, Dave Airlie,
	Daniel Vetter, Bas Nieuwenhuizen, Daniel Stone

Hi Jason,

On Mon, Mar 16, 2020 at 10:06:07AM -0500, Jason Ekstrand wrote:
> On Mon, Mar 16, 2020 at 5:20 AM Laurent Pinchart wrote:
> > On Wed, Mar 11, 2020 at 04:18:55PM -0400, Nicolas Dufresne wrote:
> >> (I know I'm going to be spammed by so many mailing list ...)
> >>
> >> Le mercredi 11 mars 2020 à 14:21 -0500, Jason Ekstrand a écrit :
> >>> On Wed, Mar 11, 2020 at 12:31 PM Jason Ekstrand <jason@jlekstrand.net> wrote:
> >>>> All,
> >>>>
> >>>> Sorry for casting such a broad net with this one. I'm sure most people
> >>>> who reply will get at least one mailing list rejection.  However, this
> >>>> is an issue that affects a LOT of components and that's why it's
> >>>> thorny to begin with.  Please pardon the length of this e-mail as
> >>>> well; I promise there's a concrete point/proposal at the end.
> >>>>
> >>>>
> >>>> Explicit synchronization is the future of graphics and media.  At
> >>>> least, that seems to be the consensus among all the graphics people
> >>>> I've talked to.  I had a chat with one of the lead Android graphics
> >>>> engineers recently who told me that doing explicit sync from the start
> >>>> was one of the best engineering decisions Android ever made.  It's
> >>>> also the direction being taken by more modern APIs such as Vulkan.
> >>>>
> >>>>
> >>>> ## What are implicit and explicit synchronization?
> >>>>
> >>>> For those that aren't familiar with this space, GPUs, media encoders,
> >>>> etc. are massively parallel and synchronization of some form is
> >>>> required to ensure that everything happens in the right order and
> >>>> avoid data races.  Implicit synchronization is when bits of work (3D,
> >>>> compute, video encode, etc.) are implicitly based on the absolute
> >>>> CPU-time order in which API calls occur.  Explicit synchronization is
> >>>> when the client (whatever that means in any given context) provides
> >>>> the dependency graph explicitly via some sort of synchronization
> >>>> primitives.  If you're still confused, consider the following
> >>>> examples:
> >>>>
> >>>> With OpenGL and EGL, almost everything is implicit sync.  Say you have
> >>>> two OpenGL contexts sharing an image where one writes to it and the
> >>>> other textures from it.  The way the OpenGL spec works, the client has
> >>>> to make the API calls to render to the image before (in CPU time) it
> >>>> makes the API calls which texture from the image.  As long as it does
> >>>> this (and maybe inserts a glFlush?), the driver will ensure that the
> >>>> rendering completes before the texturing happens and you get correct
> >>>> contents.
> >>>>
> >>>> Implicit synchronization can also happen across processes.  Wayland,
> >>>> for instance, is currently built on implicit sync where the client
> >>>> does their rendering and then does a hand-off (via wl_surface::commit)
> >>>> to tell the compositor it's done at which point the compositor can now
> >>>> texture from the surface.  The hand-off ensures that the client's
> >>>> OpenGL API calls happen before the server's OpenGL API calls.
> >>>>
> >>>> A good example of explicit synchronization is the Vulkan API.  There,
> >>>> a client (or multiple clients) can simultaneously build command
> >>>> buffers in different threads where one of those command buffers
> >>>> renders to an image and the other textures from it and then submit
> >>>> both of them at the same time with instructions to the driver for
> >>>> which order to execute them in.  The execution order is described via
> >>>> the VkSemaphore primitive.  With the new VK_KHR_timeline_semaphore
> >>>> extension, you can even submit the work which does the texturing
> >>>> BEFORE the work which does the rendering and the driver will sort it
> >>>> out.
> >>>>
> >>>> The #1 problem with implicit synchronization (which explicit solves)
> >>>> is that it leads to a lot of over-synchronization both in client space
> >>>> and in driver/device space.  The client has to synchronize a lot more
> >>>> because it has to ensure that the API calls happen in a particular
> >>>> order.  The driver/device have to synchronize a lot more because they
> >>>> never know what is going to end up being a synchronization point as an
> >>>> API call on another thread/process may occur at any time.  As we move
> >>>> to more and more multi-threaded programming this synchronization (on
> >>>> the client-side especially) becomes more and more painful.
> >>>>
> >>>>
> >>>> ## Current status in Linux
> >>>>
> >>>> Implicit synchronization in Linux works via a the kernel's internal
> >>>> dma_buf and dma_fence data structures.  A dma_fence is a tiny object
> >>>> which represents the "done" status for some bit of work.  Typically,
> >>>> dma_fences are created as a by-product of someone submitting some bit
> >>>> of work (say, 3D rendering) to the kernel.  The dma_buf object has a
> >>>> set of dma_fences on it representing shared (read) and exclusive
> >>>> (write) access to the object.  When work is submitted which, for
> >>>> instance renders to the dma_buf, it's queued waiting on all the fences
> >>>> on the dma_buf and and a dma_fence is created representing the end of
> >>>> said rendering work and it's installed as the dma_buf's exclusive
> >>>> fence.  This way, the kernel can manage all its internal queues (3D
> >>>> rendering, display, video encode, etc.) and know which things to
> >>>> submit in what order.
> >>>>
> >>>> For the last few years, we've had sync_file in the kernel and it's
> >>>> plumbed into some drivers.  A sync_file is just a wrapper around a
> >>>> single dma_fence.  A sync_file is typically created as a by-product of
> >>>> submitting work (3D, compute, etc.) to the kernel and is signaled when
> >>>> that work completes.  When a sync_file is created, it is guaranteed by
> >>>> the kernel that it will become signaled in finite time and, once it's
> >>>> signaled, it remains signaled for the rest of time.  A sync_file is
> >>>> represented in UAPIs as a file descriptor and can be used with normal
> >>>> file APIs such as dup().  It can be passed into another UAPI which
> >>>> does some bit of queue'd work and the submitted work will wait for the
> >>>> sync_file to be triggered before executing.  A sync_file also supports
> >>>> poll() if  you want to wait on it manually.
> >>>>
> >>>> Unfortunately, sync_file is not broadly used and not all kernel GPU
> >>>> drivers support it.  Here's a very quick overview of my understanding
> >>>> of the status of various components (I don't know the status of
> >>>> anything in the media world):
> >>>>
> >>>>  - Vulkan: Explicit synchronization all the way but we have to go
> >>>> implicit as soon as we interact with a window-system.  Vulkan has APIs
> >>>> to import/export sync_files to/from it's VkSemaphore and VkFence
> >>>> synchronization primitives.
> >>>>  - OpenGL: Implicit all the way.  There are some EGL extensions to
> >>>> enable some forms of explicit sync via sync_file but OpenGL itself is
> >>>> still implicit.
> >>>>  - Wayland: Currently depends on implicit sync in the kernel (accessed
> >>>> via EGL/OpenGL).  There is an unstable extension to allow passing
> >>>> sync_files around but it's questionable how useful it is right now
> >>>> (more on that later).
> >>>>  - X11: With present, it has these "explicit" fence objects but
> >>>> they're always a shmfence which lets the X server and client do a
> >>>> userspace CPU-side hand-off without going over the socket (and
> >>>> round-tripping through the kernel).  However, the only thing that
> >>>> fence does is order the OpenGL API calls in the client and server and
> >>>> the real synchronization is still implicit.
> >>>>  - linux/i915/gem: Fully supports using sync_file or syncobj for explicit
> >>>> sync.
> >>>>  - linux/amdgpu: Supports sync_file and syncobj but it still
> >>>> implicitly syncs sometimes due to it's internal memory residency
> >>>> handling which can lead to over-synchronization.
> >>>>  - KMS: Implicit sync all the way.  There are no KMS APIs which take
> >>>> explicit sync primitives.
> >>>
> >>> Correction:  Apparently, I missed some things.  If you use atomic, KMS
> >>> does have explicit in- and out-fences.  Non-atomic users (e.g. X11)
> >>> are still in trouble but most Wayland compositors use atomic these
> >>> days
> >>>
> >>>>  - v4l: ???
> >>>>  - gstreamer: ???
> >>>>  - Media APIs such as vaapi etc.:  ???
> >>
> >> GStreamer is a consumer for V4L2, VAAPI and other stuff. Using asynchronous buffer
> >> synchronisation is something we do already with GL (even if limited). We place a
> >> GLSync object in the pipeline and attach it to the related GstBuffer. We wait on
> >> these GLSync objects as late as possible (or supersede the sync if we queue more work
> >> into the same GL context). That requires a special mode of operation, of course.
> >> We don't usually like making lazy blocking calls implicit, as it tends to cause
> >> random issues. If we need to wait, we think it's better to wait in the module
> >> that is responsible, so in general we try to negotiate and fall back locally
> >> (it's plugin-based, so this can be really messy otherwise).
> >>
> >> So basically this problem needs to be solved in V4L2, VAAPI and other lower
> >> level APIs first. We need an API that provides us these fences (in or out), and then
> >> we can consider using them. For V4L2, there was an attempt, but it was a bit of
> >> a misfit. Your proposal could work, it needs to be tested I guess, but it does not
> >> solve some of the other issues that were discussed. Notably for camera capture, where
> >> the HW timestamp is captured at about the same time the frame is ready. But the
> >> timestamp is not part of the payload, so you need an entire API to asynchronously
> >> deliver that metadata. It's the biggest pain point I've found; such an API would
> >> be quite invasive or, if made really generic, might just never be adopted widely
> >> enough.
> >
> > Another issue is that V4L2 doesn't offer any guarantee on job ordering.
> > When you queue multiple buffers for camera capture for instance, you
> > don't know until the capture completes in which buffer the frame has been
> > captured.
> 
> Is this a Kernel UAPI issue?  Surely the kernel driver knows at the
> start of frame capture which buffer it's getting written into.  I
> would think that the kernel APIs could be adjusted (if we find good
> reason to do so!) such that they return earlier and return a (buffer,
> fence) pair.  Am I missing something fundamental about video here?

For cameras I believe we could do that, yes. I was pointing out the
issues caused by the current API. For video decoders I'll let Nicolas
answer the question; he's way more knowledgeable than I am on that
topic.
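
Just to illustrate the "(buffer, fence) pair" idea - purely
hypothetical, this is not an existing or proposed V4L2 uAPI - the
dequeue path could then hand back something like:

    #include <linux/types.h>

    /* Hypothetical: returned as soon as the driver knows which buffer
     * the frame will land in, instead of after the capture completes. */
    struct v4l2_buffer_done {
            __u32 index;      /* index of the v4l2 buffer being filled */
            __s32 fence_fd;   /* sync_file that signals when capture
                               * into that buffer has actually finished */
    };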

> I must admit that V4L is a bit of an odd case since the kernel driver
> is the producer and not the consumer.

Note that V4L2 can be a consumer too. Video output with V4L2 is less
frequent than video capture (but it still exists), and codecs and other
memory-to-memory processing devices (colorspace converters, scalers,
...) are both consumers and producers.

> > In the normal case buffers are processed in sequence, but if
> > an error occurs during capture, they can be recycled internally and put
> > to the back of the queue.
> 
> Are those errors something that can happen at any time in the middle
> of a frame capture?  If so, that does make things stickier.

Yes it can. Think of packet loss when capturing from a USB webcam for
instance. 

> > Unless I'm mistaken, this problem also exists
> > with stateful codecs. And if you don't know in advance which buffer you
> > will receive from the device, the usefulness of fences becomes very
> > questionable :-)
> 
> Yeah, if you really are in a situation where there's no way to know
> until the full frame capture has been completed which buffer is next,
> then fences are useless.  You aren't in an implicit synchronization
> setting either; you're in a "full flush" setting.  It's arguably worse
> for performance but perhaps unavoidable?

Probably unavoidable in some cases, but nothing that should get in the
way of the discussion at hand: there's no need to migrate away from
implicit sync when there's no implicit sync in the first place :-)

I think we need to analyse the use cases here, and figure out at least
guidelines for userspace, otherwise applications will wonder what
behaviour to implement, and we'll end up with a wide variety of them.
Even just on the kernel side, some V4L2 capture drivers will pass
erroneous frames to userspace (thus guaranteeing ordering, but without
early notification of errors), some will requeue the frame
automatically, and at least one (uvcvideo) has a module parameter to
pick the desired behaviour.

> Trying to understand. :-)

So am I :-)

> >> There are other elements that would implement fencing, notably kmssink, but no
> >> one actually dared to port it to atomic KMS, so clearly there is very little
> >> community interest. glimagesink could clearly benefit. Right now, if we import a
> >> DMABuf and that DMABuf is used for rendering, an implicit fence is attached,
> >> which we are unaware of. Philipp Zabel is working on a patch so that V4L2 QBUF would
> >> wait, but waiting in QBUF is not allowed if O_NONBLOCK was set (which GStreamer
> >> uses), so the operation will just fail where it worked before (breaking
> >> userspace). If it was an explicit fence, we could handle that in GStreamer
> >> cleanly as we do for new APIs.
> >>
> >>>> ## Chicken and egg problems
> >>>>
> >>>> Ok, this is where it starts getting depressing.  I made the claim
> >>>> above that Wayland has an explicit synchronization protocol that's of
> >>>> questionable usefulness.  I would claim that basically any bit of
> >>>> plumbing we do through window systems is currently of questionable
> >>>> usefulness.  Why?
> >>>>
> >>>> From my perspective, as a Vulkan driver developer, I have to deal with
> >>>> the fact that Vulkan is an explicit sync API but Wayland and X11
> >>>> aren't.  Unfortunately, the Wayland extension solves zero problems for
> >>>> me because I can't really use it unless it's implemented in all of the
> >>>> compositors.  Until every Wayland compositor I care about my users
> >>>> being able to use (which is basically all of them) supports the
> >>>> extension, I have to continue carry around my pile of hacks to keep
> >>>> implicit sync and Vulkan working nicely together.
> >>>>
> >>>> From the perspective of a Wayland compositor (I used to play in this
> >>>> space), they'd love to implement the new explicit sync extension but
> >>>> can't.  Sure, they could wire up the extension, but the moment they go
> >>>> to flip a client buffer to the screen directly, they discover that KMS
> >>>> doesn't support any explicit sync APIs.
> >>>
> >>> As per the above correction, Wayland compositors aren't nearly as bad
> >>> off as I initially thought.  There may still be weird screen capture
> >>> cases but the normal cases of compositing and displaying via
> >>> KMS/atomic should be in reasonably good shape.
> >>>
> >>>> So, yes, they can technically
> >>>> implement the extension assuming the EGL stack they're running on has
> >>>> the sync_file extensions but any client buffers which come in using
> >>>> the explicit sync Wayland extension have to be composited and can't be
> >>>> scanned out directly.  As a 3D driver developer, I absolutely don't
> >>>> want compositors doing that because my users will complain about
> >>>> performance issues due to the extra blit.
> >>>>
> >>>> Ok, so let's say we get KMS wired up with implicit sync.  That solves
> >>>> all our problems, right?  It does, right up until someone decides that
> >>>> they wan to screen capture their Wayland session via some hardware
> >>>> media encoder that doesn't support explicit sync.  Now we have to
> >>>> plumb it all the way through the media stack, gstreamer, etc.  Great,
> >>>> so let's do that!  Oh, but gstreamer won't want to plumb it through
> >>>> until they're guaranteed that they can use explicit sync when
> >>>> displaying on X11 or Wayland.  Are you seeing the problem?
> >>>>
> >>>> To make matters worse, since most things are doing implicit
> >>>> synchronization today, it's really easy to get your explicit
> >>>> synchronization wrong and never notice.  If you forget to pass a
> >>>> sync_file into one place (say you never notice KMS doesn't support
> >>>> them), it will probably work anyway thanks to all the implicit sync
> >>>> that's going on elsewhere.
> >>>>
> >>>> So, clearly, we all need to go write piles of code that we can't
> >>>> actually properly test until everyone else has written their piece and
> >>>> then we use explicit sync if and only if all components support it.
> >>>> Really?  We're going to do multiple years of development and then just
> >>>> hope it works when we finally flip the switch?  That doesn't sound
> >>>> like a good plan to me.
> >>>>
> >>>>
> >>>> ## A proposal: Implicit and explicit sync together
> >>>>
> >>>> How to solve all these chicken-and-egg problems is something I've been
> >>>> giving quite a bit of thought (and talking with many others about) in
> >>>> the last couple of years.  One motivation for this is that we have to
> >>>> deal with a mismatch in Vulkan.  Another motivation is that I'm
> >>>> becoming increasingly unhappy with the way that synchronization,
> >>>> memory residency, and command submission are inherently intertwined in
> >>>> i915 and would like to break things apart.  Towards that end, I have
> >>>> an actual proposal.
> >>>>
> >>>> A couple weeks ago, I sent a series of patches to the dri-devel
> >>>> mailing list which adds a pair of new ioctls to dma-buf which allow
> >>>> userspace to manually import or export a sync_file from a dma-buf.
> >>>> The idea is that something like a Wayland compositor can switch to
> >>>> 100% explicit sync internally once the ioctl is available.  If it gets
> >>>> buffers in from a client that doesn't use the explicit sync extension,
> >>>> it can pull a sync_file from the dma-buf and use that exactly as it
> >>>> would a sync_file passed via the explicit sync extension.  When it
> >>>> goes to scan out a user buffer and discovers that KMS doesn't accept
> >>>> sync_files (or if it tries to use that pesky media encoder no one has
> >>>> converted), it can take it's sync_file for display and stuff it into
> >>>> the dma-buf before handing it to KMS.
> >>>>
> >>>> Along with the kernel patches, I've also implemented support for this
> >>>> in the Vulkan WSI code used by ANV and RADV.  With those patches, the
> >>>> only requirement on the Vulkan drivers is that you be able to export
> >>>> any VkSemaphore as a sync_file and temporarily import a sync_file into
> >>>> any VkFence or VkSemaphore.  As long as that works, the core Vulkan
> >>>> driver only ever sees explicit synchronization via sync_file.  The WSI
> >>>> code uses these new ioctls to translate the implicit sync of X11 and
> >>>> Wayland to the explicit sync the Vulkan driver wants.
> >>>>
> >>>> I'm hoping (and here's where I want a sanity check) that a simple API
> >>>> like this will allow us to finally start moving the Linux ecosystem
> >>>> over to explicit synchronization one piece at a time in a way that's
> >>>> actually correct.  (No Wayland explicit sync with compositors hoping
> >>>> KMS magically works even though it doesn't have a sync_file API.)
> >>>> Once some pieces in the ecosystem start moving, there will be
> >>>> motivation to start moving others and maybe we can actually build the
> >>>> momentum to get most everything converted.
> >>>>
> >>>> For reference, you can find the kernel RFC patches and mesa MR here:
> >>>>
> >>>> https://lists.freedesktop.org/archives/dri-devel/2020-March/258833.html
> >>>>
> >>>> https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/4037
> >>>>
> >>>> At this point, I welcome your thoughts, comments, objections, and
> >>>> maybe even help/review. :-)

-- 
Regards,

Laurent Pinchart

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: Plumbing explicit synchronization through the Linux ecosystem
@ 2020-03-16 21:15           ` Laurent Pinchart
  0 siblings, 0 replies; 101+ messages in thread
From: Laurent Pinchart @ 2020-03-16 21:15 UTC (permalink / raw)
  To: Jason Ekstrand
  Cc: Daniel Vetter, xorg-devel, Maling list - DRI developers,
	wayland-devel @ lists . freedesktop . org,
	Discussion of the development of and with GStreamer, ML mesa-dev,
	Nicolas Dufresne, linux-media

Hi Jason,

On Mon, Mar 16, 2020 at 10:06:07AM -0500, Jason Ekstrand wrote:
> On Mon, Mar 16, 2020 at 5:20 AM Laurent Pinchart wrote:
> > On Wed, Mar 11, 2020 at 04:18:55PM -0400, Nicolas Dufresne wrote:
> >> (I know I'm going to be spammed by so many mailing list ...)
> >>
> >> Le mercredi 11 mars 2020 à 14:21 -0500, Jason Ekstrand a écrit :
> >>> On Wed, Mar 11, 2020 at 12:31 PM Jason Ekstrand <jason@jlekstrand.net> wrote:
> >>>> All,
> >>>>
> >>>> Sorry for casting such a broad net with this one. I'm sure most people
> >>>> who reply will get at least one mailing list rejection.  However, this
> >>>> is an issue that affects a LOT of components and that's why it's
> >>>> thorny to begin with.  Please pardon the length of this e-mail as
> >>>> well; I promise there's a concrete point/proposal at the end.
> >>>>
> >>>>
> >>>> Explicit synchronization is the future of graphics and media.  At
> >>>> least, that seems to be the consensus among all the graphics people
> >>>> I've talked to.  I had a chat with one of the lead Android graphics
> >>>> engineers recently who told me that doing explicit sync from the start
> >>>> was one of the best engineering decisions Android ever made.  It's
> >>>> also the direction being taken by more modern APIs such as Vulkan.
> >>>>
> >>>>
> >>>> ## What are implicit and explicit synchronization?
> >>>>
> >>>> For those that aren't familiar with this space, GPUs, media encoders,
> >>>> etc. are massively parallel and synchronization of some form is
> >>>> required to ensure that everything happens in the right order and
> >>>> avoid data races.  Implicit synchronization is when bits of work (3D,
> >>>> compute, video encode, etc.) are implicitly based on the absolute
> >>>> CPU-time order in which API calls occur.  Explicit synchronization is
> >>>> when the client (whatever that means in any given context) provides
> >>>> the dependency graph explicitly via some sort of synchronization
> >>>> primitives.  If you're still confused, consider the following
> >>>> examples:
> >>>>
> >>>> With OpenGL and EGL, almost everything is implicit sync.  Say you have
> >>>> two OpenGL contexts sharing an image where one writes to it and the
> >>>> other textures from it.  The way the OpenGL spec works, the client has
> >>>> to make the API calls to render to the image before (in CPU time) it
> >>>> makes the API calls which texture from the image.  As long as it does
> >>>> this (and maybe inserts a glFlush?), the driver will ensure that the
> >>>> rendering completes before the texturing happens and you get correct
> >>>> contents.
> >>>>
> >>>> Implicit synchronization can also happen across processes.  Wayland,
> >>>> for instance, is currently built on implicit sync where the client
> >>>> does their rendering and then does a hand-off (via wl_surface::commit)
> >>>> to tell the compositor it's done at which point the compositor can now
> >>>> texture from the surface.  The hand-off ensures that the client's
> >>>> OpenGL API calls happen before the server's OpenGL API calls.
> >>>>
> >>>> A good example of explicit synchronization is the Vulkan API.  There,
> >>>> a client (or multiple clients) can simultaneously build command
> >>>> buffers in different threads where one of those command buffers
> >>>> renders to an image and the other textures from it and then submit
> >>>> both of them at the same time with instructions to the driver for
> >>>> which order to execute them in.  The execution order is described via
> >>>> the VkSemaphore primitive.  With the new VK_KHR_timeline_semaphore
> >>>> extension, you can even submit the work which does the texturing
> >>>> BEFORE the work which does the rendering and the driver will sort it
> >>>> out.
> >>>>
> >>>> The #1 problem with implicit synchronization (which explicit solves)
> >>>> is that it leads to a lot of over-synchronization both in client space
> >>>> and in driver/device space.  The client has to synchronize a lot more
> >>>> because it has to ensure that the API calls happen in a particular
> >>>> order.  The driver/device have to synchronize a lot more because they
> >>>> never know what is going to end up being a synchronization point as an
> >>>> API call on another thread/process may occur at any time.  As we move
> >>>> to more and more multi-threaded programming this synchronization (on
> >>>> the client-side especially) becomes more and more painful.
> >>>>
> >>>>
> >>>> ## Current status in Linux
> >>>>
> >>>> Implicit synchronization in Linux works via a the kernel's internal
> >>>> dma_buf and dma_fence data structures.  A dma_fence is a tiny object
> >>>> which represents the "done" status for some bit of work.  Typically,
> >>>> dma_fences are created as a by-product of someone submitting some bit
> >>>> of work (say, 3D rendering) to the kernel.  The dma_buf object has a
> >>>> set of dma_fences on it representing shared (read) and exclusive
> >>>> (write) access to the object.  When work is submitted which, for
> >>>> instance renders to the dma_buf, it's queued waiting on all the fences
> >>>> on the dma_buf and and a dma_fence is created representing the end of
> >>>> said rendering work and it's installed as the dma_buf's exclusive
> >>>> fence.  This way, the kernel can manage all its internal queues (3D
> >>>> rendering, display, video encode, etc.) and know which things to
> >>>> submit in what order.
> >>>>
> >>>> For the last few years, we've had sync_file in the kernel and it's
> >>>> plumbed into some drivers.  A sync_file is just a wrapper around a
> >>>> single dma_fence.  A sync_file is typically created as a by-product of
> >>>> submitting work (3D, compute, etc.) to the kernel and is signaled when
> >>>> that work completes.  When a sync_file is created, it is guaranteed by
> >>>> the kernel that it will become signaled in finite time and, once it's
> >>>> signaled, it remains signaled for the rest of time.  A sync_file is
> >>>> represented in UAPIs as a file descriptor and can be used with normal
> >>>> file APIs such as dup().  It can be passed into another UAPI which
> >>>> does some bit of queue'd work and the submitted work will wait for the
> >>>> sync_file to be triggered before executing.  A sync_file also supports
> >>>> poll() if  you want to wait on it manually.
> >>>>
> >>>> Unfortunately, sync_file is not broadly used and not all kernel GPU
> >>>> drivers support it.  Here's a very quick overview of my understanding
> >>>> of the status of various components (I don't know the status of
> >>>> anything in the media world):
> >>>>
> >>>>  - Vulkan: Explicit synchronization all the way but we have to go
> >>>> implicit as soon as we interact with a window-system.  Vulkan has APIs
> >>>> to import/export sync_files to/from it's VkSemaphore and VkFence
> >>>> synchronization primitives.
> >>>>  - OpenGL: Implicit all the way.  There are some EGL extensions to
> >>>> enable some forms of explicit sync via sync_file but OpenGL itself is
> >>>> still implicit.
> >>>>  - Wayland: Currently depends on implicit sync in the kernel (accessed
> >>>> via EGL/OpenGL).  There is an unstable extension to allow passing
> >>>> sync_files around but it's questionable how useful it is right now
> >>>> (more on that later).
> >>>>  - X11: With present, it has these "explicit" fence objects but
> >>>> they're always a shmfence which lets the X server and client do a
> >>>> userspace CPU-side hand-off without going over the socket (and
> >>>> round-tripping through the kernel).  However, the only thing that
> >>>> fence does is order the OpenGL API calls in the client and server and
> >>>> the real synchronization is still implicit.
> >>>>  - linux/i915/gem: Fully supports using sync_file or syncobj for explicit
> >>>> sync.
> >>>>  - linux/amdgpu: Supports sync_file and syncobj but it still
> >>>> implicitly syncs sometimes due to it's internal memory residency
> >>>> handling which can lead to over-synchronization.
> >>>>  - KMS: Implicit sync all the way.  There are no KMS APIs which take
> >>>> explicit sync primitives.
> >>>
> >>> Correction:  Apparently, I missed some things.  If you use atomic, KMS
> >>> does have explicit in- and out-fences.  Non-atomic users (e.g. X11)
> >>> are still in trouble but most Wayland compositors use atomic these
> >>> days
> >>>
> >>>>  - v4l: ???
> >>>>  - gstreamer: ???
> >>>>  - Media APIs such as vaapi etc.:  ???
> >>
> >> GStreamer is a consumer of V4L2, VAAPI and other stuff. Asynchronous buffer
> >> synchronisation is something we already do with GL (even if limited). We
> >> place a GLSync object in the pipeline and attach it to the related GstBuffer.
> >> We wait on these GLSync objects as late as possible (or supersede the sync if
> >> we queue more work into the same GL context). That requires a special mode of
> >> operation, of course. We don't usually like making a lazy blocking call
> >> implicit, as it tends to cause random issues. If we need to wait, we think
> >> it's better to wait in the module that is responsible, so in general we try
> >> to negotiate and fall back locally (it's plugin based, so this can get really
> >> messy otherwise).
> >>
> >> So basically this problem needs to be solved in V4L2, VAAPI and other lower
> >> level APIs first. We need an API that provides us these fences (in or out),
> >> and then we can consider using them. For V4L2 there was an attempt, but it
> >> was a bit of a misfit. Your proposal could work (it would need to be tested,
> >> I guess), but it does not solve some of the other issues that were discussed.
> >> Notably, for camera capture, the HW timestamp is captured at about the same
> >> time the frame is ready, but the timestamp is not part of the payload, so you
> >> need an entire API to asynchronously deliver that metadata. That's the
> >> biggest pain point I've found; such an API would be quite invasive or, if
> >> made really generic, might just never be adopted widely enough.
> >
> > Another issue is that V4L2 doesn't offer any guarantee on job ordering.
> > When you queue multiple buffers for camera capture for instance, you
> > don't know until capture complete in which buffer the frame has been
> > captured.
> 
> Is this a Kernel UAPI issue?  Surely the kernel driver knows at the
> start of frame capture which buffer it's getting written into.  I
> would think that the kernel APIs could be adjusted (if we find good
> reason to do so!) such that they return earlier and return a (buffer,
> fence) pair.  Am I missing something fundamental about video here?

For cameras I believe we could do that, yes. I was pointing out the
issues caused by the current API. For video decoders I'll let Nicolas
answer the question; he's way more knowledgeable than I am on that
topic.

> I must admit that V4L is a bit of an odd case since the kernel driver
> is the producer and not the consumer.

Note that V4L2 can be a consumer too. Video output with V4L2 is less
frequent than video capture (but it still exists), and codecs and other
memory-to-memory processing devices (colorspace converters, scalers,
...) are both consumers and producers.

> > In the normal case buffers are processed in sequence, but if
> > an error occurs during capture, they can be recycled internally and put
> > to the back of the queue.
> 
> Are those errors something that can happen at any time in the middle
> of a frame capture?  If so, that does make things stickier.

Yes it can. Think of packet loss when capturing from a USB webcam for
instance. 

> > Unless I'm mistaken, this problem also exists
> > with stateful codecs. And if you don't know in advance which buffer you
> > will receive from the device, the usefulness of fences becomes very
> > questionable :-)
> 
> Yeah, if you really are in a situation where there's no way to know
> until the full frame capture has been completed which buffer is next,
> then fences are useless.  You aren't in an implicit synchronization
> setting either; you're in a "full flush" setting.  It's arguably worse
> for performance but perhaps unavoidable?

Probably unavoidable in some cases, but nothing that should get in the
way of the discussion at hand: there's no need to migrate away from
implicit sync when there's no implicit sync in the first place :-)

I think we need to analyse the use cases here, and figure out at least
guidelines for userspace, otherwise applications will wonder what
behaviour to implement, and we'll end up with a wide variety of them.
Even just on the kernel side, some V4L2 capture drivers will pass
erroneous frames to userspace (thus guaranteeing ordering, but without
early notification of errors), some will requeue the frame
automatically, and at least one (uvcvideo) has a module parameter to
pick the desired behaviour.

> Trying to understand. :-)

So am I :-)

> >> There are other elements that would implement fencing, notably kmssink, but
> >> no one has actually dared porting it to atomic KMS, so clearly there is very
> >> little community interest. glimagesink could clearly benefit. Right now, if
> >> we import a DMABuf and that DMABuf is used for rendering, an implicit fence
> >> is attached that we are unaware of. Philipp Zabel is working on a patch so
> >> that V4L2 QBUF would wait, but waiting in QBUF is not allowed if O_NONBLOCK
> >> was set (which GStreamer uses), so the operation would just fail where it
> >> worked before (breaking userspace). If it was an explicit fence, we could
> >> handle that in GStreamer cleanly, as we do for new APIs.
> >>
> >>>> ## Chicken and egg problems
> >>>>
> >>>> Ok, this is where it starts getting depressing.  I made the claim
> >>>> above that Wayland has an explicit synchronization protocol that's of
> >>>> questionable usefulness.  I would claim that basically any bit of
> >>>> plumbing we do through window systems is currently of questionable
> >>>> usefulness.  Why?
> >>>>
> >>>> From my perspective, as a Vulkan driver developer, I have to deal with
> >>>> the fact that Vulkan is an explicit sync API but Wayland and X11
> >>>> aren't.  Unfortunately, the Wayland extension solves zero problems for
> >>>> me because I can't really use it unless it's implemented in all of the
> >>>> compositors.  Until every Wayland compositor I care about my users
> >>>> being able to use (which is basically all of them) supports the
> >>>> extension, I have to continue carry around my pile of hacks to keep
> >>>> implicit sync and Vulkan working nicely together.
> >>>>
> >>>> From the perspective of a Wayland compositor (I used to play in this
> >>>> space), they'd love to implement the new explicit sync extension but
> >>>> can't.  Sure, they could wire up the extension, but the moment they go
> >>>> to flip a client buffer to the screen directly, they discover that KMS
> >>>> doesn't support any explicit sync APIs.
> >>>
> >>> As per the above correction, Wayland compositors aren't nearly as bad
> >>> off as I initially thought.  There may still be weird screen capture
> >>> cases but the normal cases of compositing and displaying via
> >>> KMS/atomic should be in reasonably good shape.
> >>>
> >>>> So, yes, they can technically
> >>>> implement the extension assuming the EGL stack they're running on has
> >>>> the sync_file extensions but any client buffers which come in using
> >>>> the explicit sync Wayland extension have to be composited and can't be
> >>>> scanned out directly.  As a 3D driver developer, I absolutely don't
> >>>> want compositors doing that because my users will complain about
> >>>> performance issues due to the extra blit.
> >>>>
> >>>> Ok, so let's say we get KMS wired up with implicit sync.  That solves
> >>>> all our problems, right?  It does, right up until someone decides that
> >>>> they wan to screen capture their Wayland session via some hardware
> >>>> media encoder that doesn't support explicit sync.  Now we have to
> >>>> plumb it all the way through the media stack, gstreamer, etc.  Great,
> >>>> so let's do that!  Oh, but gstreamer won't want to plumb it through
> >>>> until they're guaranteed that they can use explicit sync when
> >>>> displaying on X11 or Wayland.  Are you seeing the problem?
> >>>>
> >>>> To make matters worse, since most things are doing implicit
> >>>> synchronization today, it's really easy to get your explicit
> >>>> synchronization wrong and never notice.  If you forget to pass a
> >>>> sync_file into one place (say you never notice KMS doesn't support
> >>>> them), it will probably work anyway thanks to all the implicit sync
> >>>> that's going on elsewhere.
> >>>>
> >>>> So, clearly, we all need to go write piles of code that we can't
> >>>> actually properly test until everyone else has written their piece and
> >>>> then we use explicit sync if and only if all components support it.
> >>>> Really?  We're going to do multiple years of development and then just
> >>>> hope it works when we finally flip the switch?  That doesn't sound
> >>>> like a good plan to me.
> >>>>
> >>>>
> >>>> ## A proposal: Implicit and explicit sync together
> >>>>
> >>>> How to solve all these chicken-and-egg problems is something I've been
> >>>> giving quite a bit of thought (and talking with many others about) in
> >>>> the last couple of years.  One motivation for this is that we have to
> >>>> deal with a mismatch in Vulkan.  Another motivation is that I'm
> >>>> becoming increasingly unhappy with the way that synchronization,
> >>>> memory residency, and command submission are inherently intertwined in
> >>>> i915 and would like to break things apart.  Towards that end, I have
> >>>> an actual proposal.
> >>>>
> >>>> A couple weeks ago, I sent a series of patches to the dri-devel
> >>>> mailing list which adds a pair of new ioctls to dma-buf which allow
> >>>> userspace to manually import or export a sync_file from a dma-buf.
> >>>> The idea is that something like a Wayland compositor can switch to
> >>>> 100% explicit sync internally once the ioctl is available.  If it gets
> >>>> buffers in from a client that doesn't use the explicit sync extension,
> >>>> it can pull a sync_file from the dma-buf and use that exactly as it
> >>>> would a sync_file passed via the explicit sync extension.  When it
> >>>> goes to scan out a user buffer and discovers that KMS doesn't accept
> >>>> sync_files (or if it tries to use that pesky media encoder no one has
> >>>> converted), it can take it's sync_file for display and stuff it into
> >>>> the dma-buf before handing it to KMS.
> >>>>
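To make the proposed UAPI concrete, userspace usage would look roughly like the
sketch below.  The struct and ioctl definitions follow the RFC and may well
change before anything lands, so treat them as illustrative rather than as an
existing kernel interface:

#include <linux/types.h>
#include <sys/ioctl.h>

/* Definitions as proposed in the RFC; not yet in <linux/dma-buf.h>. */
struct dma_buf_export_sync_file { __u32 flags; __s32 fd; };
struct dma_buf_import_sync_file { __u32 flags; __s32 fd; };
#define DMA_BUF_SYNC_READ   (1 << 0)
#define DMA_BUF_SYNC_WRITE  (2 << 0)
#define DMA_BUF_IOCTL_EXPORT_SYNC_FILE _IOWR('b', 2, struct dma_buf_export_sync_file)
#define DMA_BUF_IOCTL_IMPORT_SYNC_FILE _IOW('b', 3, struct dma_buf_import_sync_file)

/* Compositor receiving an implicit-sync client buffer: pull out a sync_file
 * that signals when it is safe to read (texture from) the dma-buf. */
static int dmabuf_export_read_fence(int dmabuf_fd)
{
        struct dma_buf_export_sync_file arg = { .flags = DMA_BUF_SYNC_READ };

        if (ioctl(dmabuf_fd, DMA_BUF_IOCTL_EXPORT_SYNC_FILE, &arg))
                return -1;
        return arg.fd;
}

/* Handing the buffer to an implicit-sync consumer (KMS, a media encoder):
 * install our explicit fence as a write fence so that consumer waits on it
 * (flag semantics per the RFC). */
static int dmabuf_import_fence(int dmabuf_fd, int sync_file_fd)
{
        struct dma_buf_import_sync_file arg = {
                .flags = DMA_BUF_SYNC_WRITE,
                .fd    = sync_file_fd,
        };

        return ioctl(dmabuf_fd, DMA_BUF_IOCTL_IMPORT_SYNC_FILE, &arg);
}
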
> >>>> Along with the kernel patches, I've also implemented support for this
> >>>> in the Vulkan WSI code used by ANV and RADV.  With those patches, the
> >>>> only requirement on the Vulkan drivers is that you be able to export
> >>>> any VkSemaphore as a sync_file and temporarily import a sync_file into
> >>>> any VkFence or VkSemaphore.  As long as that works, the core Vulkan
> >>>> driver only ever sees explicit synchronization via sync_file.  The WSI
> >>>> code uses these new ioctls to translate the implicit sync of X11 and
> >>>> Wayland to the explicit sync the Vulkan driver wants.
> >>>>
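Those two driver-side requirements correspond to the stock
VK_KHR_external_semaphore_fd entry points with SYNC_FD handles, roughly as
sketched below.  The extension entry points are assumed to have been loaded
via vkGetDeviceProcAddr, and error handling is omitted:

#include <vulkan/vulkan.h>

/* Export a sync_file fd from a payload-carrying VkSemaphore. */
static int semaphore_to_sync_file(VkDevice dev, VkSemaphore sem,
                                  PFN_vkGetSemaphoreFdKHR get_fd)
{
        const VkSemaphoreGetFdInfoKHR info = {
                .sType = VK_STRUCTURE_TYPE_SEMAPHORE_GET_FD_INFO_KHR,
                .semaphore = sem,
                .handleType = VK_EXTERNAL_SEMAPHORE_HANDLE_TYPE_SYNC_FD_BIT,
        };
        int fd = -1;

        get_fd(dev, &info, &fd);
        return fd;
}

/* Temporarily import a sync_file into a VkSemaphore so the next wait on it
 * waits for that fence (the temporary import is consumed by the wait). */
static void sync_file_to_semaphore(VkDevice dev, VkSemaphore sem, int fd,
                                   PFN_vkImportSemaphoreFdKHR import_fd)
{
        const VkImportSemaphoreFdInfoKHR info = {
                .sType = VK_STRUCTURE_TYPE_IMPORT_SEMAPHORE_FD_INFO_KHR,
                .semaphore = sem,
                .flags = VK_SEMAPHORE_IMPORT_TEMPORARY_BIT,
                .handleType = VK_EXTERNAL_SEMAPHORE_HANDLE_TYPE_SYNC_FD_BIT,
                .fd = fd,
        };

        import_fd(dev, &info);
}
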
> >>>> I'm hoping (and here's where I want a sanity check) that a simple API
> >>>> like this will allow us to finally start moving the Linux ecosystem
> >>>> over to explicit synchronization one piece at a time in a way that's
> >>>> actually correct.  (No Wayland explicit sync with compositors hoping
> >>>> KMS magically works even though it doesn't have a sync_file API.)
> >>>> Once some pieces in the ecosystem start moving, there will be
> >>>> motivation to start moving others and maybe we can actually build the
> >>>> momentum to get most everything converted.
> >>>>
> >>>> For reference, you can find the kernel RFC patches and mesa MR here:
> >>>>
> >>>> https://lists.freedesktop.org/archives/dri-devel/2020-March/258833.html
> >>>>
> >>>> https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/4037
> >>>>
> >>>> At this point, I welcome your thoughts, comments, objections, and
> >>>> maybe even help/review. :-)

-- 
Regards,

Laurent Pinchart
_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: Plumbing explicit synchronization through the Linux ecosystem
  2020-03-16 21:15           ` Laurent Pinchart
@ 2020-03-16 22:02             ` Jason Ekstrand
  -1 siblings, 0 replies; 101+ messages in thread
From: Jason Ekstrand @ 2020-03-16 22:02 UTC (permalink / raw)
  To: Laurent Pinchart
  Cc: Nicolas Dufresne, ML mesa-dev,
	Discussion of the development of and with GStreamer,
	wayland-devel @ lists . freedesktop . org, xorg-devel,
	Maling list - DRI developers, linux-media, Dave Airlie,
	Daniel Vetter, Bas Nieuwenhuizen, Daniel Stone

On Mon, Mar 16, 2020 at 4:15 PM Laurent Pinchart
<laurent.pinchart@ideasonboard.com> wrote:
>
> Hi Jason,
>
> On Mon, Mar 16, 2020 at 10:06:07AM -0500, Jason Ekstrand wrote:
> > On Mon, Mar 16, 2020 at 5:20 AM Laurent Pinchart wrote:
> > > On Wed, Mar 11, 2020 at 04:18:55PM -0400, Nicolas Dufresne wrote:
> > >> (I know I'm going to be spammed by so many mailing list ...)
> > >>
> > >> Le mercredi 11 mars 2020 à 14:21 -0500, Jason Ekstrand a écrit :
> > >>> On Wed, Mar 11, 2020 at 12:31 PM Jason Ekstrand <jason@jlekstrand.net> wrote:
> > >>>> All,
> > >>>>
> > >>>> Sorry for casting such a broad net with this one. I'm sure most people
> > >>>> who reply will get at least one mailing list rejection.  However, this
> > >>>> is an issue that affects a LOT of components and that's why it's
> > >>>> thorny to begin with.  Please pardon the length of this e-mail as
> > >>>> well; I promise there's a concrete point/proposal at the end.
> > >>>>
> > >>>>
> > >>>> Explicit synchronization is the future of graphics and media.  At
> > >>>> least, that seems to be the consensus among all the graphics people
> > >>>> I've talked to.  I had a chat with one of the lead Android graphics
> > >>>> engineers recently who told me that doing explicit sync from the start
> > >>>> was one of the best engineering decisions Android ever made.  It's
> > >>>> also the direction being taken by more modern APIs such as Vulkan.
> > >>>>
> > >>>>
> > >>>> ## What are implicit and explicit synchronization?
> > >>>>
> > >>>> For those that aren't familiar with this space, GPUs, media encoders,
> > >>>> etc. are massively parallel and synchronization of some form is
> > >>>> required to ensure that everything happens in the right order and
> > >>>> avoid data races.  Implicit synchronization is when bits of work (3D,
> > >>>> compute, video encode, etc.) are implicitly based on the absolute
> > >>>> CPU-time order in which API calls occur.  Explicit synchronization is
> > >>>> when the client (whatever that means in any given context) provides
> > >>>> the dependency graph explicitly via some sort of synchronization
> > >>>> primitives.  If you're still confused, consider the following
> > >>>> examples:
> > >>>>
> > >>>> With OpenGL and EGL, almost everything is implicit sync.  Say you have
> > >>>> two OpenGL contexts sharing an image where one writes to it and the
> > >>>> other textures from it.  The way the OpenGL spec works, the client has
> > >>>> to make the API calls to render to the image before (in CPU time) it
> > >>>> makes the API calls which texture from the image.  As long as it does
> > >>>> this (and maybe inserts a glFlush?), the driver will ensure that the
> > >>>> rendering completes before the texturing happens and you get correct
> > >>>> contents.
> > >>>>
> > >>>> Implicit synchronization can also happen across processes.  Wayland,
> > >>>> for instance, is currently built on implicit sync where the client
> > >>>> does their rendering and then does a hand-off (via wl_surface::commit)
> > >>>> to tell the compositor it's done at which point the compositor can now
> > >>>> texture from the surface.  The hand-off ensures that the client's
> > >>>> OpenGL API calls happen before the server's OpenGL API calls.
> > >>>>
> > >>>> A good example of explicit synchronization is the Vulkan API.  There,
> > >>>> a client (or multiple clients) can simultaneously build command
> > >>>> buffers in different threads where one of those command buffers
> > >>>> renders to an image and the other textures from it and then submit
> > >>>> both of them at the same time with instructions to the driver for
> > >>>> which order to execute them in.  The execution order is described via
> > >>>> the VkSemaphore primitive.  With the new VK_KHR_timeline_semaphore
> > >>>> extension, you can even submit the work which does the texturing
> > >>>> BEFORE the work which does the rendering and the driver will sort it
> > >>>> out.
> > >>>>
> > >>>> The #1 problem with implicit synchronization (which explicit solves)
> > >>>> is that it leads to a lot of over-synchronization both in client space
> > >>>> and in driver/device space.  The client has to synchronize a lot more
> > >>>> because it has to ensure that the API calls happen in a particular
> > >>>> order.  The driver/device have to synchronize a lot more because they
> > >>>> never know what is going to end up being a synchronization point as an
> > >>>> API call on another thread/process may occur at any time.  As we move
> > >>>> to more and more multi-threaded programming this synchronization (on
> > >>>> the client-side especially) becomes more and more painful.
> > >>>>
> > >>>>
> > >>>> ## Current status in Linux
> > >>>>
> > >>>> Implicit synchronization in Linux works via a the kernel's internal
> > >>>> dma_buf and dma_fence data structures.  A dma_fence is a tiny object
> > >>>> which represents the "done" status for some bit of work.  Typically,
> > >>>> dma_fences are created as a by-product of someone submitting some bit
> > >>>> of work (say, 3D rendering) to the kernel.  The dma_buf object has a
> > >>>> set of dma_fences on it representing shared (read) and exclusive
> > >>>> (write) access to the object.  When work is submitted which, for
> > >>>> instance renders to the dma_buf, it's queued waiting on all the fences
> > >>>> on the dma_buf and and a dma_fence is created representing the end of
> > >>>> said rendering work and it's installed as the dma_buf's exclusive
> > >>>> fence.  This way, the kernel can manage all its internal queues (3D
> > >>>> rendering, display, video encode, etc.) and know which things to
> > >>>> submit in what order.
> > >>>>
> > >>>> For the last few years, we've had sync_file in the kernel and it's
> > >>>> plumbed into some drivers.  A sync_file is just a wrapper around a
> > >>>> single dma_fence.  A sync_file is typically created as a by-product of
> > >>>> submitting work (3D, compute, etc.) to the kernel and is signaled when
> > >>>> that work completes.  When a sync_file is created, it is guaranteed by
> > >>>> the kernel that it will become signaled in finite time and, once it's
> > >>>> signaled, it remains signaled for the rest of time.  A sync_file is
> > >>>> represented in UAPIs as a file descriptor and can be used with normal
> > >>>> file APIs such as dup().  It can be passed into another UAPI which
> > >>>> does some bit of queue'd work and the submitted work will wait for the
> > >>>> sync_file to be triggered before executing.  A sync_file also supports
> > >>>> poll() if  you want to wait on it manually.
> > >>>>
> > >>>> Unfortunately, sync_file is not broadly used and not all kernel GPU
> > >>>> drivers support it.  Here's a very quick overview of my understanding
> > >>>> of the status of various components (I don't know the status of
> > >>>> anything in the media world):
> > >>>>
> > >>>>  - Vulkan: Explicit synchronization all the way but we have to go
> > >>>> implicit as soon as we interact with a window-system.  Vulkan has APIs
> > >>>> to import/export sync_files to/from it's VkSemaphore and VkFence
> > >>>> synchronization primitives.
> > >>>>  - OpenGL: Implicit all the way.  There are some EGL extensions to
> > >>>> enable some forms of explicit sync via sync_file but OpenGL itself is
> > >>>> still implicit.
> > >>>>  - Wayland: Currently depends on implicit sync in the kernel (accessed
> > >>>> via EGL/OpenGL).  There is an unstable extension to allow passing
> > >>>> sync_files around but it's questionable how useful it is right now
> > >>>> (more on that later).
> > >>>>  - X11: With present, it has these "explicit" fence objects but
> > >>>> they're always a shmfence which lets the X server and client do a
> > >>>> userspace CPU-side hand-off without going over the socket (and
> > >>>> round-tripping through the kernel).  However, the only thing that
> > >>>> fence does is order the OpenGL API calls in the client and server and
> > >>>> the real synchronization is still implicit.
> > >>>>  - linux/i915/gem: Fully supports using sync_file or syncobj for explicit
> > >>>> sync.
> > >>>>  - linux/amdgpu: Supports sync_file and syncobj but it still
> > >>>> implicitly syncs sometimes due to it's internal memory residency
> > >>>> handling which can lead to over-synchronization.
> > >>>>  - KMS: Implicit sync all the way.  There are no KMS APIs which take
> > >>>> explicit sync primitives.
> > >>>
> > >>> Correction:  Apparently, I missed some things.  If you use atomic, KMS
> > >>> does have explicit in- and out-fences.  Non-atomic users (e.g. X11)
> > >>> are still in trouble but most Wayland compositors use atomic these
> > >>> days
> > >>>
> > >>>>  - v4l: ???
> > >>>>  - gstreamer: ???
> > >>>>  - Media APIs such as vaapi etc.:  ???
> > >>
> > >> GStreamer is a consumer of V4L2, VAAPI and other stuff. Asynchronous buffer
> > >> synchronisation is something we already do with GL (even if limited). We
> > >> place a GLSync object in the pipeline and attach it to the related GstBuffer.
> > >> We wait on these GLSync objects as late as possible (or supersede the sync if
> > >> we queue more work into the same GL context). That requires a special mode of
> > >> operation, of course. We don't usually like making a lazy blocking call
> > >> implicit, as it tends to cause random issues. If we need to wait, we think
> > >> it's better to wait in the module that is responsible, so in general we try
> > >> to negotiate and fall back locally (it's plugin based, so this can get really
> > >> messy otherwise).
> > >>
> > >> So basically this problem needs to be solved in V4L2, VAAPI and other lower
> > >> level APIs first. We need an API that provides us these fences (in or out),
> > >> and then we can consider using them. For V4L2 there was an attempt, but it
> > >> was a bit of a misfit. Your proposal could work (it would need to be tested,
> > >> I guess), but it does not solve some of the other issues that were discussed.
> > >> Notably, for camera capture, the HW timestamp is captured at about the same
> > >> time the frame is ready, but the timestamp is not part of the payload, so you
> > >> need an entire API to asynchronously deliver that metadata. That's the
> > >> biggest pain point I've found; such an API would be quite invasive or, if
> > >> made really generic, might just never be adopted widely enough.
> > >
> > > Another issue is that V4L2 doesn't offer any guarantee on job ordering.
> > > When you queue multiple buffers for camera capture for instance, you
> > > don't know until capture complete in which buffer the frame has been
> > > captured.
> >
> > Is this a Kernel UAPI issue?  Surely the kernel driver knows at the
> > start of frame capture which buffer it's getting written into.  I
> > would think that the kernel APIs could be adjusted (if we find good
> > reason to do so!) such that they return earlier and return a (buffer,
> > fence) pair.  Am I missing something fundamental about video here?
>
> For cameras I believe we could do that, yes. I was pointing out the
> issues caused by the current API. For video decoders I'll let Nicolas
> answer the question; he's way more knowledgeable than I am on that
> topic.
>
> > I must admit that V4L is a bit of an odd case since the kernel driver
> > is the producer and not the consumer.
>
> Note that V4L2 can be a consumer too. Video output with V4L2 is less
> frequent than video capture (but it still exists), and codecs and other
> memory-to-memory processing devices (colorspace converters, scalers,
> ...) are both consumers and producers.

Yeah, I think I was aware of at least some of that.  I would expect
(though, again, I don't know) that the hardware which consumes images
generally shouldn't have those "packet loss" problems.  A video output
device might miss vblank but that's something we deal with in display
hardware all the time.  Codecs, I would hope, are reliable enough that
they should more-or-less always succeed.  Is this assumption correct?

> > > In the normal case buffers are processed in sequence, but if
> > > an error occurs during capture, they can be recycled internally and put
> > > to the back of the queue.
> >
> > Are those errors something that can happen at any time in the middle
> > of a frame capture?  If so, that does make things stickier.
>
> Yes it can. Think of packet loss when capturing from a USB webcam for
> instance.

Yeah, that makes sense.  In that case, there are likely going to be
some devices that either need to wait for the actual end-of-frame
before handing the buffer back to userspace or will need some sort of
out-of-band "ignore this frame, it's corrupted" error.  The latter
sounds fairly painful for userspace to handle correctly.  Is this
"packet loss" something that all video devices experience or is it
mostly cheaper ones?

> > > Unless I'm mistaken, this problem also exists
> > > with stateful codecs. And if you don't know in advance which buffer you
> > > will receive from the device, the usefulness of fences becomes very
> > > questionable :-)
> >
> > Yeah, if you really are in a situation where there's no way to know
> > until the full frame capture has been completed which buffer is next,
> > then fences are useless.  You aren't in an implicit synchronization
> > setting either; you're in a "full flush" setting.  It's arguably worse
> > for performance but perhaps unavoidable?
>
> Probably unavoidable in some cases, but nothing that should get in the
> way of the discussion at hand: there's no need to migrate away from
> implicit sync when there's no implicit sync in the first place :-)

Just to be clear, do you actually use the dma-buf implicit sync stuff
today or do all V4L capture devices wait until the full frame is
complete before returning anything to userspace?

> I think we need to analyse the use cases here, and figure out at least
> guidelines for userspace, otherwise applications will wonder what
> behaviour to implement, and we'll end up with a wide variety of them.

Yeah, there are some API design questions to be answered here.  It's
possible to have an image output API which always provides a sync_file
and, depending on the hardware, it may have one of two behaviors:

 1. Hand out images before the capture is done and trigger the sync
file once that frame's capture is completed
 2. Hand out images only after the full frame has been completed and
provide an already triggered sync_file

It would also be possible to make whether or not you get a sync_file
vs. implicit sync configurable from userspace (it kind-of has to be
opt-in since it would be a new UAPI) or to make it depend on the
underlying hardware.  This potentially makes userspace software more
complex which may make it harder to get right.  Lots of trade-offs
here.
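
For what it's worth, both behaviors can collapse into the same consumer code,
since an already-triggered sync_file is just the degenerate case.  A purely
hypothetical sketch (dequeue_frame() and submit_downstream() below are invented
placeholder names, not existing UAPI or library calls):

/* Hypothetical sketch only: dequeue_frame() and submit_downstream() do not
 * exist in any current UAPI or library. */
struct frame {
        unsigned int buffer_index;
        int fence_fd;   /* sync_file; already triggered in behavior 2 above */
};

extern int dequeue_frame(int dev_fd, struct frame *f);            /* hypothetical */
extern void submit_downstream(unsigned int index, int fence_fd);  /* hypothetical */

static void consume_frames(int dev_fd)
{
        struct frame f;

        while (dequeue_frame(dev_fd, &f) == 0) {
                /* Either poll() the fence here, or better, pass it to the next
                 * engine (GPU, encoder, KMS) as an in-fence so the CPU never
                 * blocks on the capture hardware. */
                submit_downstream(f.buffer_index, f.fence_fd);
        }
}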

> Even just on the kernel side, some V4L2 capture drivers will pass
> erroneous frames to userspace (thus guaranteeing ordering, but without
> early notification of errors), some will requeue the frame
> automatically, and at least one (uvcvideo) has a module parameter to
> pick the desired behaviour.

Is passing erroneous frames to userspace current behavior?  Or are you
talking about what a sync_file future looks like?

--Jason


> > Trying to understand. :-)
>
> So am I :-)
>
> > >> There are other elements that would implement fencing, notably kmssink, but
> > >> no one has actually dared porting it to atomic KMS, so clearly there is very
> > >> little community interest. glimagesink could clearly benefit. Right now, if
> > >> we import a DMABuf and that DMABuf is used for rendering, an implicit fence
> > >> is attached that we are unaware of. Philipp Zabel is working on a patch so
> > >> that V4L2 QBUF would wait, but waiting in QBUF is not allowed if O_NONBLOCK
> > >> was set (which GStreamer uses), so the operation would just fail where it
> > >> worked before (breaking userspace). If it was an explicit fence, we could
> > >> handle that in GStreamer cleanly, as we do for new APIs.
> > >>
> > >>>> ## Chicken and egg problems
> > >>>>
> > >>>> Ok, this is where it starts getting depressing.  I made the claim
> > >>>> above that Wayland has an explicit synchronization protocol that's of
> > >>>> questionable usefulness.  I would claim that basically any bit of
> > >>>> plumbing we do through window systems is currently of questionable
> > >>>> usefulness.  Why?
> > >>>>
> > >>>> From my perspective, as a Vulkan driver developer, I have to deal with
> > >>>> the fact that Vulkan is an explicit sync API but Wayland and X11
> > >>>> aren't.  Unfortunately, the Wayland extension solves zero problems for
> > >>>> me because I can't really use it unless it's implemented in all of the
> > >>>> compositors.  Until every Wayland compositor I care about my users
> > >>>> being able to use (which is basically all of them) supports the
> > >>>> extension, I have to continue carry around my pile of hacks to keep
> > >>>> implicit sync and Vulkan working nicely together.
> > >>>>
> > >>>> From the perspective of a Wayland compositor (I used to play in this
> > >>>> space), they'd love to implement the new explicit sync extension but
> > >>>> can't.  Sure, they could wire up the extension, but the moment they go
> > >>>> to flip a client buffer to the screen directly, they discover that KMS
> > >>>> doesn't support any explicit sync APIs.
> > >>>
> > >>> As per the above correction, Wayland compositors aren't nearly as bad
> > >>> off as I initially thought.  There may still be weird screen capture
> > >>> cases but the normal cases of compositing and displaying via
> > >>> KMS/atomic should be in reasonably good shape.
> > >>>
> > >>>> So, yes, they can technically
> > >>>> implement the extension assuming the EGL stack they're running on has
> > >>>> the sync_file extensions but any client buffers which come in using
> > >>>> the explicit sync Wayland extension have to be composited and can't be
> > >>>> scanned out directly.  As a 3D driver developer, I absolutely don't
> > >>>> want compositors doing that because my users will complain about
> > >>>> performance issues due to the extra blit.
> > >>>>
> > >>>> Ok, so let's say we get KMS wired up with implicit sync.  That solves
> > >>>> all our problems, right?  It does, right up until someone decides that
> > >>>> they wan to screen capture their Wayland session via some hardware
> > >>>> media encoder that doesn't support explicit sync.  Now we have to
> > >>>> plumb it all the way through the media stack, gstreamer, etc.  Great,
> > >>>> so let's do that!  Oh, but gstreamer won't want to plumb it through
> > >>>> until they're guaranteed that they can use explicit sync when
> > >>>> displaying on X11 or Wayland.  Are you seeing the problem?
> > >>>>
> > >>>> To make matters worse, since most things are doing implicit
> > >>>> synchronization today, it's really easy to get your explicit
> > >>>> synchronization wrong and never notice.  If you forget to pass a
> > >>>> sync_file into one place (say you never notice KMS doesn't support
> > >>>> them), it will probably work anyway thanks to all the implicit sync
> > >>>> that's going on elsewhere.
> > >>>>
> > >>>> So, clearly, we all need to go write piles of code that we can't
> > >>>> actually properly test until everyone else has written their piece and
> > >>>> then we use explicit sync if and only if all components support it.
> > >>>> Really?  We're going to do multiple years of development and then just
> > >>>> hope it works when we finally flip the switch?  That doesn't sound
> > >>>> like a good plan to me.
> > >>>>
> > >>>>
> > >>>> ## A proposal: Implicit and explicit sync together
> > >>>>
> > >>>> How to solve all these chicken-and-egg problems is something I've been
> > >>>> giving quite a bit of thought (and talking with many others about) in
> > >>>> the last couple of years.  One motivation for this is that we have to
> > >>>> deal with a mismatch in Vulkan.  Another motivation is that I'm
> > >>>> becoming increasingly unhappy with the way that synchronization,
> > >>>> memory residency, and command submission are inherently intertwined in
> > >>>> i915 and would like to break things apart.  Towards that end, I have
> > >>>> an actual proposal.
> > >>>>
> > >>>> A couple weeks ago, I sent a series of patches to the dri-devel
> > >>>> mailing list which adds a pair of new ioctls to dma-buf which allow
> > >>>> userspace to manually import or export a sync_file from a dma-buf.
> > >>>> The idea is that something like a Wayland compositor can switch to
> > >>>> 100% explicit sync internally once the ioctl is available.  If it gets
> > >>>> buffers in from a client that doesn't use the explicit sync extension,
> > >>>> it can pull a sync_file from the dma-buf and use that exactly as it
> > >>>> would a sync_file passed via the explicit sync extension.  When it
> > >>>> goes to scan out a user buffer and discovers that KMS doesn't accept
> > >>>> sync_files (or if it tries to use that pesky media encoder no one has
> > >>>> converted), it can take it's sync_file for display and stuff it into
> > >>>> the dma-buf before handing it to KMS.
> > >>>>
> > >>>> Along with the kernel patches, I've also implemented support for this
> > >>>> in the Vulkan WSI code used by ANV and RADV.  With those patches, the
> > >>>> only requirement on the Vulkan drivers is that you be able to export
> > >>>> any VkSemaphore as a sync_file and temporarily import a sync_file into
> > >>>> any VkFence or VkSemaphore.  As long as that works, the core Vulkan
> > >>>> driver only ever sees explicit synchronization via sync_file.  The WSI
> > >>>> code uses these new ioctls to translate the implicit sync of X11 and
> > >>>> Wayland to the explicit sync the Vulkan driver wants.
> > >>>>
> > >>>> I'm hoping (and here's where I want a sanity check) that a simple API
> > >>>> like this will allow us to finally start moving the Linux ecosystem
> > >>>> over to explicit synchronization one piece at a time in a way that's
> > >>>> actually correct.  (No Wayland explicit sync with compositors hoping
> > >>>> KMS magically works even though it doesn't have a sync_file API.)
> > >>>> Once some pieces in the ecosystem start moving, there will be
> > >>>> motivation to start moving others and maybe we can actually build the
> > >>>> momentum to get most everything converted.
> > >>>>
> > >>>> For reference, you can find the kernel RFC patches and mesa MR here:
> > >>>>
> > >>>> https://lists.freedesktop.org/archives/dri-devel/2020-March/258833.html
> > >>>>
> > >>>> https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/4037
> > >>>>
> > >>>> At this point, I welcome your thoughts, comments, objections, and
> > >>>> maybe even help/review. :-)
>
> --
> Regards,
>
> Laurent Pinchart

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: Plumbing explicit synchronization through the Linux ecosystem
  2020-03-11 19:21   ` Jason Ekstrand
@ 2020-03-16 23:41     ` Roman Gilg
  -1 siblings, 0 replies; 101+ messages in thread
From: Roman Gilg @ 2020-03-16 23:41 UTC (permalink / raw)
  To: Jason Ekstrand
  Cc: ML mesa-dev, Discussion of the development of and with GStreamer,
	wayland-devel @ lists . freedesktop . org, xorg-devel,
	Maling list - DRI developers, linux-media, Dave Airlie,
	Daniel Vetter, Bas Nieuwenhuizen, Daniel Stone

On Wed, Mar 11, 2020 at 8:21 PM Jason Ekstrand <jason@jlekstrand.net> wrote:
>
> On Wed, Mar 11, 2020 at 12:31 PM Jason Ekstrand <jason@jlekstrand.net> wrote:
> >
> > All,
> >
> > Sorry for casting such a broad net with this one. I'm sure most people
> > who reply will get at least one mailing list rejection.  However, this
> > is an issue that affects a LOT of components and that's why it's
> > thorny to begin with.  Please pardon the length of this e-mail as
> > well; I promise there's a concrete point/proposal at the end.
> >
> >
> > Explicit synchronization is the future of graphics and media.  At
> > least, that seems to be the consensus among all the graphics people
> > I've talked to.  I had a chat with one of the lead Android graphics
> > engineers recently who told me that doing explicit sync from the start
> > was one of the best engineering decisions Android ever made.  It's
> > also the direction being taken by more modern APIs such as Vulkan.
> >
> >
> > ## What are implicit and explicit synchronization?
> >
> > For those that aren't familiar with this space, GPUs, media encoders,
> > etc. are massively parallel and synchronization of some form is
> > required to ensure that everything happens in the right order and
> > avoid data races.  Implicit synchronization is when bits of work (3D,
> > compute, video encode, etc.) are implicitly based on the absolute
> > CPU-time order in which API calls occur.  Explicit synchronization is
> > when the client (whatever that means in any given context) provides
> > the dependency graph explicitly via some sort of synchronization
> > primitives.  If you're still confused, consider the following
> > examples:
> >
> > With OpenGL and EGL, almost everything is implicit sync.  Say you have
> > two OpenGL contexts sharing an image where one writes to it and the
> > other textures from it.  The way the OpenGL spec works, the client has
> > to make the API calls to render to the image before (in CPU time) it
> > makes the API calls which texture from the image.  As long as it does
> > this (and maybe inserts a glFlush?), the driver will ensure that the
> > rendering completes before the texturing happens and you get correct
> > contents.
> >
> > Implicit synchronization can also happen across processes.  Wayland,
> > for instance, is currently built on implicit sync where the client
> > does their rendering and then does a hand-off (via wl_surface::commit)
> > to tell the compositor it's done at which point the compositor can now
> > texture from the surface.  The hand-off ensures that the client's
> > OpenGL API calls happen before the server's OpenGL API calls.
> >
> > A good example of explicit synchronization is the Vulkan API.  There,
> > a client (or multiple clients) can simultaneously build command
> > buffers in different threads where one of those command buffers
> > renders to an image and the other textures from it and then submit
> > both of them at the same time with instructions to the driver for
> > which order to execute them in.  The execution order is described via
> > the VkSemaphore primitive.  With the new VK_KHR_timeline_semaphore
> > extension, you can even submit the work which does the texturing
> > BEFORE the work which does the rendering and the driver will sort it
> > out.
> >
> > The #1 problem with implicit synchronization (which explicit solves)
> > is that it leads to a lot of over-synchronization both in client space
> > and in driver/device space.  The client has to synchronize a lot more
> > because it has to ensure that the API calls happen in a particular
> > order.  The driver/device have to synchronize a lot more because they
> > never know what is going to end up being a synchronization point as an
> > API call on another thread/process may occur at any time.  As we move
> > to more and more multi-threaded programming this synchronization (on
> > the client-side especially) becomes more and more painful.
> >
> >
> > ## Current status in Linux
> >
> > Implicit synchronization in Linux works via the kernel's internal
> > dma_buf and dma_fence data structures.  A dma_fence is a tiny object
> > which represents the "done" status for some bit of work.  Typically,
> > dma_fences are created as a by-product of someone submitting some bit
> > of work (say, 3D rendering) to the kernel.  The dma_buf object has a
> > set of dma_fences on it representing shared (read) and exclusive
> > (write) access to the object.  When work is submitted which, for
> > instance renders to the dma_buf, it's queued waiting on all the fences
> > on the dma_buf and and a dma_fence is created representing the end of
> > said rendering work and it's installed as the dma_buf's exclusive
> > fence.  This way, the kernel can manage all its internal queues (3D
> > rendering, display, video encode, etc.) and know which things to
> > submit in what order.
> >
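A simplified kernel-side sketch of that bookkeeping (function names as
found in the 5.x kernels of this era; a real driver hands the existing
fences to its scheduler as dependencies rather than blocking like this):

    #include <linux/dma-buf.h>
    #include <linux/dma-resv.h>
    #include <linux/dma-fence.h>
    #include <linux/sched.h>

    static int queue_write_and_attach_fence(struct dma_buf *buf,
                                            struct dma_fence *render_done)
    {
        struct dma_resv *resv = buf->resv;
        long ret;

        /* New rendering work must order after every fence already on the
         * buffer: all shared (read) fences plus the exclusive (write) one. */
        ret = dma_resv_wait_timeout_rcu(resv, true /* wait_all */,
                                        false /* intr */,
                                        MAX_SCHEDULE_TIMEOUT);
        if (ret < 0)
            return ret;

        /* ... submit the rendering work to the hardware here ... */

        /* Publish "rendering done" as the buffer's new exclusive fence so
         * later consumers (display, encode, ...) wait on it implicitly. */
        dma_resv_lock(resv, NULL);
        dma_resv_add_excl_fence(resv, render_done);
        dma_resv_unlock(resv);

        return 0;
    }
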
> > For the last few years, we've had sync_file in the kernel and it's
> > plumbed into some drivers.  A sync_file is just a wrapper around a
> > single dma_fence.  A sync_file is typically created as a by-product of
> > submitting work (3D, compute, etc.) to the kernel and is signaled when
> > that work completes.  When a sync_file is created, it is guaranteed by
> > the kernel that it will become signaled in finite time and, once it's
> > signaled, it remains signaled for the rest of time.  A sync_file is
> > represented in UAPIs as a file descriptor and can be used with normal
> > file APIs such as dup().  It can be passed into another UAPI which
> > does some bit of queued work and the submitted work will wait for the
> > sync_file to be triggered before executing.  A sync_file also supports
> > poll() if you want to wait on it manually.
> >
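Concretely, "normal file APIs" means a sync_file fd can be handled like any
other fd; a minimal sketch:

    #include <errno.h>
    #include <poll.h>
    #include <unistd.h>

    /* Block until the sync_file's fence signals (or the timeout expires).
     * Returns 0 once signaled, -1 on timeout or error. */
    static int sync_file_wait(int sync_fd, int timeout_ms)
    {
        struct pollfd pfd = { .fd = sync_fd, .events = POLLIN };
        int ret;

        do {
            ret = poll(&pfd, 1, timeout_ms);
        } while (ret < 0 && errno == EINTR);

        return ret > 0 ? 0 : -1;
    }

    /* Being a plain fd, it can also be dup()'d, sent over a UNIX socket
     * with SCM_RIGHTS, or handed to any UAPI that accepts a sync_file. */
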
> > Unfortunately, sync_file is not broadly used and not all kernel GPU
> > drivers support it.  Here's a very quick overview of my understanding
> > of the status of various components (I don't know the status of
> > anything in the media world):
> >
> >  - Vulkan: Explicit synchronization all the way but we have to go
> > implicit as soon as we interact with a window-system.  Vulkan has APIs
> > to import/export sync_files to/from its VkSemaphore and VkFence
> > synchronization primitives.
> >  - OpenGL: Implicit all the way.  There are some EGL extensions to
> > enable some forms of explicit sync via sync_file but OpenGL itself is
> > still implicit.
> >  - Wayland: Currently depends on implicit sync in the kernel (accessed
> > via EGL/OpenGL).  There is an unstable extension to allow passing
> > sync_files around but it's questionable how useful it is right now
> > (more on that later).
> >  - X11: With present, it has these "explicit" fence objects but
> > they're always a shmfence which lets the X server and client do a
> > userspace CPU-side hand-off without going over the socket (and
> > round-tripping through the kernel).  However, the only thing that
> > fence does is order the OpenGL API calls in the client and server and
> > the real synchronization is still implicit.
> >  - linux/i915/gem: Fully supports using sync_file or syncobj for explicit sync.
> >  - linux/amdgpu: Supports sync_file and syncobj but it still
> > implicitly syncs sometimes due to its internal memory residency
> > handling which can lead to over-synchronization.
> >  - KMS: Implicit sync all the way.  There are no KMS APIs which take
> > explicit sync primitives.
>
> Correction:  Apparently, I missed some things.  If you use atomic, KMS
> does have explicit in- and out-fences.  Non-atomic users (e.g. X11)
> are still in trouble but most Wayland compositors use atomic these
> days.
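
For reference, the atomic uAPI exposes these as the per-plane IN_FENCE_FD
property and the per-CRTC OUT_FENCE_PTR property; a minimal libdrm sketch,
with the property IDs assumed to have been looked up beforehand via
drmModeObjectGetProperties():

    #include <stdint.h>
    #include <xf86drm.h>
    #include <xf86drmMode.h>

    /* Flip 'fb_id' onto 'plane_id', waiting for 'in_fence_fd' (a sync_file)
     * first, and get back a sync_file in *out_fence_fd that signals when
     * the new frame is actually on screen. */
    static int atomic_flip_with_fences(int drm_fd,
                                       uint32_t crtc_id, uint32_t plane_id,
                                       uint32_t fb_id,
                                       uint32_t prop_fb_id,         /* "FB_ID" */
                                       uint32_t prop_in_fence_fd,   /* "IN_FENCE_FD" */
                                       uint32_t prop_out_fence_ptr, /* "OUT_FENCE_PTR" */
                                       int in_fence_fd, int *out_fence_fd)
    {
        drmModeAtomicReq *req = drmModeAtomicAlloc();
        int ret;

        drmModeAtomicAddProperty(req, plane_id, prop_fb_id, fb_id);
        drmModeAtomicAddProperty(req, plane_id, prop_in_fence_fd, in_fence_fd);
        drmModeAtomicAddProperty(req, crtc_id, prop_out_fence_ptr,
                                 (uint64_t)(uintptr_t)out_fence_fd);

        ret = drmModeAtomicCommit(drm_fd, req, DRM_MODE_ATOMIC_NONBLOCK, NULL);
        drmModeAtomicFree(req);
        return ret;
    }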

Hi Jason,

thanks for pushing this forward and for the comprehensive explanation of
what it is about.

My question would be: what exactly do you now need from Wayland compositor devs?
I understood a Wayland compositor needs to:
* do atomic page flips,
* support [1].

Is there something else? You described a mechanism to pull sync_files
out of dma-bufs and push them back in, depending on what the client
provides and what kind of output the compositor puts the final image
onto. For now that's just an idea (plus your WIP implementation in
Vulkan/the kernel), and there is nothing that can be done specifically
in Wayland compositors for it yet, or is there?

Thanks
Roman

[1] https://gitlab.freedesktop.org/wayland/wayland-protocols/blob/master/unstable/linux-explicit-synchronization/linux-explicit-synchronization-unstable-v1.xml


> >  - v4l: ???
> >  - gstreamer: ???
> >  - Media APIs such as vaapi etc.:  ???
> >
> >
> > ## Chicken and egg problems
> >
> > Ok, this is where it starts getting depressing.  I made the claim
> > above that Wayland has an explicit synchronization protocol that's of
> > questionable usefulness.  I would claim that basically any bit of
> > plumbing we do through window systems is currently of questionable
> > usefulness.  Why?
> >
> > From my perspective, as a Vulkan driver developer, I have to deal with
> > the fact that Vulkan is an explicit sync API but Wayland and X11
> > aren't.  Unfortunately, the Wayland extension solves zero problems for
> > me because I can't really use it unless it's implemented in all of the
> > compositors.  Until every Wayland compositor I care about my users
> > being able to use (which is basically all of them) supports the
> > extension, I have to continue to carry around my pile of hacks to keep
> > implicit sync and Vulkan working nicely together.
> >
> > From the perspective of a Wayland compositor (I used to play in this
> > space), they'd love to implement the new explicit sync extension but
> > can't.  Sure, they could wire up the extension, but the moment they go
> > to flip a client buffer to the screen directly, they discover that KMS
> > doesn't support any explicit sync APIs.
>
> As per the above correction, Wayland compositors aren't nearly as bad
> off as I initially thought.  There may still be weird screen capture
> cases but the normal cases of compositing and displaying via
> KMS/atomic should be in reasonably good shape.
>
> > So, yes, they can technically
> > implement the extension assuming the EGL stack they're running on has
> > the sync_file extensions but any client buffers which come in using
> > the explicit sync Wayland extension have to be composited and can't be
> > scanned out directly.  As a 3D driver developer, I absolutely don't
> > want compositors doing that because my users will complain about
> > performance issues due to the extra blit.
> >
> > Ok, so let's say we get KMS wired up with explicit sync.  That solves
> > all our problems, right?  It does, right up until someone decides that
> > they want to screen capture their Wayland session via some hardware
> > media encoder that doesn't support explicit sync.  Now we have to
> > plumb it all the way through the media stack, gstreamer, etc.  Great,
> > so let's do that!  Oh, but gstreamer won't want to plumb it through
> > until they're guaranteed that they can use explicit sync when
> > displaying on X11 or Wayland.  Are you seeing the problem?
> >
> > To make matters worse, since most things are doing implicit
> > synchronization today, it's really easy to get your explicit
> > synchronization wrong and never notice.  If you forget to pass a
> > sync_file into one place (say you never notice KMS doesn't support
> > them), it will probably work anyway thanks to all the implicit sync
> > that's going on elsewhere.
> >
> > So, clearly, we all need to go write piles of code that we can't
> > actually properly test until everyone else has written their piece and
> > then we use explicit sync if and only if all components support it.
> > Really?  We're going to do multiple years of development and then just
> > hope it works when we finally flip the switch?  That doesn't sound
> > like a good plan to me.
> >
> >
> > ## A proposal: Implicit and explicit sync together
> >
> > How to solve all these chicken-and-egg problems is something I've been
> > giving quite a bit of thought (and talking with many others about) in
> > the last couple of years.  One motivation for this is that we have to
> > deal with a mismatch in Vulkan.  Another motivation is that I'm
> > becoming increasingly unhappy with the way that synchronization,
> > memory residency, and command submission are inherently intertwined in
> > i915 and would like to break things apart.  Towards that end, I have
> > an actual proposal.
> >
> > A couple weeks ago, I sent a series of patches to the dri-devel
> > mailing list which adds a pair of new ioctls to dma-buf which allow
> > userspace to manually import or export a sync_file from a dma-buf.
> > The idea is that something like a Wayland compositor can switch to
> > 100% explicit sync internally once the ioctl is available.  If it gets
> > buffers in from a client that doesn't use the explicit sync extension,
> > it can pull a sync_file from the dma-buf and use that exactly as it
> > would a sync_file passed via the explicit sync extension.  When it
> > goes to scan out a user buffer and discovers that KMS doesn't accept
> > sync_files (or if it tries to use that pesky media encoder no one has
> > converted), it can take its sync_file for display and stuff it into
> > the dma-buf before handing it to KMS.
> >
> > Along with the kernel patches, I've also implemented support for this
> > in the Vulkan WSI code used by ANV and RADV.  With those patches, the
> > only requirement on the Vulkan drivers is that you be able to export
> > any VkSemaphore as a sync_file and temporarily import a sync_file into
> > any VkFence or VkSemaphore.  As long as that works, the core Vulkan
> > driver only ever sees explicit synchronization via sync_file.  The WSI
> > code uses these new ioctls to translate the implicit sync of X11 and
> > Wayland to the explicit sync the Vulkan driver wants.
> >
> > I'm hoping (and here's where I want a sanity check) that a simple API
> > like this will allow us to finally start moving the Linux ecosystem
> > over to explicit synchronization one piece at a time in a way that's
> > actually correct.  (No Wayland explicit sync with compositors hoping
> > KMS magically works even though it doesn't have a sync_file API.)
> > Once some pieces in the ecosystem start moving, there will be
> > motivation to start moving others and maybe we can actually build the
> > momentum to get most everything converted.
> >
> > For reference, you can find the kernel RFC patches and mesa MR here:
> >
> > https://lists.freedesktop.org/archives/dri-devel/2020-March/258833.html
> >
> > https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/4037
> >
> > At this point, I welcome your thoughts, comments, objections, and
> > maybe even help/review. :-)
> >
> > --Jason Ekstrand
> _______________________________________________
> dri-devel mailing list
> dri-devel@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: Plumbing explicit synchronization through the Linux ecosystem
  2020-03-16 23:41     ` Roman Gilg
@ 2020-03-17  3:37       ` Jason Ekstrand
  -1 siblings, 0 replies; 101+ messages in thread
From: Jason Ekstrand @ 2020-03-17  3:37 UTC (permalink / raw)
  To: Roman Gilg
  Cc: ML mesa-dev, Discussion of the development of and with GStreamer,
	wayland-devel @ lists . freedesktop . org, xorg-devel,
	Maling list - DRI developers, linux-media, Dave Airlie,
	Daniel Vetter, Bas Nieuwenhuizen, Daniel Stone

On Mon, Mar 16, 2020 at 6:39 PM Roman Gilg <subdiff@gmail.com> wrote:
>
> On Wed, Mar 11, 2020 at 8:21 PM Jason Ekstrand <jason@jlekstrand.net> wrote:
> >
> > On Wed, Mar 11, 2020 at 12:31 PM Jason Ekstrand <jason@jlekstrand.net> wrote:
> > >
> > > All,
> > >
> > > Sorry for casting such a broad net with this one. I'm sure most people
> > > who reply will get at least one mailing list rejection.  However, this
> > > is an issue that affects a LOT of components and that's why it's
> > > thorny to begin with.  Please pardon the length of this e-mail as
> > > well; I promise there's a concrete point/proposal at the end.
> > >
> > >
> > > Explicit synchronization is the future of graphics and media.  At
> > > least, that seems to be the consensus among all the graphics people
> > > I've talked to.  I had a chat with one of the lead Android graphics
> > > engineers recently who told me that doing explicit sync from the start
> > > was one of the best engineering decisions Android ever made.  It's
> > > also the direction being taken by more modern APIs such as Vulkan.
> > >
> > >
> > > ## What are implicit and explicit synchronization?
> > >
> > > For those that aren't familiar with this space, GPUs, media encoders,
> > > etc. are massively parallel and synchronization of some form is
> > > required to ensure that everything happens in the right order and
> > > avoid data races.  Implicit synchronization is when bits of work (3D,
> > > compute, video encode, etc.) are implicitly based on the absolute
> > > CPU-time order in which API calls occur.  Explicit synchronization is
> > > when the client (whatever that means in any given context) provides
> > > the dependency graph explicitly via some sort of synchronization
> > > primitives.  If you're still confused, consider the following
> > > examples:
> > >
> > > With OpenGL and EGL, almost everything is implicit sync.  Say you have
> > > two OpenGL contexts sharing an image where one writes to it and the
> > > other textures from it.  The way the OpenGL spec works, the client has
> > > to make the API calls to render to the image before (in CPU time) it
> > > makes the API calls which texture from the image.  As long as it does
> > > this (and maybe inserts a glFlush?), the driver will ensure that the
> > > rendering completes before the texturing happens and you get correct
> > > contents.
> > >
> > > Implicit synchronization can also happen across processes.  Wayland,
> > > for instance, is currently built on implicit sync where the client
> > > does their rendering and then does a hand-off (via wl_surface::commit)
> > > to tell the compositor it's done at which point the compositor can now
> > > texture from the surface.  The hand-off ensures that the client's
> > > OpenGL API calls happen before the server's OpenGL API calls.
> > >
> > > A good example of explicit synchronization is the Vulkan API.  There,
> > > a client (or multiple clients) can simultaneously build command
> > > buffers in different threads where one of those command buffers
> > > renders to an image and the other textures from it and then submit
> > > both of them at the same time with instructions to the driver for
> > > which order to execute them in.  The execution order is described via
> > > the VkSemaphore primitive.  With the new VK_KHR_timeline_semaphore
> > > extension, you can even submit the work which does the texturing
> > > BEFORE the work which does the rendering and the driver will sort it
> > > out.
> > >
> > > The #1 problem with implicit synchronization (which explicit solves)
> > > is that it leads to a lot of over-synchronization both in client space
> > > and in driver/device space.  The client has to synchronize a lot more
> > > because it has to ensure that the API calls happen in a particular
> > > order.  The driver/device have to synchronize a lot more because they
> > > never know what is going to end up being a synchronization point as an
> > > API call on another thread/process may occur at any time.  As we move
> > > to more and more multi-threaded programming this synchronization (on
> > > the client-side especially) becomes more and more painful.
> > >
> > >
> > > ## Current status in Linux
> > >
> > > Implicit synchronization in Linux works via the kernel's internal
> > > dma_buf and dma_fence data structures.  A dma_fence is a tiny object
> > > which represents the "done" status for some bit of work.  Typically,
> > > dma_fences are created as a by-product of someone submitting some bit
> > > of work (say, 3D rendering) to the kernel.  The dma_buf object has a
> > > set of dma_fences on it representing shared (read) and exclusive
> > > (write) access to the object.  When work is submitted which, for
> > > instance, renders to the dma_buf, it's queued waiting on all the fences
> > > on the dma_buf, and a dma_fence is created representing the end of
> > > said rendering work and it's installed as the dma_buf's exclusive
> > > fence.  This way, the kernel can manage all its internal queues (3D
> > > rendering, display, video encode, etc.) and know which things to
> > > submit in what order.
> > >
> > > For the last few years, we've had sync_file in the kernel and it's
> > > plumbed into some drivers.  A sync_file is just a wrapper around a
> > > single dma_fence.  A sync_file is typically created as a by-product of
> > > submitting work (3D, compute, etc.) to the kernel and is signaled when
> > > that work completes.  When a sync_file is created, it is guaranteed by
> > > the kernel that it will become signaled in finite time and, once it's
> > > signaled, it remains signaled for the rest of time.  A sync_file is
> > > represented in UAPIs as a file descriptor and can be used with normal
> > > file APIs such as dup().  It can be passed into another UAPI which
> > > does some bit of queued work and the submitted work will wait for the
> > > sync_file to be triggered before executing.  A sync_file also supports
> > > poll() if you want to wait on it manually.
> > >
> > > Unfortunately, sync_file is not broadly used and not all kernel GPU
> > > drivers support it.  Here's a very quick overview of my understanding
> > > of the status of various components (I don't know the status of
> > > anything in the media world):
> > >
> > >  - Vulkan: Explicit synchronization all the way but we have to go
> > > implicit as soon as we interact with a window-system.  Vulkan has APIs
> > > to import/export sync_files to/from its VkSemaphore and VkFence
> > > synchronization primitives.
> > >  - OpenGL: Implicit all the way.  There are some EGL extensions to
> > > enable some forms of explicit sync via sync_file but OpenGL itself is
> > > still implicit.
> > >  - Wayland: Currently depends on implicit sync in the kernel (accessed
> > > via EGL/OpenGL).  There is an unstable extension to allow passing
> > > sync_files around but it's questionable how useful it is right now
> > > (more on that later).
> > >  - X11: With present, it has these "explicit" fence objects but
> > > they're always a shmfence which lets the X server and client do a
> > > userspace CPU-side hand-off without going over the socket (and
> > > round-tripping through the kernel).  However, the only thing that
> > > fence does is order the OpenGL API calls in the client and server and
> > > the real synchronization is still implicit.
> > >  - linux/i915/gem: Fully supports using sync_file or syncobj for explicit sync.
> > >  - linux/amdgpu: Supports sync_file and syncobj but it still
> > > implicitly syncs sometimes due to its internal memory residency
> > > handling which can lead to over-synchronization.
> > >  - KMS: Implicit sync all the way.  There are no KMS APIs which take
> > > explicit sync primitives.
> >
> > Correction:  Apparently, I missed some things.  If you use atomic, KMS
> > does have explicit in- and out-fences.  Non-atomic users (e.g. X11)
> > are still in trouble but most Wayland compositors use atomic these
> > days.
>
> Hi Jason,
>
> thanks for pushing this forward and for the comprehensive explanation of
> what it is about.
>
> My question would be: what exactly do you now need from Wayland compositor devs?
> I understood a Wayland compositor needs to:
> * do atomic page flips,
> * support [1].

Yup, that's pretty much what's needed.  From the looks of
https://gitlab.gnome.org/GNOME/mutter/issues/548, it appears that
mutter is at least on its way to atomic, though it also looks like a
long road.
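
For what it's worth, the client-facing side of [1] is small.  A rough
sketch, assuming wayland-scanner-generated bindings and a
zwp_linux_surface_synchronization_v1 object created once per surface
(error handling and the release listener omitted):

    #include <wayland-client.h>
    #include "linux-explicit-synchronization-unstable-v1-client-protocol.h"

    /* Attach a buffer together with an explicit acquire fence (a sync_file
     * from the client's GPU work).  'surf_sync' was created earlier with
     * zwp_linux_explicit_synchronization_v1_get_synchronization(). */
    static void commit_with_acquire_fence(
            struct zwp_linux_surface_synchronization_v1 *surf_sync,
            struct wl_surface *surface, struct wl_buffer *buffer,
            int acquire_fence_fd)
    {
        /* The compositor must wait on this before reading the buffer. */
        zwp_linux_surface_synchronization_v1_set_acquire_fence(surf_sync,
                                                               acquire_fence_fd);

        /* Per-commit release object: its fenced_release event later hands
         * the client a fence to wait on before reusing the buffer. */
        struct zwp_linux_buffer_release_v1 *release =
            zwp_linux_surface_synchronization_v1_get_release(surf_sync);
        (void)release; /* add a zwp_linux_buffer_release_v1_listener here */

        wl_surface_attach(surface, buffer, 0, 0);
        wl_surface_commit(surface);
    }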

> Is there something else? You described a mechanism to pull sync_files
> out of dma-bufs and push them back in, depending on what the client
> provides and what kind of output the compositor puts the final image
> onto. For now that's just an idea (plus your WIP implementation in
> Vulkan/the kernel), and there is nothing that can be done specifically
> in Wayland compositors for it yet, or is there?

The kernel patches I proposed are mostly a way for explicit-sync APIs
such as Vulkan to reasonably interop with the implicit sync world.
Right now, for instance, we have no real plan for Vulkan to be able to
handle synchronization when talking to a media encoder.  We have the
modifiers and dma-buf stuff which deals with image layout but
synchronization is currently an unsolved problem.  If we have a
mechanism for sync_file import/export from dma-buf then an app can use
implicit sync when talking to VAAPI (as an example) and turn that into
sync_files to talk to Vulkan.  Better yet, we could plumb VAAPI for
explicit sync but that's yet one more thing that needs to support
explicit sync.
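
In code, that bridge is just the earlier pieces glued together.  A
hypothetical flow (the helper names refer back to the sketches above and
are not real API):

    #include <unistd.h>
    #include <vulkan/vulkan.h>

    /* Hypothetical helpers as sketched earlier in the thread. */
    extern int      dmabuf_export_sync_file(int dmabuf_fd);
    extern int      dmabuf_import_sync_file(int dmabuf_fd, int sync_file_fd);
    extern int      semaphore_to_sync_file(VkDevice dev, VkSemaphore sem,
                                           PFN_vkGetSemaphoreFdKHR get_fd);
    extern VkResult sync_file_to_semaphore(VkDevice dev, VkSemaphore sem, int fd,
                                           PFN_vkImportSemaphoreFdKHR import_fd);

    /* VAAPI decoded into 'dmabuf_fd' using implicit sync only.  Bridge that
     * into Vulkan, render, then put Vulkan's "done" fence back so an
     * implicit-sync consumer (encoder, legacy KMS) sees the result safely. */
    static void bridge_vaapi_and_vulkan(VkDevice dev, int dmabuf_fd,
                                        VkSemaphore wait_sem, VkSemaphore signal_sem,
                                        PFN_vkGetSemaphoreFdKHR get_fd,
                                        PFN_vkImportSemaphoreFdKHR import_fd)
    {
        /* 1. Implicit -> explicit: the decoder's pending work as a sync_file. */
        int in_fd = dmabuf_export_sync_file(dmabuf_fd);
        sync_file_to_semaphore(dev, wait_sem, in_fd, import_fd);

        /* 2. ... vkQueueSubmit() waiting on wait_sem, signaling signal_sem ... */

        /* 3. Explicit -> implicit: make the render-done fence visible to
         *    implicit-sync consumers of the dma-buf. */
        int out_fd = semaphore_to_sync_file(dev, signal_sem, get_fd);
        dmabuf_import_sync_file(dmabuf_fd, out_fd);
        close(out_fd);  /* the import takes a reference; we keep the fd */
    }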

Make sense?

--Jason


> Thanks
> Roman
>
> [1] https://gitlab.freedesktop.org/wayland/wayland-protocols/blob/master/unstable/linux-explicit-synchronization/linux-explicit-synchronization-unstable-v1.xml
>
>
> > >  - v4l: ???
> > >  - gstreamer: ???
> > >  - Media APIs such as vaapi etc.:  ???
> > >
> > >
> > > ## Chicken and egg problems
> > >
> > > Ok, this is where it starts getting depressing.  I made the claim
> > > above that Wayland has an explicit synchronization protocol that's of
> > > questionable usefulness.  I would claim that basically any bit of
> > > plumbing we do through window systems is currently of questionable
> > > usefulness.  Why?
> > >
> > > From my perspective, as a Vulkan driver developer, I have to deal with
> > > the fact that Vulkan is an explicit sync API but Wayland and X11
> > > aren't.  Unfortunately, the Wayland extension solves zero problems for
> > > me because I can't really use it unless it's implemented in all of the
> > > compositors.  Until every Wayland compositor I care about my users
> > > being able to use (which is basically all of them) supports the
> > > extension, I have to continue to carry around my pile of hacks to keep
> > > implicit sync and Vulkan working nicely together.
> > >
> > > From the perspective of a Wayland compositor (I used to play in this
> > > space), they'd love to implement the new explicit sync extension but
> > > can't.  Sure, they could wire up the extension, but the moment they go
> > > to flip a client buffer to the screen directly, they discover that KMS
> > > doesn't support any explicit sync APIs.
> >
> > As per the above correction, Wayland compositors aren't nearly as bad
> > off as I initially thought.  There may still be weird screen capture
> > cases but the normal cases of compositing and displaying via
> > KMS/atomic should be in reasonably good shape.
> >
> > > So, yes, they can technically
> > > implement the extension assuming the EGL stack they're running on has
> > > the sync_file extensions but any client buffers which come in using
> > > the explicit sync Wayland extension have to be composited and can't be
> > > scanned out directly.  As a 3D driver developer, I absolutely don't
> > > want compositors doing that because my users will complain about
> > > performance issues due to the extra blit.
> > >
> > > Ok, so let's say we get KMS wired up with explicit sync.  That solves
> > > all our problems, right?  It does, right up until someone decides that
> > > they want to screen capture their Wayland session via some hardware
> > > media encoder that doesn't support explicit sync.  Now we have to
> > > plumb it all the way through the media stack, gstreamer, etc.  Great,
> > > so let's do that!  Oh, but gstreamer won't want to plumb it through
> > > until they're guaranteed that they can use explicit sync when
> > > displaying on X11 or Wayland.  Are you seeing the problem?
> > >
> > > To make matters worse, since most things are doing implicit
> > > synchronization today, it's really easy to get your explicit
> > > synchronization wrong and never notice.  If you forget to pass a
> > > sync_file into one place (say you never notice KMS doesn't support
> > > them), it will probably work anyway thanks to all the implicit sync
> > > that's going on elsewhere.
> > >
> > > So, clearly, we all need to go write piles of code that we can't
> > > actually properly test until everyone else has written their piece and
> > > then we use explicit sync if and only if all components support it.
> > > Really?  We're going to do multiple years of development and then just
> > > hope it works when we finally flip the switch?  That doesn't sound
> > > like a good plan to me.
> > >
> > >
> > > ## A proposal: Implicit and explicit sync together
> > >
> > > How to solve all these chicken-and-egg problems is something I've been
> > > giving quite a bit of thought (and talking with many others about) in
> > > the last couple of years.  One motivation for this is that we have to
> > > deal with a mismatch in Vulkan.  Another motivation is that I'm
> > > becoming increasingly unhappy with the way that synchronization,
> > > memory residency, and command submission are inherently intertwined in
> > > i915 and would like to break things apart.  Towards that end, I have
> > > an actual proposal.
> > >
> > > A couple weeks ago, I sent a series of patches to the dri-devel
> > > mailing list which adds a pair of new ioctls to dma-buf which allow
> > > userspace to manually import or export a sync_file from a dma-buf.
> > > The idea is that something like a Wayland compositor can switch to
> > > 100% explicit sync internally once the ioctl is available.  If it gets
> > > buffers in from a client that doesn't use the explicit sync extension,
> > > it can pull a sync_file from the dma-buf and use that exactly as it
> > > would a sync_file passed via the explicit sync extension.  When it
> > > goes to scan out a user buffer and discovers that KMS doesn't accept
> > > sync_files (or if it tries to use that pesky media encoder no one has
> > > converted), it can take its sync_file for display and stuff it into
> > > the dma-buf before handing it to KMS.
> > >
> > > Along with the kernel patches, I've also implemented support for this
> > > in the Vulkan WSI code used by ANV and RADV.  With those patches, the
> > > only requirement on the Vulkan drivers is that you be able to export
> > > any VkSemaphore as a sync_file and temporarily import a sync_file into
> > > any VkFence or VkSemaphore.  As long as that works, the core Vulkan
> > > driver only ever sees explicit synchronization via sync_file.  The WSI
> > > code uses these new ioctls to translate the implicit sync of X11 and
> > > Wayland to the explicit sync the Vulkan driver wants.
> > >
> > > I'm hoping (and here's where I want a sanity check) that a simple API
> > > like this will allow us to finally start moving the Linux ecosystem
> > > over to explicit synchronization one piece at a time in a way that's
> > > actually correct.  (No Wayland explicit sync with compositors hoping
> > > KMS magically works even though it doesn't have a sync_file API.)
> > > Once some pieces in the ecosystem start moving, there will be
> > > motivation to start moving others and maybe we can actually build the
> > > momentum to get most everything converted.
> > >
> > > For reference, you can find the kernel RFC patches and mesa MR here:
> > >
> > > https://lists.freedesktop.org/archives/dri-devel/2020-March/258833.html
> > >
> > > https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/4037
> > >
> > > At this point, I welcome your thoughts, comments, objections, and
> > > maybe even help/review. :-)
> > >
> > > --Jason Ekstrand
> > _______________________________________________
> > dri-devel mailing list
> > dri-devel@lists.freedesktop.org
> > https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: Plumbing explicit synchronization through the Linux ecosystem
@ 2020-03-17  3:37       ` Jason Ekstrand
  0 siblings, 0 replies; 101+ messages in thread
From: Jason Ekstrand @ 2020-03-17  3:37 UTC (permalink / raw)
  To: Roman Gilg
  Cc: Daniel Vetter, xorg-devel, Maling list - DRI developers,
	wayland-devel @ lists . freedesktop . org,
	Discussion of the development of and with GStreamer, ML mesa-dev,
	linux-media

On Mon, Mar 16, 2020 at 6:39 PM Roman Gilg <subdiff@gmail.com> wrote:
>
> On Wed, Mar 11, 2020 at 8:21 PM Jason Ekstrand <jason@jlekstrand.net> wrote:
> >
> > On Wed, Mar 11, 2020 at 12:31 PM Jason Ekstrand <jason@jlekstrand.net> wrote:
> > >
> > > All,
> > >
> > > Sorry for casting such a broad net with this one. I'm sure most people
> > > who reply will get at least one mailing list rejection.  However, this
> > > is an issue that affects a LOT of components and that's why it's
> > > thorny to begin with.  Please pardon the length of this e-mail as
> > > well; I promise there's a concrete point/proposal at the end.
> > >
> > >
> > > Explicit synchronization is the future of graphics and media.  At
> > > least, that seems to be the consensus among all the graphics people
> > > I've talked to.  I had a chat with one of the lead Android graphics
> > > engineers recently who told me that doing explicit sync from the start
> > > was one of the best engineering decisions Android ever made.  It's
> > > also the direction being taken by more modern APIs such as Vulkan.
> > >
> > >
> > > ## What are implicit and explicit synchronization?
> > >
> > > For those that aren't familiar with this space, GPUs, media encoders,
> > > etc. are massively parallel and synchronization of some form is
> > > required to ensure that everything happens in the right order and
> > > avoid data races.  Implicit synchronization is when bits of work (3D,
> > > compute, video encode, etc.) are implicitly based on the absolute
> > > CPU-time order in which API calls occur.  Explicit synchronization is
> > > when the client (whatever that means in any given context) provides
> > > the dependency graph explicitly via some sort of synchronization
> > > primitives.  If you're still confused, consider the following
> > > examples:
> > >
> > > With OpenGL and EGL, almost everything is implicit sync.  Say you have
> > > two OpenGL contexts sharing an image where one writes to it and the
> > > other textures from it.  The way the OpenGL spec works, the client has
> > > to make the API calls to render to the image before (in CPU time) it
> > > makes the API calls which texture from the image.  As long as it does
> > > this (and maybe inserts a glFlush?), the driver will ensure that the
> > > rendering completes before the texturing happens and you get correct
> > > contents.
> > >
> > > Implicit synchronization can also happen across processes.  Wayland,
> > > for instance, is currently built on implicit sync where the client
> > > does their rendering and then does a hand-off (via wl_surface::commit)
> > > to tell the compositor it's done at which point the compositor can now
> > > texture from the surface.  The hand-off ensures that the client's
> > > OpenGL API calls happen before the server's OpenGL API calls.
> > >
> > > A good example of explicit synchronization is the Vulkan API.  There,
> > > a client (or multiple clients) can simultaneously build command
> > > buffers in different threads where one of those command buffers
> > > renders to an image and the other textures from it and then submit
> > > both of them at the same time with instructions to the driver for
> > > which order to execute them in.  The execution order is described via
> > > the VkSemaphore primitive.  With the new VK_KHR_timeline_semaphore
> > > extension, you can even submit the work which does the texturing
> > > BEFORE the work which does the rendering and the driver will sort it
> > > out.
> > >
> > > The #1 problem with implicit synchronization (which explicit solves)
> > > is that it leads to a lot of over-synchronization both in client space
> > > and in driver/device space.  The client has to synchronize a lot more
> > > because it has to ensure that the API calls happen in a particular
> > > order.  The driver/device have to synchronize a lot more because they
> > > never know what is going to end up being a synchronization point as an
> > > API call on another thread/process may occur at any time.  As we move
> > > to more and more multi-threaded programming this synchronization (on
> > > the client-side especially) becomes more and more painful.
> > >
> > >
> > > ## Current status in Linux
> > >
> > > Implicit synchronization in Linux works via a the kernel's internal
> > > dma_buf and dma_fence data structures.  A dma_fence is a tiny object
> > > which represents the "done" status for some bit of work.  Typically,
> > > dma_fences are created as a by-product of someone submitting some bit
> > > of work (say, 3D rendering) to the kernel.  The dma_buf object has a
> > > set of dma_fences on it representing shared (read) and exclusive
> > > (write) access to the object.  When work is submitted which, for
> > > instance renders to the dma_buf, it's queued waiting on all the fences
> > > on the dma_buf and and a dma_fence is created representing the end of
> > > said rendering work and it's installed as the dma_buf's exclusive
> > > fence.  This way, the kernel can manage all its internal queues (3D
> > > rendering, display, video encode, etc.) and know which things to
> > > submit in what order.
> > >
> > > For the last few years, we've had sync_file in the kernel and it's
> > > plumbed into some drivers.  A sync_file is just a wrapper around a
> > > single dma_fence.  A sync_file is typically created as a by-product of
> > > submitting work (3D, compute, etc.) to the kernel and is signaled when
> > > that work completes.  When a sync_file is created, it is guaranteed by
> > > the kernel that it will become signaled in finite time and, once it's
> > > signaled, it remains signaled for the rest of time.  A sync_file is
> > > represented in UAPIs as a file descriptor and can be used with normal
> > > file APIs such as dup().  It can be passed into another UAPI which
> > > does some bit of queue'd work and the submitted work will wait for the
> > > sync_file to be triggered before executing.  A sync_file also supports
> > > poll() if  you want to wait on it manually.
> > >
> > > Unfortunately, sync_file is not broadly used and not all kernel GPU
> > > drivers support it.  Here's a very quick overview of my understanding
> > > of the status of various components (I don't know the status of
> > > anything in the media world):
> > >
> > >  - Vulkan: Explicit synchronization all the way but we have to go
> > > implicit as soon as we interact with a window-system.  Vulkan has APIs
> > > to import/export sync_files to/from it's VkSemaphore and VkFence
> > > synchronization primitives.
> > >  - OpenGL: Implicit all the way.  There are some EGL extensions to
> > > enable some forms of explicit sync via sync_file but OpenGL itself is
> > > still implicit.
> > >  - Wayland: Currently depends on implicit sync in the kernel (accessed
> > > via EGL/OpenGL).  There is an unstable extension to allow passing
> > > sync_files around but it's questionable how useful it is right now
> > > (more on that later).
> > >  - X11: With present, it has these "explicit" fence objects but
> > > they're always a shmfence which lets the X server and client do a
> > > userspace CPU-side hand-off without going over the socket (and
> > > round-tripping through the kernel).  However, the only thing that
> > > fence does is order the OpenGL API calls in the client and server and
> > > the real synchronization is still implicit.
> > >  - linux/i915/gem: Fully supports using sync_file or syncobj for explicit sync.
> > >  - linux/amdgpu: Supports sync_file and syncobj but it still
> > > implicitly syncs sometimes due to its internal memory residency
> > > handling which can lead to over-synchronization.
> > >  - KMS: Implicit sync all the way.  There are no KMS APIs which take
> > > explicit sync primitives.
> >
> > Correction:  Apparently, I missed some things.  If you use atomic, KMS
> > does have explicit in- and out-fences.  Non-atomic users (e.g. X11)
> > are still in trouble, but most Wayland compositors use atomic these
> > days.
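
Concretely, with the atomic API the fences ride along as KMS properties: an
"IN_FENCE_FD" property on each plane and an "OUT_FENCE_PTR" property on each
CRTC.  A rough libdrm sketch (property IDs are assumed to have been looked up
beforehand with drmModeObjectGetProperties(); FB_ID and the rest of the commit
are omitted):

#include <stdint.h>
#include <xf86drm.h>
#include <xf86drmMode.h>

static int commit_with_fences(int drm_fd, uint32_t plane_id, uint32_t crtc_id,
                              uint32_t prop_in_fence_fd,   /* plane IN_FENCE_FD */
                              uint32_t prop_out_fence_ptr, /* CRTC OUT_FENCE_PTR */
                              int acquire_fd, int *release_fd)
{
    drmModeAtomicReq *req = drmModeAtomicAlloc();
    int64_t out_fence = -1;
    int ret;

    /* The kernel waits on this sync_file before scanning out the buffer. */
    drmModeAtomicAddProperty(req, plane_id, prop_in_fence_fd, acquire_fd);
    /* The kernel writes a new sync_file fd here; it signals on the flip. */
    drmModeAtomicAddProperty(req, crtc_id, prop_out_fence_ptr,
                             (uint64_t)(uintptr_t)&out_fence);

    ret = drmModeAtomicCommit(drm_fd, req, DRM_MODE_ATOMIC_NONBLOCK, NULL);
    drmModeAtomicFree(req);

    *release_fd = (int)out_fence;
    return ret;
}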
>
> Hi Jason,
>
> thanks for pushing this forward and the comprehensive explanation on
> what it is about.
>
> My question would be what exactly do you now need from Wayland compositor devs?
> I understood a Wayland compositor needs to:
> * do atomic page flips,
> * support [1].

Yup, that's pretty much what's needed.  From the looks of
https://gitlab.gnome.org/GNOME/mutter/issues/548, it appears that
mutter is at least on its way to atomic, though it also looks like a
long road.

> Is there something else? You described a mechanism to pull out and
> push in these sync_files to dma-bufs depending on what the client
> provides and what kind of output the compositor puts the final image
> onto. That's for now just an idea (plus your wip implementation in
> Vulkan/kernel) and there is not yet anything that can be done for this
> specifically in Wayland compositors, or is there?

The kernel patches I proposed are mostly a way for explicit-sync APIs
such as Vulkan to reasonably interop with the implicit sync world.
Right now, for instance, we have no real plan for Vulkan to be able to
handle synchronization when talking to a media encoder.  We have the
modifiers and dma-buf stuff which deals with image layout but
synchronization is currently an unsolved problem.  If we have a
mechanism for sync_file import/export from dma-buf then an app can use
implicit sync when talking to VAAPI (as an example) and turn that into
sync_files to talk to Vulkan.  Better yet, we could plumb VAAPI for
explicit sync but that's yet one more thing that needs to support
explicit sync.
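
For reference, the proposal boils down to two new ioctls on the dma-buf fd.
A sketch of how userspace would drive them (struct and ioctl names are
assumed from the RFC and may change before anything is merged):

#include <linux/dma-buf.h>   /* proposed additions from the RFC */
#include <sys/ioctl.h>

/* Pull the buffer's current implicit fences out as a sync_file fd... */
static int dmabuf_export_sync_file(int dmabuf_fd)
{
    struct dma_buf_export_sync_file arg = {
        .flags = DMA_BUF_SYNC_READ | DMA_BUF_SYNC_WRITE,
        .fd = -1,
    };

    if (ioctl(dmabuf_fd, DMA_BUF_IOCTL_EXPORT_SYNC_FILE, &arg) < 0)
        return -1;
    return arg.fd;
}

/* ...or stuff a sync_file back in as the buffer's implicit write fence. */
static int dmabuf_import_sync_file(int dmabuf_fd, int sync_fd)
{
    struct dma_buf_import_sync_file arg = {
        .flags = DMA_BUF_SYNC_WRITE,
        .fd = sync_fd,
    };

    return ioctl(dmabuf_fd, DMA_BUF_IOCTL_IMPORT_SYNC_FILE, &arg);
}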

Make sense?

--Jason


> Thanks
> Roman
>
> [1] https://gitlab.freedesktop.org/wayland/wayland-protocols/blob/master/unstable/linux-explicit-synchronization/linux-explicit-synchronization-unstable-v1.xml
>
>
> > >  - v4l: ???
> > >  - gstreamer: ???
> > >  - Media APIs such as vaapi etc.:  ???
> > >
> > >
> > > ## Chicken and egg problems
> > >
> > > Ok, this is where it starts getting depressing.  I made the claim
> > > above that Wayland has an explicit synchronization protocol that's of
> > > questionable usefulness.  I would claim that basically any bit of
> > > plumbing we do through window systems is currently of questionable
> > > usefulness.  Why?
> > >
> > > From my perspective, as a Vulkan driver developer, I have to deal with
> > > the fact that Vulkan is an explicit sync API but Wayland and X11
> > > aren't.  Unfortunately, the Wayland extension solves zero problems for
> > > me because I can't really use it unless it's implemented in all of the
> > > compositors.  Until every Wayland compositor I care about my users
> > > being able to use (which is basically all of them) supports the
> > > extension, I have to continue to carry around my pile of hacks to keep
> > > implicit sync and Vulkan working nicely together.
> > >
> > > From the perspective of a Wayland compositor (I used to play in this
> > > space), they'd love to implement the new explicit sync extension but
> > > can't.  Sure, they could wire up the extension, but the moment they go
> > > to flip a client buffer to the screen directly, they discover that KMS
> > > doesn't support any explicit sync APIs.
> >
> > As per the above correction, Wayland compositors aren't nearly as bad
> > off as I initially thought.  There may still be weird screen capture
> > cases but the normal cases of compositing and displaying via
> > KMS/atomic should be in reasonably good shape.
> >
> > > So, yes, they can technically
> > > implement the extension assuming the EGL stack they're running on has
> > > the sync_file extensions but any client buffers which come in using
> > > the explicit sync Wayland extension have to be composited and can't be
> > > scanned out directly.  As a 3D driver developer, I absolutely don't
> > > want compositors doing that because my users will complain about
> > > performance issues due to the extra blit.
> > >
> > > Ok, so let's say we get KMS wired up with explicit sync.  That solves
> > > all our problems, right?  It does, right up until someone decides that
> > > they want to screen capture their Wayland session via some hardware
> > > media encoder that doesn't support explicit sync.  Now we have to
> > > plumb it all the way through the media stack, gstreamer, etc.  Great,
> > > so let's do that!  Oh, but gstreamer won't want to plumb it through
> > > until they're guaranteed that they can use explicit sync when
> > > displaying on X11 or Wayland.  Are you seeing the problem?
> > >
> > > To make matters worse, since most things are doing implicit
> > > synchronization today, it's really easy to get your explicit
> > > synchronization wrong and never notice.  If you forget to pass a
> > > sync_file into one place (say you never notice KMS doesn't support
> > > them), it will probably work anyway thanks to all the implicit sync
> > > that's going on elsewhere.
> > >
> > > So, clearly, we all need to go write piles of code that we can't
> > > actually properly test until everyone else has written their piece and
> > > then we use explicit sync if and only if all components support it.
> > > Really?  We're going to do multiple years of development and then just
> > > hope it works when we finally flip the switch?  That doesn't sound
> > > like a good plan to me.
> > >
> > >
> > > ## A proposal: Implicit and explicit sync together
> > >
> > > How to solve all these chicken-and-egg problems is something I've been
> > > giving quite a bit of thought (and talking with many others about) in
> > > the last couple of years.  One motivation for this is that we have to
> > > deal with a mismatch in Vulkan.  Another motivation is that I'm
> > > becoming increasingly unhappy with the way that synchronization,
> > > memory residency, and command submission are inherently intertwined in
> > > i915 and would like to break things apart.  Towards that end, I have
> > > an actual proposal.
> > >
> > > A couple weeks ago, I sent a series of patches to the dri-devel
> > > mailing list which adds a pair of new ioctls to dma-buf which allow
> > > userspace to manually import or export a sync_file from a dma-buf.
> > > The idea is that something like a Wayland compositor can switch to
> > > 100% explicit sync internally once the ioctl is available.  If it gets
> > > buffers in from a client that doesn't use the explicit sync extension,
> > > it can pull a sync_file from the dma-buf and use that exactly as it
> > > would a sync_file passed via the explicit sync extension.  When it
> > > goes to scan out a user buffer and discovers that KMS doesn't accept
> > > sync_files (or if it tries to use that pesky media encoder no one has
> > > converted), it can take its sync_file for display and stuff it into
> > > the dma-buf before handing it to KMS.
> > >
> > > Along with the kernel patches, I've also implemented support for this
> > > in the Vulkan WSI code used by ANV and RADV.  With those patches, the
> > > only requirement on the Vulkan drivers is that you be able to export
> > > any VkSemaphore as a sync_file and temporarily import a sync_file into
> > > any VkFence or VkSemaphore.  As long as that works, the core Vulkan
> > > driver only ever sees explicit synchronization via sync_file.  The WSI
> > > code uses these new ioctls to translate the implicit sync of X11 and
> > > Wayland to the explicit sync the Vulkan driver wants.
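
In Vulkan terms, that means VK_KHR_external_semaphore_fd (and its VkFence
counterpart) with SYNC_FD handles.  A minimal sketch of the two directions
(entry points would be fetched via vkGetDeviceProcAddr in real code; error
handling omitted):

#include <vulkan/vulkan.h>

/* Export: get a sync_file fd that signals when 'sem' gets signaled. */
static int semaphore_to_sync_file(VkDevice dev, VkSemaphore sem)
{
    const VkSemaphoreGetFdInfoKHR info = {
        .sType = VK_STRUCTURE_TYPE_SEMAPHORE_GET_FD_INFO_KHR,
        .semaphore = sem,
        .handleType = VK_EXTERNAL_SEMAPHORE_HANDLE_TYPE_SYNC_FD_BIT,
    };
    int fd = -1;

    vkGetSemaphoreFdKHR(dev, &info, &fd);
    return fd;
}

/* Import (temporary): the next wait on 'sem' waits for the sync_file. */
static void sync_file_to_semaphore(VkDevice dev, VkSemaphore sem, int fd)
{
    const VkImportSemaphoreFdInfoKHR info = {
        .sType = VK_STRUCTURE_TYPE_IMPORT_SEMAPHORE_FD_INFO_KHR,
        .semaphore = sem,
        .flags = VK_SEMAPHORE_IMPORT_TEMPORARY_BIT,
        .handleType = VK_EXTERNAL_SEMAPHORE_HANDLE_TYPE_SYNC_FD_BIT,
        .fd = fd,
    };

    vkImportSemaphoreFdKHR(dev, &info);
}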
> > >
> > > I'm hoping (and here's where I want a sanity check) that a simple API
> > > like this will allow us to finally start moving the Linux ecosystem
> > > over to explicit synchronization one piece at a time in a way that's
> > > actually correct.  (No Wayland explicit sync with compositors hoping
> > > KMS magically works even though it doesn't have a sync_file API.)
> > > Once some pieces in the ecosystem start moving, there will be
> > > motivation to start moving others and maybe we can actually build the
> > > momentum to get most everything converted.
> > >
> > > For reference, you can find the kernel RFC patches and mesa MR here:
> > >
> > > https://lists.freedesktop.org/archives/dri-devel/2020-March/258833.html
> > >
> > > https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/4037
> > >
> > > At this point, I welcome your thoughts, comments, objections, and
> > > maybe even help/review. :-)
> > >
> > > --Jason Ekstrand
_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: Plumbing explicit synchronization through the Linux ecosystem
  2020-03-17  3:37       ` Jason Ekstrand
@ 2020-03-17  7:53         ` Jonas Ådahl
  -1 siblings, 0 replies; 101+ messages in thread
From: Jonas Ådahl @ 2020-03-17  7:53 UTC (permalink / raw)
  To: Jason Ekstrand
  Cc: Roman Gilg, Daniel Vetter, xorg-devel,
	Maling list - DRI developers,
	wayland-devel @ lists . freedesktop . org,
	Discussion of the development of and with GStreamer, ML mesa-dev,
	linux-media

On Mon, Mar 16, 2020 at 10:37:04PM -0500, Jason Ekstrand wrote:
> On Mon, Mar 16, 2020 at 6:39 PM Roman Gilg <subdiff@gmail.com> wrote:
> >
> > On Wed, Mar 11, 2020 at 8:21 PM Jason Ekstrand <jason@jlekstrand.net> wrote:
> > >
> > > On Wed, Mar 11, 2020 at 12:31 PM Jason Ekstrand <jason@jlekstrand.net> wrote:
> > > >
> > > > All,
> > > >
> > > > Sorry for casting such a broad net with this one. I'm sure most people
> > > > who reply will get at least one mailing list rejection.  However, this
> > > > is an issue that affects a LOT of components and that's why it's
> > > > thorny to begin with.  Please pardon the length of this e-mail as
> > > > well; I promise there's a concrete point/proposal at the end.
> > > >
> > > >
> > > > Explicit synchronization is the future of graphics and media.  At
> > > > least, that seems to be the consensus among all the graphics people
> > > > I've talked to.  I had a chat with one of the lead Android graphics
> > > > engineers recently who told me that doing explicit sync from the start
> > > > was one of the best engineering decisions Android ever made.  It's
> > > > also the direction being taken by more modern APIs such as Vulkan.
> > > >
> > > >
> > > > ## What are implicit and explicit synchronization?
> > > >
> > > > For those that aren't familiar with this space, GPUs, media encoders,
> > > > etc. are massively parallel and synchronization of some form is
> > > > required to ensure that everything happens in the right order and
> > > > avoid data races.  Implicit synchronization is when bits of work (3D,
> > > > compute, video encode, etc.) are implicitly based on the absolute
> > > > CPU-time order in which API calls occur.  Explicit synchronization is
> > > > when the client (whatever that means in any given context) provides
> > > > the dependency graph explicitly via some sort of synchronization
> > > > primitives.  If you're still confused, consider the following
> > > > examples:
> > > >
> > > > With OpenGL and EGL, almost everything is implicit sync.  Say you have
> > > > two OpenGL contexts sharing an image where one writes to it and the
> > > > other textures from it.  The way the OpenGL spec works, the client has
> > > > to make the API calls to render to the image before (in CPU time) it
> > > > makes the API calls which texture from the image.  As long as it does
> > > > this (and maybe inserts a glFlush?), the driver will ensure that the
> > > > rendering completes before the texturing happens and you get correct
> > > > contents.
> > > >
> > > > Implicit synchronization can also happen across processes.  Wayland,
> > > > for instance, is currently built on implicit sync where the client
> > > > does their rendering and then does a hand-off (via wl_surface::commit)
> > > > to tell the compositor it's done at which point the compositor can now
> > > > texture from the surface.  The hand-off ensures that the client's
> > > > OpenGL API calls happen before the server's OpenGL API calls.
> > > >
> > > > A good example of explicit synchronization is the Vulkan API.  There,
> > > > a client (or multiple clients) can simultaneously build command
> > > > buffers in different threads where one of those command buffers
> > > > renders to an image and the other textures from it and then submit
> > > > both of them at the same time with instructions to the driver for
> > > > which order to execute them in.  The execution order is described via
> > > > the VkSemaphore primitive.  With the new VK_KHR_timeline_semaphore
> > > > extension, you can even submit the work which does the texturing
> > > > BEFORE the work which does the rendering and the driver will sort it
> > > > out.
> > > >
> > > > The #1 problem with implicit synchronization (which explicit solves)
> > > > is that it leads to a lot of over-synchronization both in client space
> > > > and in driver/device space.  The client has to synchronize a lot more
> > > > because it has to ensure that the API calls happen in a particular
> > > > order.  The driver/device have to synchronize a lot more because they
> > > > never know what is going to end up being a synchronization point as an
> > > > API call on another thread/process may occur at any time.  As we move
> > > > to more and more multi-threaded programming this synchronization (on
> > > > the client-side especially) becomes more and more painful.
> > > >
> > > >
> > > > ## Current status in Linux
> > > >
> > > > Implicit synchronization in Linux works via the kernel's internal
> > > > dma_buf and dma_fence data structures.  A dma_fence is a tiny object
> > > > which represents the "done" status for some bit of work.  Typically,
> > > > dma_fences are created as a by-product of someone submitting some bit
> > > > of work (say, 3D rendering) to the kernel.  The dma_buf object has a
> > > > set of dma_fences on it representing shared (read) and exclusive
> > > > (write) access to the object.  When work is submitted which, for
> > > > instance, renders to the dma_buf, it's queued waiting on all the fences
> > > > on the dma_buf, and a dma_fence is created representing the end of
> > > > said rendering work and installed as the dma_buf's exclusive
> > > > fence.  This way, the kernel can manage all its internal queues (3D
> > > > rendering, display, video encode, etc.) and know which things to
> > > > submit in what order.
> > > >
> > > > For the last few years, we've had sync_file in the kernel and it's
> > > > plumbed into some drivers.  A sync_file is just a wrapper around a
> > > > single dma_fence.  A sync_file is typically created as a by-product of
> > > > submitting work (3D, compute, etc.) to the kernel and is signaled when
> > > > that work completes.  When a sync_file is created, it is guaranteed by
> > > > the kernel that it will become signaled in finite time and, once it's
> > > > signaled, it remains signaled for the rest of time.  A sync_file is
> > > > represented in UAPIs as a file descriptor and can be used with normal
> > > > file APIs such as dup().  It can be passed into another UAPI which
> > > > does some bit of queued work and the submitted work will wait for the
> > > > sync_file to be triggered before executing.  A sync_file also supports
> > > > poll() if you want to wait on it manually.
> > > >
> > > > Unfortunately, sync_file is not broadly used and not all kernel GPU
> > > > drivers support it.  Here's a very quick overview of my understanding
> > > > of the status of various components (I don't know the status of
> > > > anything in the media world):
> > > >
> > > >  - Vulkan: Explicit synchronization all the way but we have to go
> > > > implicit as soon as we interact with a window system.  Vulkan has APIs
> > > > to import/export sync_files to/from its VkSemaphore and VkFence
> > > > synchronization primitives.
> > > >  - OpenGL: Implicit all the way.  There are some EGL extensions to
> > > > enable some forms of explicit sync via sync_file but OpenGL itself is
> > > > still implicit.
> > > >  - Wayland: Currently depends on implicit sync in the kernel (accessed
> > > > via EGL/OpenGL).  There is an unstable extension to allow passing
> > > > sync_files around but it's questionable how useful it is right now
> > > > (more on that later).
> > > >  - X11: With the Present extension, it has these "explicit" fence objects but
> > > > they're always a shmfence which lets the X server and client do a
> > > > userspace CPU-side hand-off without going over the socket (and
> > > > round-tripping through the kernel).  However, the only thing that
> > > > fence does is order the OpenGL API calls in the client and server and
> > > > the real synchronization is still implicit.
> > > >  - linux/i915/gem: Fully supports using sync_file or syncobj for explicit sync.
> > > >  - linux/amdgpu: Supports sync_file and syncobj but it still
> > > > implicitly syncs sometimes due to its internal memory residency
> > > > handling which can lead to over-synchronization.
> > > >  - KMS: Implicit sync all the way.  There are no KMS APIs which take
> > > > explicit sync primitives.
> > >
> > > Correction:  Apparently, I missed some things.  If you use atomic, KMS
> > > does have explicit in- and out-fences.  Non-atomic users (e.g. X11)
> > > are still in trouble, but most Wayland compositors use atomic these
> > > days.
> >
> > Hi Jason,
> >
> > thanks for pushing this forward and the comprehensive explanation on
> > what it is about.
> >
> > My question would be what exactly do you now need from Wayland compositor devs?
> > I understood a Wayland compositor needs to:
> > * do atomic page flips,
> > * support [1].
> 
> Yup, that's pretty much what's needed.  From the looks of
> https://gitlab.gnome.org/GNOME/mutter/issues/548, it appears that
> mutter is at least on its way to atomic, though it also looks like a
> long road.

FWIW, I expect atomic KMS to be used, when available in the driver, in
GNOME 3.38. We just released 3.36, so this means it would be the next
version.  Whether it makes sense to eventually backport to 3.36 we'll
see, but it wouldn't be impossible. The majority of the work has
already been done.


Jonas

> 
> > Is there something else? You described a mechanism to pull out and
> > push in these sync_files to dma-bufs depending on what the client
> > provides and what kind of output the compositor puts the final image
> > onto. That's for now just an idea (plus your wip implementation in
> > Vulkan/kernel) and there is not yet anything that can be done for this
> > specifically in Wayland compositors, or is there?
> 
> The kernel patches I proposed are mostly a way for explicit-sync APIs
> such as Vulkan to reasonably interop with the implicit sync world.
> Right now, for instance, we have no real plan for Vulkan to be able to
> handle synchronization when talking to a media encoder.  We have the
> modifiers and dma-buf stuff which deals with image layout but
> synchronization is currently an unsolved problem.  If we have a
> mechanism for sync_file import/export from dma-buf then an app can use
> implicit sync when talking to VAAPI (as an example) and turn that into
> sync_files to talk to Vulkan.  Better yet, we could plumb VAAPI for
> explicit sync but that's yet one more thing that needs to support
> explicit sync.
> 
> Make sense?
> 
> --Jason
> 
> 
> > Thanks
> > Roman
> >
> > [1] https://gitlab.freedesktop.org/wayland/wayland-protocols/blob/master/unstable/linux-explicit-synchronization/linux-explicit-synchronization-unstable-v1.xml
> >
> >
> > > >  - v4l: ???
> > > >  - gstreamer: ???
> > > >  - Media APIs such as vaapi etc.:  ???
> > > >
> > > >
> > > > ## Chicken and egg problems
> > > >
> > > > Ok, this is where it starts getting depressing.  I made the claim
> > > > above that Wayland has an explicit synchronization protocol that's of
> > > > questionable usefulness.  I would claim that basically any bit of
> > > > plumbing we do through window systems is currently of questionable
> > > > usefulness.  Why?
> > > >
> > > > From my perspective, as a Vulkan driver developer, I have to deal with
> > > > the fact that Vulkan is an explicit sync API but Wayland and X11
> > > > aren't.  Unfortunately, the Wayland extension solves zero problems for
> > > > me because I can't really use it unless it's implemented in all of the
> > > > compositors.  Until every Wayland compositor I care about my users
> > > > being able to use (which is basically all of them) supports the
> > > > extension, I have to continue to carry around my pile of hacks to keep
> > > > implicit sync and Vulkan working nicely together.
> > > >
> > > > From the perspective of a Wayland compositor (I used to play in this
> > > > space), they'd love to implement the new explicit sync extension but
> > > > can't.  Sure, they could wire up the extension, but the moment they go
> > > > to flip a client buffer to the screen directly, they discover that KMS
> > > > doesn't support any explicit sync APIs.
> > >
> > > As per the above correction, Wayland compositors aren't nearly as bad
> > > off as I initially thought.  There may still be weird screen capture
> > > cases but the normal cases of compositing and displaying via
> > > KMS/atomic should be in reasonably good shape.
> > >
> > > > So, yes, they can technically
> > > > implement the extension assuming the EGL stack they're running on has
> > > > the sync_file extensions but any client buffers which come in using
> > > > the explicit sync Wayland extension have to be composited and can't be
> > > > scanned out directly.  As a 3D driver developer, I absolutely don't
> > > > want compositors doing that because my users will complain about
> > > > performance issues due to the extra blit.
> > > >
> > > > Ok, so let's say we get KMS wired up with explicit sync.  That solves
> > > > all our problems, right?  It does, right up until someone decides that
> > > > they want to screen capture their Wayland session via some hardware
> > > > media encoder that doesn't support explicit sync.  Now we have to
> > > > plumb it all the way through the media stack, gstreamer, etc.  Great,
> > > > so let's do that!  Oh, but gstreamer won't want to plumb it through
> > > > until they're guaranteed that they can use explicit sync when
> > > > displaying on X11 or Wayland.  Are you seeing the problem?
> > > >
> > > > To make matters worse, since most things are doing implicit
> > > > synchronization today, it's really easy to get your explicit
> > > > synchronization wrong and never notice.  If you forget to pass a
> > > > sync_file into one place (say you never notice KMS doesn't support
> > > > them), it will probably work anyway thanks to all the implicit sync
> > > > that's going on elsewhere.
> > > >
> > > > So, clearly, we all need to go write piles of code that we can't
> > > > actually properly test until everyone else has written their piece and
> > > > then we use explicit sync if and only if all components support it.
> > > > Really?  We're going to do multiple years of development and then just
> > > > hope it works when we finally flip the switch?  That doesn't sound
> > > > like a good plan to me.
> > > >
> > > >
> > > > ## A proposal: Implicit and explicit sync together
> > > >
> > > > How to solve all these chicken-and-egg problems is something I've been
> > > > giving quite a bit of thought (and talking with many others about) in
> > > > the last couple of years.  One motivation for this is that we have to
> > > > deal with a mismatch in Vulkan.  Another motivation is that I'm
> > > > becoming increasingly unhappy with the way that synchronization,
> > > > memory residency, and command submission are inherently intertwined in
> > > > i915 and would like to break things apart.  Towards that end, I have
> > > > an actual proposal.
> > > >
> > > > A couple weeks ago, I sent a series of patches to the dri-devel
> > > > mailing list which adds a pair of new ioctls to dma-buf which allow
> > > > userspace to manually import or export a sync_file from a dma-buf.
> > > > The idea is that something like a Wayland compositor can switch to
> > > > 100% explicit sync internally once the ioctl is available.  If it gets
> > > > buffers in from a client that doesn't use the explicit sync extension,
> > > > it can pull a sync_file from the dma-buf and use that exactly as it
> > > > would a sync_file passed via the explicit sync extension.  When it
> > > > goes to scan out a user buffer and discovers that KMS doesn't accept
> > > > sync_files (or if it tries to use that pesky media encoder no one has
> > > > converted), it can take its sync_file for display and stuff it into
> > > > the dma-buf before handing it to KMS.
> > > >
> > > > Along with the kernel patches, I've also implemented support for this
> > > > in the Vulkan WSI code used by ANV and RADV.  With those patches, the
> > > > only requirement on the Vulkan drivers is that you be able to export
> > > > any VkSemaphore as a sync_file and temporarily import a sync_file into
> > > > any VkFence or VkSemaphore.  As long as that works, the core Vulkan
> > > > driver only ever sees explicit synchronization via sync_file.  The WSI
> > > > code uses these new ioctls to translate the implicit sync of X11 and
> > > > Wayland to the explicit sync the Vulkan driver wants.
> > > >
> > > > I'm hoping (and here's where I want a sanity check) that a simple API
> > > > like this will allow us to finally start moving the Linux ecosystem
> > > > over to explicit synchronization one piece at a time in a way that's
> > > > actually correct.  (No Wayland explicit sync with compositors hoping
> > > > KMS magically works even though it doesn't have a sync_file API.)
> > > > Once some pieces in the ecosystem start moving, there will be
> > > > motivation to start moving others and maybe we can actually build the
> > > > momentum to get most everything converted.
> > > >
> > > > For reference, you can find the kernel RFC patches and mesa MR here:
> > > >
> > > > https://lists.freedesktop.org/archives/dri-devel/2020-March/258833.html
> > > >
> > > > https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/4037
> > > >
> > > > At this point, I welcome your thoughts, comments, objections, and
> > > > maybe even help/review. :-)
> > > >
> > > > --Jason Ekstrand

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: Plumbing explicit synchronization through the Linux ecosystem
  2020-03-16 16:04               ` Jason Ekstrand
@ 2020-03-17  8:01                 ` Simon Ser
  -1 siblings, 0 replies; 101+ messages in thread
From: Simon Ser @ 2020-03-17  8:01 UTC (permalink / raw)
  To: Jason Ekstrand
  Cc: Tomek Bury, Daniel Vetter, xorg-devel,
	open list:DMA BUFFER SHARING FRAMEWORK,
	Maling list - DRI developers, Nicolas Dufresne,
	wayland-devel @ lists . freedesktop . org, Laurent Pinchart,
	Bas Nieuwenhuizen, ML mesa-dev, Daniel Stone, Dave Airlie,
	Discussion of the development of and with GStreamer

On Monday, March 16, 2020 5:04 PM, Jason Ekstrand <jason@jlekstrand.net> wrote:

> Hopefully, that will provide some motivation for other compositors
> (kwin, gnome-shell, etc.) because they now have a real user of it in
> an upstream driver for a major desktop platform and not just a few
> weston examples. However, someone is going to have to drive the
> actual development in those compositors. I'd be very happy if more
> people got involved, :-)

FWIW, a wlroots pull request is in progress [0]. The plan is first to
accept fence FDs from clients, then send them our fences as a second
step.

[0]: https://github.com/swaywm/wlroots/pull/2070
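
For context, the client side of that unstable protocol amounts to attaching
an acquire fence before commit and asking for a release object; a rough
sketch, assuming the usual wayland-scanner generated client bindings:

#include <wayland-client.h>
#include "linux-explicit-synchronization-unstable-v1-client-protocol.h"

/* Attach a buffer plus an acquire fence (a sync_file from the GPU driver)
 * and ask for a release object carrying the compositor's "done" fence. */
static void commit_with_fence(struct wl_surface *surface,
                              struct zwp_linux_surface_synchronization_v1 *sync,
                              struct wl_buffer *buffer, int acquire_fd)
{
    struct zwp_linux_buffer_release_v1 *release;

    wl_surface_attach(surface, buffer, 0, 0);

    /* The compositor must wait on this fence before reading the buffer. */
    zwp_linux_surface_synchronization_v1_set_acquire_fence(sync, acquire_fd);

    /* Answered with a fenced_release or immediate_release event once the
     * compositor is done with the buffer; add a listener on 'release'. */
    release = zwp_linux_surface_synchronization_v1_get_release(sync);
    (void)release;

    wl_surface_commit(surface);
}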

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: Plumbing explicit synchronization through the Linux ecosystem
  2020-03-16 18:33         ` Marek Olšák
@ 2020-03-17 10:01             ` Michel Dänzer
  0 siblings, 0 replies; 101+ messages in thread
From: Michel Dänzer @ 2020-03-17 10:01 UTC (permalink / raw)
  To: Marek Olšák
  Cc: Daniel Vetter, xorg-devel, Maling list - DRI developers,
	wayland-devel @ lists . freedesktop . org,
	Discussion of the development of and with GStreamer,
	Jason Ekstrand, ML mesa-dev, linux-media

On 2020-03-16 7:33 p.m., Marek Olšák wrote:
> On Mon, Mar 16, 2020 at 5:57 AM Michel Dänzer <michel@daenzer.net> wrote:
>> On 2020-03-16 4:50 a.m., Marek Olšák wrote:
>>> The synchronization works because the Mesa driver waits for idle (drains
>>> the GFX pipeline) at the end of command buffers and there is only 1
>>> graphics queue, so everything is ordered.
>>>
>>> The GFX pipeline runs asynchronously to the command buffer, meaning the
>>> command buffer only starts draws and doesn't wait for completion. If the
>>> Mesa driver didn't wait at the end of the command buffer, the command
>>> buffer would finish and a different process could start execution of its
>>> own command buffer while shaders of the previous process are still
>> running.
>>>
>>> If the Mesa driver submits a command buffer internally (because it's
>>> full),
>>> it doesn't wait, so the GFX pipeline doesn't notice that a command buffer
>>> ended and a new one started.
>>>
>>> The waiting at the end of command buffers happens only when the flush is
>>> external (Swap buffers, glFlush).
>>>
>>> It's a performance problem, because the GFX queue is blocked until the
>>> GFX
>>> pipeline is drained at the end of every frame at least.
>>>
>>> So explicit fences for SwapBuffers would help.
>>
>> Not sure what difference it would make, since the same thing needs to be
>> done for explicit fences as well, doesn't it?
> 
> No. Explicit fences don't require userspace to wait for idle in the command
> buffer. Fences are signalled when the last draw is complete and caches are
> flushed. Before that happens, any command buffer that is not dependent on
> the fence can start execution. There is never a need for the GPU to be idle
> if there is enough independent work to do.

I don't think explicit fences in the context of this discussion imply
using that different fence signalling mechanism though. My understanding
is that the API proposed by Jason allows implicit fences to be used as
explicit ones and vice versa, so presumably they have to use the same
signalling mechanism.


Anyway, maybe the different fence signalling mechanism you describe
could be used by the amdgpu kernel driver in general, then Mesa could
drop the waits for idle and get the benefits with implicit sync as well?


-- 
Earthling Michel Dänzer               |               https://redhat.com
Libre software enthusiast             |             Mesa and X developer

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: Plumbing explicit synchronization through the Linux ecosystem
  2020-03-17  8:01                 ` Simon Ser
@ 2020-03-17 14:38                   ` Jason Ekstrand
  -1 siblings, 0 replies; 101+ messages in thread
From: Jason Ekstrand @ 2020-03-17 14:38 UTC (permalink / raw)
  To: Simon Ser
  Cc: Tomek Bury, Daniel Vetter, xorg-devel,
	open list:DMA BUFFER SHARING FRAMEWORK,
	Maling list - DRI developers, Nicolas Dufresne,
	wayland-devel @ lists . freedesktop . org, Laurent Pinchart,
	Bas Nieuwenhuizen, ML mesa-dev, Daniel Stone, Dave Airlie,
	Discussion of the development of and with GStreamer

On Tue, Mar 17, 2020 at 3:01 AM Simon Ser <contact@emersion.fr> wrote:
>
> On Monday, March 16, 2020 5:04 PM, Jason Ekstrand <jason@jlekstrand.net> wrote:
>
> > Hopefully, that will provide some motivation for other compositors
> > (kwin, gnome-shell, etc.) because they now have a real user of it in
> > an upstream driver for a major desktop platform and not just a few
> > weston examples. However, someone is going to have to drive the
> > actual development in those compositors. I'd be very happy if more
> > people got involved, :-)
>
> FWIW, a wlroots pull request is in progress [0]. The plan is first to
> accept fence FDs from clients, then send them our fences as a second
> step.

What exactly are the semantics there?  Are you going to somehow wait
inside wlroots for the buffer to be 100% idle, or are you expecting the
client to somehow use explicit sync for sending buffers but implicit
sync to wait for idle?  If it's the latter, that's not going to work.

--Jason

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: Plumbing explicit synchronization through the Linux ecosystem
  2020-03-16 21:15           ` Laurent Pinchart
@ 2020-03-17 15:33             ` Nicolas Dufresne
  -1 siblings, 0 replies; 101+ messages in thread
From: Nicolas Dufresne @ 2020-03-17 15:33 UTC (permalink / raw)
  To: Laurent Pinchart, Jason Ekstrand
  Cc: ML mesa-dev, Discussion of the development of and with GStreamer,
	wayland-devel @ lists . freedesktop . org, xorg-devel,
	Maling list - DRI developers, linux-media, Dave Airlie,
	Daniel Vetter, Bas Nieuwenhuizen, Daniel Stone

Le lundi 16 mars 2020 à 23:15 +0200, Laurent Pinchart a écrit :
> Hi Jason,
> 
> On Mon, Mar 16, 2020 at 10:06:07AM -0500, Jason Ekstrand wrote:
> > On Mon, Mar 16, 2020 at 5:20 AM Laurent Pinchart wrote:
> > > On Wed, Mar 11, 2020 at 04:18:55PM -0400, Nicolas Dufresne wrote:
> > > > (I know I'm going to be spammed by so many mailing list ...)
> > > > 
> > > > Le mercredi 11 mars 2020 à 14:21 -0500, Jason Ekstrand a écrit :
> > > > > On Wed, Mar 11, 2020 at 12:31 PM Jason Ekstrand <jason@jlekstrand.net> wrote:
> > > > > > All,
> > > > > > 
> > > > > > Sorry for casting such a broad net with this one. I'm sure most people
> > > > > > who reply will get at least one mailing list rejection.  However, this
> > > > > > is an issue that affects a LOT of components and that's why it's
> > > > > > thorny to begin with.  Please pardon the length of this e-mail as
> > > > > > well; I promise there's a concrete point/proposal at the end.
> > > > > > 
> > > > > > 
> > > > > > Explicit synchronization is the future of graphics and media.  At
> > > > > > least, that seems to be the consensus among all the graphics people
> > > > > > I've talked to.  I had a chat with one of the lead Android graphics
> > > > > > engineers recently who told me that doing explicit sync from the start
> > > > > > was one of the best engineering decisions Android ever made.  It's
> > > > > > also the direction being taken by more modern APIs such as Vulkan.
> > > > > > 
> > > > > > 
> > > > > > ## What are implicit and explicit synchronization?
> > > > > > 
> > > > > > For those that aren't familiar with this space, GPUs, media encoders,
> > > > > > etc. are massively parallel and synchronization of some form is
> > > > > > required to ensure that everything happens in the right order and
> > > > > > avoid data races.  Implicit synchronization is when bits of work (3D,
> > > > > > compute, video encode, etc.) are implicitly based on the absolute
> > > > > > CPU-time order in which API calls occur.  Explicit synchronization is
> > > > > > when the client (whatever that means in any given context) provides
> > > > > > the dependency graph explicitly via some sort of synchronization
> > > > > > primitives.  If you're still confused, consider the following
> > > > > > examples:
> > > > > > 
> > > > > > With OpenGL and EGL, almost everything is implicit sync.  Say you have
> > > > > > two OpenGL contexts sharing an image where one writes to it and the
> > > > > > other textures from it.  The way the OpenGL spec works, the client has
> > > > > > to make the API calls to render to the image before (in CPU time) it
> > > > > > makes the API calls which texture from the image.  As long as it does
> > > > > > this (and maybe inserts a glFlush?), the driver will ensure that the
> > > > > > rendering completes before the texturing happens and you get correct
> > > > > > contents.
> > > > > > 
> > > > > > Implicit synchronization can also happen across processes.  Wayland,
> > > > > > for instance, is currently built on implicit sync where the client
> > > > > > does their rendering and then does a hand-off (via wl_surface::commit)
> > > > > > to tell the compositor it's done at which point the compositor can now
> > > > > > texture from the surface.  The hand-off ensures that the client's
> > > > > > OpenGL API calls happen before the server's OpenGL API calls.
> > > > > > 
> > > > > > A good example of explicit synchronization is the Vulkan API.  There,
> > > > > > a client (or multiple clients) can simultaneously build command
> > > > > > buffers in different threads where one of those command buffers
> > > > > > renders to an image and the other textures from it and then submit
> > > > > > both of them at the same time with instructions to the driver for
> > > > > > which order to execute them in.  The execution order is described via
> > > > > > the VkSemaphore primitive.  With the new VK_KHR_timeline_semaphore
> > > > > > extension, you can even submit the work which does the texturing
> > > > > > BEFORE the work which does the rendering and the driver will sort it
> > > > > > out.
> > > > > > 
> > > > > > The #1 problem with implicit synchronization (which explicit solves)
> > > > > > is that it leads to a lot of over-synchronization both in client space
> > > > > > and in driver/device space.  The client has to synchronize a lot more
> > > > > > because it has to ensure that the API calls happen in a particular
> > > > > > order.  The driver/device have to synchronize a lot more because they
> > > > > > never know what is going to end up being a synchronization point as an
> > > > > > API call on another thread/process may occur at any time.  As we move
> > > > > > to more and more multi-threaded programming this synchronization (on
> > > > > > the client-side especially) becomes more and more painful.
> > > > > > 
> > > > > > 
> > > > > > ## Current status in Linux
> > > > > > 
> > > > > > Implicit synchronization in Linux works via a the kernel's internal
> > > > > > dma_buf and dma_fence data structures.  A dma_fence is a tiny object
> > > > > > which represents the "done" status for some bit of work.  Typically,
> > > > > > dma_fences are created as a by-product of someone submitting some bit
> > > > > > of work (say, 3D rendering) to the kernel.  The dma_buf object has a
> > > > > > set of dma_fences on it representing shared (read) and exclusive
> > > > > > (write) access to the object.  When work is submitted which, for
> > > > > > instance renders to the dma_buf, it's queued waiting on all the fences
> > > > > > on the dma_buf and and a dma_fence is created representing the end of
> > > > > > said rendering work and it's installed as the dma_buf's exclusive
> > > > > > fence.  This way, the kernel can manage all its internal queues (3D
> > > > > > rendering, display, video encode, etc.) and know which things to
> > > > > > submit in what order.
> > > > > > 
> > > > > > For the last few years, we've had sync_file in the kernel and it's
> > > > > > plumbed into some drivers.  A sync_file is just a wrapper around a
> > > > > > single dma_fence.  A sync_file is typically created as a by-product of
> > > > > > submitting work (3D, compute, etc.) to the kernel and is signaled when
> > > > > > that work completes.  When a sync_file is created, it is guaranteed by
> > > > > > the kernel that it will become signaled in finite time and, once it's
> > > > > > signaled, it remains signaled for the rest of time.  A sync_file is
> > > > > > represented in UAPIs as a file descriptor and can be used with normal
> > > > > > file APIs such as dup().  It can be passed into another UAPI which
> > > > > > does some bit of queue'd work and the submitted work will wait for the
> > > > > > sync_file to be triggered before executing.  A sync_file also supports
> > > > > > poll() if  you want to wait on it manually.
> > > > > > 
> > > > > > Unfortunately, sync_file is not broadly used and not all kernel GPU
> > > > > > drivers support it.  Here's a very quick overview of my understanding
> > > > > > of the status of various components (I don't know the status of
> > > > > > anything in the media world):
> > > > > > 
> > > > > >  - Vulkan: Explicit synchronization all the way but we have to go
> > > > > > implicit as soon as we interact with a window-system.  Vulkan has APIs
> > > > > > to import/export sync_files to/from it's VkSemaphore and VkFence
> > > > > > synchronization primitives.
> > > > > >  - OpenGL: Implicit all the way.  There are some EGL extensions to
> > > > > > enable some forms of explicit sync via sync_file but OpenGL itself is
> > > > > > still implicit.
> > > > > >  - Wayland: Currently depends on implicit sync in the kernel (accessed
> > > > > > via EGL/OpenGL).  There is an unstable extension to allow passing
> > > > > > sync_files around but it's questionable how useful it is right now
> > > > > > (more on that later).
> > > > > >  - X11: With present, it has these "explicit" fence objects but
> > > > > > they're always a shmfence which lets the X server and client do a
> > > > > > userspace CPU-side hand-off without going over the socket (and
> > > > > > round-tripping through the kernel).  However, the only thing that
> > > > > > fence does is order the OpenGL API calls in the client and server and
> > > > > > the real synchronization is still implicit.
> > > > > >  - linux/i915/gem: Fully supports using sync_file or syncobj for explicit
> > > > > > sync.
> > > > > >  - linux/amdgpu: Supports sync_file and syncobj but it still
> > > > > > implicitly syncs sometimes due to it's internal memory residency
> > > > > > handling which can lead to over-synchronization.
> > > > > >  - KMS: Implicit sync all the way.  There are no KMS APIs which take
> > > > > > explicit sync primitives.
> > > > > 
> > > > > Correction:  Apparently, I missed some things.  If you use atomic, KMS
> > > > > does have explicit in- and out-fences.  Non-atomic users (e.g. X11)
> > > > > are still in trouble but most Wayland compositors use atomic these
> > > > > days
> > > > > 
> > > > > >  - v4l: ???
> > > > > >  - gstreamer: ???
> > > > > >  - Media APIs such as vaapi etc.:  ???
> > > > 
> > > > GStreamer is a consumer for V4L2, VAAPI and other stuff. Using asynchronous buffer
> > > > synchronisation is something we already do with GL (even if limited). We place a
> > > > GLSync object in the pipeline and attach it to the related GstBuffer. We wait on
> > > > these GLSync as late as possible (or supersede the sync if we queue more work
> > > > into the same GL context). That requires a special mode of operation of course.
> > > > We don't usually like making lazy blocking calls implicit, as it tends to cause
> > > > random issues. If we need to wait, we think it's better to wait in the module
> > > > that is responsible, so in general, we try to negotiate and fall back locally
> > > > (it's plugin based, so this can be really messy otherwise).
> > > > 
> > > > So basically this problem needs to be solved in V4L2, VAAPI and other lower
> > > > level APIs first. We need an API that provides us these fences (in or out), and
> > > > then we can consider using them. For V4L2, there was an attempt, but it was a bit
> > > > of a misfit. Your proposal could work, it needs to be tested I guess, but it does
> > > > not solve some of the other issues that were discussed. Notably for camera
> > > > capture, where the HW timestamp is captured at about the same time the frame is
> > > > ready. But the timestamp is not part of the payload, so you need an entire API to
> > > > asynchronously deliver that metadata. That's the biggest pain point I've found;
> > > > such an API would be quite invasive or, if made really generic, might just never
> > > > be adopted widely enough.
> > > 
> > > Another issue is that V4L2 doesn't offer any guarantee on job ordering.
> > > When you queue multiple buffers for camera capture for instance, you
> > > don't know until capture complete in which buffer the frame has been
> > > captured.
> > 
> > Is this a Kernel UAPI issue?  Surely the kernel driver knows at the
> > start of frame capture which buffer it's getting written into.  I
> > would think that the kernel APIs could be adjusted (if we find good
> > reason to do so!) such that they return earlier and return a (buffer,
> > fence) pair.  Am I missing something fundamental about video here?
> 
> For cameras I believe we could do that, yes. I was pointing out the
> issues caused by the current API. For video decoders I'll let Nicolas
> answer the question, he's way more knowledgeable than I am on that
> topic.

Right now, there is simply no uAPI for supporting asynchronous error
reporting when fences are involved. That is true for both cameras and
CODECs. It's likely what all the previous attempts were missing; I
don't know enough myself to suggest something.

Now, why stateful video decoders are special is another subject. In
CODECs, the decoding and the presentation order may differ. For the
stateful kind of CODEC, a bitstream is passed to the HW. We don't know
if this bitstream is fully valid, since it is being parsed and
validated by the firmware. It's also the firmware's job to decide which
buffer should be presented first.

In most firmware interfaces, that information is communicated back all
at once when the frame is ready to be presented (which may be quite
some time after it was decoded). So indeed, a fence model is not really
easy to add unless the firmware was designed with that model in mind.

Nothing of course would prevent the V4L2 framework from generically
handling out-fences from other producers. It does not even handle
implicit fences at the moment, which is already quite problematic (I've
seen glitches on i.MX6/8 and Raspberry Pi 4).

In that specific case, if the fences from the etnaviv and vc graphics
drivers were exposed, we could solve this issue in userspace. Right now
it's implicit, so we rely on all DMABuf drivers to have proper support,
which is not the case. There is V4L2 support for that coming, but the
wait is done synchronously in a userspace call that was normally
non-blocking. So that is unlikely to fly.

Small note, stateless video decoders don't have this issue. The
bitstream is validated by userspace, and userspace controls the
"decode" operation. This one would be a good case for bidirectional
fencing.
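
To make the "exposed to userspace" part concrete, here's a minimal
sketch of pulling the fence out of a dma-buf. It assumes the dma-buf
export ioctl from Jason's proposal, spelled the way it appears in the
<linux/dma-buf.h> uAPI; the exact names and struct are an assumption,
not something GStreamer uses today:

    #include <sys/ioctl.h>
    #include <linux/dma-buf.h>

    /* Pull the implicit fence out of a dma-buf as a sync_file FD so the
     * wait can be done (or delegated) explicitly.  DMA_BUF_SYNC_READ asks
     * for a fence covering all pending writes to the buffer. */
    static int dmabuf_export_read_fence(int dmabuf_fd)
    {
            struct dma_buf_export_sync_file arg = {
                    .flags = DMA_BUF_SYNC_READ,
                    .fd = -1,
            };

            if (ioctl(dmabuf_fd, DMA_BUF_IOCTL_EXPORT_SYNC_FILE, &arg) < 0)
                    return -1;

            return arg.fd;  /* sync_file FD; poll()-able, close() when done */
    }

With something like that, an element receiving a DMABuf could fetch the
fence and delegate the wait, instead of relying on every driver to
honour the implicit fence.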

> 
> > I must admit that V4L is a bit of an odd case since the kernel driver
> > is the producer and not the consumer.
> 
> Note that V4L2 can be a consumer too. Video output with V4L2 is less
> frequent than video capture (but it still exists), and codecs and other
> memory-to-memory processing devices (colorspace converters, scalers,
> ...) are both consumers and producers.
> 
> > > In the normal case buffers are processed in sequence, but if
> > > an error occurs during capture, they can be recycled internally and put
> > > to the back of the queue.
> > 
> > Are those errors something that can happen at any time in the middle
> > of a frame capture?  If so, that does make things stickier.
> 
> Yes it can. Think of packet loss when capturing from a USB webcam for
> instance. 
> 
> > > Unless I'm mistaken, this problem also exists
> > > with stateful codecs. And if you don't know in advance which buffer you
> > > will receive from the device, the usefulness of fences becomes very
> > > questionable :-)
> > 
> > Yeah, if you really are in a situation where there's no way to know
> > until the full frame capture has been completed which buffer is next,
> > then fences are useless.  You aren't in an implicit synchronization
> > setting either; you're in a "full flush" setting.  It's arguably worse
> > for performance but perhaps unavoidable?
> 
> Probably unavoidable in some cases, but nothing that should get in the
> way for the discussion at hand: there's no need to migrate away from
> implicit sync when there's no implicit sync in the first place :-)
> 
> I think we need to analyse the use cases here, and figure out at least
> guidelines for userspace, otherwise applications will wonder what
> behaviour to implement, and we'll end up with a wide variety of them.
> Even just on the kernel side, some V4L2 capture drivers will pass
> erroneous frames to userspace (thus guaranteeing ordering, but without
> early notification of errors), some will requeue the frame
> automatically, and at least one (uvcvideo) has a module parameter to
> pick the desired behaviour.

Also, from a userspace point of view, the synchronization with the
"next frame" in V4L2 isn't implicit. We can poll() the device, just
like we'd do with a fence FD. What the explicit fence gives us is a
unified object we can pass to another driver, or to other userspace, so
we can delegate the wait.
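
As a minimal sketch (assuming nothing more than a sync_file FD from
some producer), delegating that wait ourselves is just a poll():

    #include <errno.h>
    #include <poll.h>

    /* Wait for a fence FD (sync_file) to signal; same pattern as waiting
     * for the next frame on a V4L2 device FD. */
    static int wait_fence_fd(int fence_fd, int timeout_ms)
    {
            struct pollfd pfd = { .fd = fence_fd, .events = POLLIN };
            int ret;

            do {
                    ret = poll(&pfd, 1, timeout_ms);
            } while (ret < 0 && errno == EINTR);

            if (ret < 0)
                    return -errno;

            return ret == 0 ? -ETIME : 0;  /* -ETIME: not signalled in time */
    }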

You refer to performance in a few places. In streaming, this is often
measured as real-time throughput. Implicit/explicit fences don't really
play any role for us in this regard. V4L2 drivers, like m2m drivers,
work with buffer queues. So you can queue many buffers in advance on
the OUTPUT device side (which is the input of the m2m), and userspace
will queue in advance pretty much all free buffers available on the
CAPTURE side. The driver is never starved in that model, at the cost of
very large memory consumption of course. Maybe a more visual
representation would be:

  [pending job] -> [M2M Worker] -> [pending results]

So as long as userspace keeps the pending job queue non-empty, and
keeps consuming and handing buffers back to write the results into, the
driver will keep running uninterrupted. Performance remains optimal.
What isn't optimal is the latency. And what is buggy right now is when
a DMABuf carrying an implicit out fence is put back into the pending
results queue, since the fence is ignored.
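
As a rough sketch of that model (single-planar queues, MMAP buffers,
payload and bookkeeping details omitted; illustrative only, not how any
particular element is written), the "keep both queues full" part looks
like:

    #include <string.h>
    #include <sys/ioctl.h>
    #include <linux/videodev2.h>

    /* Queue one more job on the OUTPUT (to-device) queue and hand one
     * more free buffer to the CAPTURE (from-device) queue, so the m2m
     * worker is never starved.  Payload setup (bytesused, etc.) is
     * omitted. */
    static int m2m_keep_busy(int fd, unsigned int out_index,
                             unsigned int cap_index)
    {
            struct v4l2_buffer out;
            struct v4l2_buffer cap;

            memset(&out, 0, sizeof(out));
            out.type = V4L2_BUF_TYPE_VIDEO_OUTPUT;
            out.memory = V4L2_MEMORY_MMAP;
            out.index = out_index;

            memset(&cap, 0, sizeof(cap));
            cap.type = V4L2_BUF_TYPE_VIDEO_CAPTURE;
            cap.memory = V4L2_MEMORY_MMAP;
            cap.index = cap_index;

            if (ioctl(fd, VIDIOC_QBUF, &out) < 0)   /* pending job */
                    return -1;
            if (ioctl(fd, VIDIOC_QBUF, &cap) < 0)   /* room for a result */
                    return -1;

            return 0;
    }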

> 
> > Trying to understand. :-)
> 
> So am I :-)

Hehe, same here.

> 
> > > > There are other elements that would implement fencing, notably kmssink, but no
> > > > one actually dared porting it to atomic KMS, so clearly there is very little
> > > > community interest. glimagesink could clearly benefit. Right now if we import a
> > > > DMABuf, and that DMABuf is used for rendering, an implicit fence is attached,
> > > > of which we are unaware. Philipp Zabel is working on a patch so that V4L2 QBUF
> > > > would wait, but waiting in QBUF is not allowed if O_NONBLOCK was set (which
> > > > GStreamer uses), so the operation will just fail where it worked before (breaking
> > > > userspace). If it was an explicit fence, we could handle that in GStreamer
> > > > cleanly as we do for new APIs.
> > > > 
> > > > > > ## Chicken and egg problems
> > > > > > 
> > > > > > Ok, this is where it starts getting depressing.  I made the claim
> > > > > > above that Wayland has an explicit synchronization protocol that's of
> > > > > > questionable usefulness.  I would claim that basically any bit of
> > > > > > plumbing we do through window systems is currently of questionable
> > > > > > usefulness.  Why?
> > > > > > 
> > > > > > From my perspective, as a Vulkan driver developer, I have to deal with
> > > > > > the fact that Vulkan is an explicit sync API but Wayland and X11
> > > > > > aren't.  Unfortunately, the Wayland extension solves zero problems for
> > > > > > me because I can't really use it unless it's implemented in all of the
> > > > > > compositors.  Until every Wayland compositor I care about my users
> > > > > > being able to use (which is basically all of them) supports the
> > > > > > extension, I have to continue carry around my pile of hacks to keep
> > > > > > implicit sync and Vulkan working nicely together.
> > > > > > 
> > > > > > From the perspective of a Wayland compositor (I used to play in this
> > > > > > space), they'd love to implement the new explicit sync extension but
> > > > > > can't.  Sure, they could wire up the extension, but the moment they go
> > > > > > to flip a client buffer to the screen directly, they discover that KMS
> > > > > > doesn't support any explicit sync APIs.
> > > > > 
> > > > > As per the above correction, Wayland compositors aren't nearly as bad
> > > > > off as I initially thought.  There may still be weird screen capture
> > > > > cases but the normal cases of compositing and displaying via
> > > > > KMS/atomic should be in reasonably good shape.
> > > > > 
> > > > > > So, yes, they can technically
> > > > > > implement the extension assuming the EGL stack they're running on has
> > > > > > the sync_file extensions but any client buffers which come in using
> > > > > > the explicit sync Wayland extension have to be composited and can't be
> > > > > > scanned out directly.  As a 3D driver developer, I absolutely don't
> > > > > > want compositors doing that because my users will complain about
> > > > > > performance issues due to the extra blit.
> > > > > > 
> > > > > > Ok, so let's say we get KMS wired up with implicit sync.  That solves
> > > > > > all our problems, right?  It does, right up until someone decides that
> > > > > > they wan to screen capture their Wayland session via some hardware
> > > > > > media encoder that doesn't support explicit sync.  Now we have to
> > > > > > plumb it all the way through the media stack, gstreamer, etc.  Great,
> > > > > > so let's do that!  Oh, but gstreamer won't want to plumb it through
> > > > > > until they're guaranteed that they can use explicit sync when
> > > > > > displaying on X11 or Wayland.  Are you seeing the problem?
> > > > > > 
> > > > > > To make matters worse, since most things are doing implicit
> > > > > > synchronization today, it's really easy to get your explicit
> > > > > > synchronization wrong and never notice.  If you forget to pass a
> > > > > > sync_file into one place (say you never notice KMS doesn't support
> > > > > > them), it will probably work anyway thanks to all the implicit sync
> > > > > > that's going on elsewhere.
> > > > > > 
> > > > > > So, clearly, we all need to go write piles of code that we can't
> > > > > > actually properly test until everyone else has written their piece and
> > > > > > then we use explicit sync if and only if all components support it.
> > > > > > Really?  We're going to do multiple years of development and then just
> > > > > > hope it works when we finally flip the switch?  That doesn't sound
> > > > > > like a good plan to me.
> > > > > > 
> > > > > > 
> > > > > > ## A proposal: Implicit and explicit sync together
> > > > > > 
> > > > > > How to solve all these chicken-and-egg problems is something I've been
> > > > > > giving quite a bit of thought (and talking with many others about) in
> > > > > > the last couple of years.  One motivation for this is that we have to
> > > > > > deal with a mismatch in Vulkan.  Another motivation is that I'm
> > > > > > becoming increasingly unhappy with the way that synchronization,
> > > > > > memory residency, and command submission are inherently intertwined in
> > > > > > i915 and would like to break things apart.  Towards that end, I have
> > > > > > an actual proposal.
> > > > > > 
> > > > > > A couple weeks ago, I sent a series of patches to the dri-devel
> > > > > > mailing list which adds a pair of new ioctls to dma-buf which allow
> > > > > > userspace to manually import or export a sync_file from a dma-buf.
> > > > > > The idea is that something like a Wayland compositor can switch to
> > > > > > 100% explicit sync internally once the ioctl is available.  If it gets
> > > > > > buffers in from a client that doesn't use the explicit sync extension,
> > > > > > it can pull a sync_file from the dma-buf and use that exactly as it
> > > > > > would a sync_file passed via the explicit sync extension.  When it
> > > > > > goes to scan out a user buffer and discovers that KMS doesn't accept
> > > > > > sync_files (or if it tries to use that pesky media encoder no one has
> > > > > > converted), it can take it's sync_file for display and stuff it into
> > > > > > the dma-buf before handing it to KMS.
> > > > > > 
> > > > > > Along with the kernel patches, I've also implemented support for this
> > > > > > in the Vulkan WSI code used by ANV and RADV.  With those patches, the
> > > > > > only requirement on the Vulkan drivers is that you be able to export
> > > > > > any VkSemaphore as a sync_file and temporarily import a sync_file into
> > > > > > any VkFence or VkSemaphore.  As long as that works, the core Vulkan
> > > > > > driver only ever sees explicit synchronization via sync_file.  The WSI
> > > > > > code uses these new ioctls to translate the implicit sync of X11 and
> > > > > > Wayland to the explicit sync the Vulkan driver wants.
> > > > > > 
> > > > > > I'm hoping (and here's where I want a sanity check) that a simple API
> > > > > > like this will allow us to finally start moving the Linux ecosystem
> > > > > > over to explicit synchronization one piece at a time in a way that's
> > > > > > actually correct.  (No Wayland explicit sync with compositors hoping
> > > > > > KMS magically works even though it doesn't have a sync_file API.)
> > > > > > Once some pieces in the ecosystem start moving, there will be
> > > > > > motivation to start moving others and maybe we can actually build the
> > > > > > momentum to get most everything converted.
> > > > > > 
> > > > > > For reference, you can find the kernel RFC patches and mesa MR here:
> > > > > > 
> > > > > > https://lists.freedesktop.org/archives/dri-devel/2020-March/258833.html
> > > > > > 
> > > > > > https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/4037
> > > > > > 
> > > > > > At this point, I welcome your thoughts, comments, objections, and
> > > > > > maybe even help/review. :-)


^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: Plumbing explicit synchronization through the Linux ecosystem
  2020-03-17 15:33             ` Nicolas Dufresne
@ 2020-03-17 16:27               ` Jason Ekstrand
  -1 siblings, 0 replies; 101+ messages in thread
From: Jason Ekstrand @ 2020-03-17 16:27 UTC (permalink / raw)
  To: Nicolas Dufresne
  Cc: Laurent Pinchart, ML mesa-dev,
	Discussion of the development of and with GStreamer,
	wayland-devel @ lists . freedesktop . org, xorg-devel,
	Maling list - DRI developers, linux-media, Dave Airlie,
	Daniel Vetter, Bas Nieuwenhuizen, Daniel Stone

On Tue, Mar 17, 2020 at 10:33 AM Nicolas Dufresne <nicolas@ndufresne.ca> wrote:
>
> Le lundi 16 mars 2020 à 23:15 +0200, Laurent Pinchart a écrit :
> > Hi Jason,
> >
> > On Mon, Mar 16, 2020 at 10:06:07AM -0500, Jason Ekstrand wrote:
> > > On Mon, Mar 16, 2020 at 5:20 AM Laurent Pinchart wrote:
> > > > On Wed, Mar 11, 2020 at 04:18:55PM -0400, Nicolas Dufresne wrote:
> > > > > (I know I'm going to be spammed by so many mailing list ...)
> > > > >
> > > > > Le mercredi 11 mars 2020 à 14:21 -0500, Jason Ekstrand a écrit :
> > > > > > On Wed, Mar 11, 2020 at 12:31 PM Jason Ekstrand <jason@jlekstrand.net> wrote:
> > > > > > > All,
> > > > > > >
> > > > > > > Sorry for casting such a broad net with this one. I'm sure most people
> > > > > > > who reply will get at least one mailing list rejection.  However, this
> > > > > > > is an issue that affects a LOT of components and that's why it's
> > > > > > > thorny to begin with.  Please pardon the length of this e-mail as
> > > > > > > well; I promise there's a concrete point/proposal at the end.
> > > > > > >
> > > > > > >
> > > > > > > Explicit synchronization is the future of graphics and media.  At
> > > > > > > least, that seems to be the consensus among all the graphics people
> > > > > > > I've talked to.  I had a chat with one of the lead Android graphics
> > > > > > > engineers recently who told me that doing explicit sync from the start
> > > > > > > was one of the best engineering decisions Android ever made.  It's
> > > > > > > also the direction being taken by more modern APIs such as Vulkan.
> > > > > > >
> > > > > > >
> > > > > > > ## What are implicit and explicit synchronization?
> > > > > > >
> > > > > > > For those that aren't familiar with this space, GPUs, media encoders,
> > > > > > > etc. are massively parallel and synchronization of some form is
> > > > > > > required to ensure that everything happens in the right order and
> > > > > > > avoid data races.  Implicit synchronization is when bits of work (3D,
> > > > > > > compute, video encode, etc.) are implicitly based on the absolute
> > > > > > > CPU-time order in which API calls occur.  Explicit synchronization is
> > > > > > > when the client (whatever that means in any given context) provides
> > > > > > > the dependency graph explicitly via some sort of synchronization
> > > > > > > primitives.  If you're still confused, consider the following
> > > > > > > examples:
> > > > > > >
> > > > > > > With OpenGL and EGL, almost everything is implicit sync.  Say you have
> > > > > > > two OpenGL contexts sharing an image where one writes to it and the
> > > > > > > other textures from it.  The way the OpenGL spec works, the client has
> > > > > > > to make the API calls to render to the image before (in CPU time) it
> > > > > > > makes the API calls which texture from the image.  As long as it does
> > > > > > > this (and maybe inserts a glFlush?), the driver will ensure that the
> > > > > > > rendering completes before the texturing happens and you get correct
> > > > > > > contents.
> > > > > > >
> > > > > > > Implicit synchronization can also happen across processes.  Wayland,
> > > > > > > for instance, is currently built on implicit sync where the client
> > > > > > > does their rendering and then does a hand-off (via wl_surface::commit)
> > > > > > > to tell the compositor it's done at which point the compositor can now
> > > > > > > texture from the surface.  The hand-off ensures that the client's
> > > > > > > OpenGL API calls happen before the server's OpenGL API calls.
> > > > > > >
> > > > > > > A good example of explicit synchronization is the Vulkan API.  There,
> > > > > > > a client (or multiple clients) can simultaneously build command
> > > > > > > buffers in different threads where one of those command buffers
> > > > > > > renders to an image and the other textures from it and then submit
> > > > > > > both of them at the same time with instructions to the driver for
> > > > > > > which order to execute them in.  The execution order is described via
> > > > > > > the VkSemaphore primitive.  With the new VK_KHR_timeline_semaphore
> > > > > > > extension, you can even submit the work which does the texturing
> > > > > > > BEFORE the work which does the rendering and the driver will sort it
> > > > > > > out.
> > > > > > >
> > > > > > > The #1 problem with implicit synchronization (which explicit solves)
> > > > > > > is that it leads to a lot of over-synchronization both in client space
> > > > > > > and in driver/device space.  The client has to synchronize a lot more
> > > > > > > because it has to ensure that the API calls happen in a particular
> > > > > > > order.  The driver/device have to synchronize a lot more because they
> > > > > > > never know what is going to end up being a synchronization point as an
> > > > > > > API call on another thread/process may occur at any time.  As we move
> > > > > > > to more and more multi-threaded programming this synchronization (on
> > > > > > > the client-side especially) becomes more and more painful.
> > > > > > >
> > > > > > >
> > > > > > > ## Current status in Linux
> > > > > > >
> > > > > > > Implicit synchronization in Linux works via the kernel's internal
> > > > > > > dma_buf and dma_fence data structures.  A dma_fence is a tiny object
> > > > > > > which represents the "done" status for some bit of work.  Typically,
> > > > > > > dma_fences are created as a by-product of someone submitting some bit
> > > > > > > of work (say, 3D rendering) to the kernel.  The dma_buf object has a
> > > > > > > set of dma_fences on it representing shared (read) and exclusive
> > > > > > > (write) access to the object.  When work is submitted which, for
> > > > > > > instance, renders to the dma_buf, it's queued waiting on all the fences
> > > > > > > on the dma_buf, and a dma_fence is created representing the end of
> > > > > > > said rendering work and installed as the dma_buf's exclusive
> > > > > > > fence.  This way, the kernel can manage all its internal queues (3D
> > > > > > > rendering, display, video encode, etc.) and know which things to
> > > > > > > submit in what order.
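
For anyone less familiar with the kernel side, this lives in the dma-buf's
reservation object (dma_resv).  Very roughly, and only as an untested sketch
against the kernel APIs of that era, a driver's submit path does something
like the following; real drivers also gather the fences already attached to
the dma-buf and make the new job wait on them first:

  #include <linux/dma-resv.h>
  #include <linux/dma-fence.h>

  /* Sketch: 'fence' signals when the newly submitted rendering completes. */
  static void attach_write_fence(struct dma_resv *resv, struct dma_fence *fence)
  {
      dma_resv_lock(resv, NULL);
      dma_resv_add_excl_fence(resv, fence);   /* exclusive slot = write access */
      dma_resv_unlock(resv);
  }
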
> > > > > > >
> > > > > > > For the last few years, we've had sync_file in the kernel and it's
> > > > > > > plumbed into some drivers.  A sync_file is just a wrapper around a
> > > > > > > single dma_fence.  A sync_file is typically created as a by-product of
> > > > > > > submitting work (3D, compute, etc.) to the kernel and is signaled when
> > > > > > > that work completes.  When a sync_file is created, it is guaranteed by
> > > > > > > the kernel that it will become signaled in finite time and, once it's
> > > > > > > signaled, it remains signaled for the rest of time.  A sync_file is
> > > > > > > represented in UAPIs as a file descriptor and can be used with normal
> > > > > > > file APIs such as dup().  It can be passed into another UAPI which
> > > > > > > does some bit of queued work and the submitted work will wait for the
> > > > > > > sync_file to be triggered before executing.  A sync_file also supports
> > > > > > > poll() if you want to wait on it manually.
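
For anyone who hasn't used them, the userspace side really is just plain file
APIs; something like this untested sketch (error handling omitted, the fd is
assumed to have come from some driver's submit ioctl):

  #include <poll.h>
  #include <string.h>
  #include <sys/ioctl.h>
  #include <linux/sync_file.h>

  /* Block until the fence wrapped by fence_fd signals (POLLIN = signaled). */
  static int wait_sync_file(int fence_fd, int timeout_ms)
  {
      struct pollfd pfd = { .fd = fence_fd, .events = POLLIN };
      return poll(&pfd, 1, timeout_ms);
  }

  /* Merge two sync_files into one that signals when both have signaled. */
  static int merge_sync_files(int fd_a, int fd_b)
  {
      struct sync_merge_data merge;

      memset(&merge, 0, sizeof(merge));
      strcpy(merge.name, "merged");
      merge.fd2 = fd_b;
      if (ioctl(fd_a, SYNC_IOC_MERGE, &merge) < 0)
          return -1;
      return merge.fence;   /* fd of the newly created sync_file */
  }
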
> > > > > > >
> > > > > > > Unfortunately, sync_file is not broadly used and not all kernel GPU
> > > > > > > drivers support it.  Here's a very quick overview of my understanding
> > > > > > > of the status of various components (I don't know the status of
> > > > > > > anything in the media world):
> > > > > > >
> > > > > > >  - Vulkan: Explicit synchronization all the way but we have to go
> > > > > > > implicit as soon as we interact with a window-system.  Vulkan has APIs
> > > > > > > to import/export sync_files to/from its VkSemaphore and VkFence
> > > > > > > synchronization primitives.
> > > > > > >  - OpenGL: Implicit all the way.  There are some EGL extensions to
> > > > > > > enable some forms of explicit sync via sync_file but OpenGL itself is
> > > > > > > still implicit.
> > > > > > >  - Wayland: Currently depends on implicit sync in the kernel (accessed
> > > > > > > via EGL/OpenGL).  There is an unstable extension to allow passing
> > > > > > > sync_files around but it's questionable how useful it is right now
> > > > > > > (more on that later).
> > > > > > >  - X11: With present, it has these "explicit" fence objects but
> > > > > > > they're always a shmfence which lets the X server and client do a
> > > > > > > userspace CPU-side hand-off without going over the socket (and
> > > > > > > round-tripping through the kernel).  However, the only thing that
> > > > > > > fence does is order the OpenGL API calls in the client and server and
> > > > > > > the real synchronization is still implicit.
> > > > > > >  - linux/i915/gem: Fully supports using sync_file or syncobj for explicit
> > > > > > > sync.
> > > > > > >  - linux/amdgpu: Supports sync_file and syncobj but it still
> > > > > > > implicitly syncs sometimes due to its internal memory residency
> > > > > > > handling which can lead to over-synchronization.
> > > > > > >  - KMS: Implicit sync all the way.  There are no KMS APIs which take
> > > > > > > explicit sync primitives.
> > > > > >
> > > > > > Correction:  Apparently, I missed some things.  If you use atomic, KMS
> > > > > > does have explicit in- and out-fences.  Non-atomic users (e.g. X11)
> > > > > > are still in trouble but most Wayland compositors use atomic these
> > > > > > days.
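
For completeness, with atomic the fences ride along as plane/CRTC properties;
a rough, untested libdrm sketch (assumes DRM_CLIENT_CAP_ATOMIC is enabled and
the "IN_FENCE_FD"/"OUT_FENCE_PTR" property IDs have already been looked up):

  #include <stdint.h>
  #include <xf86drm.h>
  #include <xf86drmMode.h>

  /* in_fence_fd: sync_file KMS must wait on before scanning out the new buffer.
   * out_fence_fd: filled in by the kernel; signals when the flip has happened. */
  static int commit_with_fences(int drm_fd, uint32_t plane_id, uint32_t crtc_id,
                                uint32_t in_fence_prop, uint32_t out_fence_prop,
                                int in_fence_fd, int *out_fence_fd)
  {
      drmModeAtomicReq *req = drmModeAtomicAlloc();
      int ret;

      drmModeAtomicAddProperty(req, plane_id, in_fence_prop, in_fence_fd);
      drmModeAtomicAddProperty(req, crtc_id, out_fence_prop,
                               (uint64_t)(uintptr_t)out_fence_fd);
      ret = drmModeAtomicCommit(drm_fd, req, DRM_MODE_ATOMIC_NONBLOCK, NULL);
      drmModeAtomicFree(req);
      return ret;
  }
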
> > > > > >
> > > > > > >  - v4l: ???
> > > > > > >  - gstreamer: ???
> > > > > > >  - Media APIs such as vaapi etc.:  ???
> > > > >
> > > > > GStreamer is a consumer for V4L2, VAAPI and other stuff. Using asynchronous buffer
> > > > > synchronisation is something we do already with GL (even if limited). We place a
> > > > > GLSync object in the pipeline and attach it to the related GstBuffer. We wait on
> > > > > these GLSyncs as late as possible (or supersede the sync if we queue more work
> > > > > into the same GL context). That requires a special mode of operation of course.
> > > > > We don't usually like making lazy blocking calls implicit, as they tend to cause
> > > > > random issues. If we need to wait, we think it's better to wait in the module
> > > > > that is responsible, so in general, we try to negotiate and fall back locally
> > > > > (it's plugin-based, so this can be really messy otherwise).
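
For readers outside GStreamer, the mechanism described above looks roughly
like this (an illustrative sketch using GstGLSyncMeta, not code taken from
any actual element):

  #include <gst/gl/gl.h>

  /* Producer: after queuing the GL commands that fill 'buffer', record a
   * sync point on the buffer instead of blocking with glFinish(). */
  static void producer_mark_done (GstGLContext *context, GstBuffer *buffer)
  {
      GstGLSyncMeta *sync = gst_buffer_add_gl_sync_meta (context, buffer);
      gst_gl_sync_meta_set_sync_point (sync, context);
  }

  /* Consumer: wait as late as possible, in the element that actually needs
   * the data (or skip the wait when queuing into the same GL context). */
  static void consumer_wait (GstGLContext *context, GstBuffer *buffer)
  {
      GstGLSyncMeta *sync = gst_buffer_get_gl_sync_meta (buffer);

      if (sync)
          gst_gl_sync_meta_wait (sync, context);
  }
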
> > > > >
> > > > > So basically this problem needs to be solved in V4L2, VAAPI and other lower
> > > > > level APIs first. We need an API that provides us these fences (in or out), and then
> > > > > we can consider using them. For V4L2, there was an attempt, but it was a bit of
> > > > > a misfit. Your proposal could work, it needs to be tested I guess, but it does not
> > > > > solve some of the other issues that were discussed. Notably for camera capture, where
> > > > > the HW timestamp is captured at about the same time the frame is ready. But the
> > > > > timestamp is not part of the payload, so you need an entire API to asynchronously
> > > > > deliver that metadata. It's the biggest pain point I've found; such an API would
> > > > > be quite invasive or, if made really generic, might just never be adopted widely
> > > > > enough.
> > > >
> > > > Another issue is that V4L2 doesn't offer any guarantee on job ordering.
> > > > When you queue multiple buffers for camera capture for instance, you
> > > > don't know until capture complete in which buffer the frame has been
> > > > captured.
> > >
> > > Is this a Kernel UAPI issue?  Surely the kernel driver knows at the
> > > start of frame capture which buffer it's getting written into.  I
> > > would think that the kernel APIs could be adjusted (if we find good
> > > reason to do so!) such that they return earlier and return a (buffer,
> > > fence) pair.  Am I missing something fundamental about video here?
> >
> > For cameras I believe we could do that, yes. I was pointing out the
> > issues caused by the current API. For video decoders I'll let Nicolas
> > answer the question, he's way more knowledgeable than I am on that
> > topic.
>
> Right now, there is simply no uAPI for supporting asynchronous error
> reporting when fences are involved. That is true for both cameras and
> CODECs. It's likely what all the attempts were missing; I don't know
> enough myself to suggest something.
>
> Now, why stateful video decoders are special is another subject. In
> CODECs, the decoding and the presentation order may differ. For the
> stateful kind of CODEC, a bitstream is passed to the HW. We don't know
> if this bitstream is fully valid, since it is being parsed and
> validated by the firmware. It's also the firmware's job to decide which
> buffer should be presented first.
>
> In most firmware interfaces, that information is communicated back all
> at once when the frame is ready to be presented (which may be quite
> some time after it was decoded). So indeed, a fence model is not really
> easy to add, unless the firmware was designed with that model in mind.

Just to be clear, I think we should do whatever makes sense here and
not try to slam sync_file in when it doesn't make sense just because
we have it.  The more I read on this thread, the less out-fences from
video decode sound like they make sense unless we have a really solid
plan for async error reporting.  It's possible, depending on how many
processes are involved in the pipeline, that async error reporting
could help reduce latency a bit if it let the kernel report the error
directly to the last process in the chain.  However, I'm not convinced
the potential for userspace programmer error is worth it.  That said,
I'm happy to leave that up to the actual video experts. (I just do 3D)

> Nothing of course would prevent the V4L2 framework from generically handling
> out-fences from other producers. It does not even handle implicit fences
> at the moment, which is already quite problematic (I've seen glitches
> on i.MX6/8 and Raspberry Pi 4).
>
> In that specific case, if the fences from the etnaviv and vc graphics drivers
> were exposed, we could solve this issue in userspace. Right now it's
> implicit, so we rely on all DMABuf drivers to have proper support, which
> is not the case. There is V4L2 support for that coming, but the wait is
> done synchronously in a userspace call that was normally non-blocking. So
> that is unlikely to fly.

Yeah... waits in userspace aren't what anyone wants.

> Small note, stateless video decoders don't have this issue. The
> bitstream is validated by userspace, and userspace controls the
> "decode" operation. This one would be a good case for bidirectional
> fencing.

Good to know.

> >
> > > I must admit that V4L is a bit of an odd case since the kernel driver
> > > is the producer and not the consumer.
> >
> > Note that V4L2 can be a consumer too. Video output with V4L2 is less
> > frequent than video capture (but it still exists), and codecs and other
> > memory-to-memory processing devices (colorspace converters, scalers,
> > ...) are both consumers and producers.
> >
> > > > In the normal case buffers are processed in sequence, but if
> > > > an error occurs during capture, they can be recycled internally and put
> > > > to the back of the queue.
> > >
> > > Are those errors something that can happen at any time in the middle
> > > of a frame capture?  If so, that does make things stickier.
> >
> > Yes it can. Think of packet loss when capturing from a USB webcam for
> > instance.
> >
> > > > Unless I'm mistaken, this problem also exists
> > > > with stateful codecs. And if you don't know in advance which buffer you
> > > > will receive from the device, the usefulness of fences becomes very
> > > > questionable :-)
> > >
> > > Yeah, if you really are in a situation where there's no way to know
> > > until the full frame capture has been completed which buffer is next,
> > > then fences are useless.  You aren't in an implicit synchronization
> > > setting either; you're in a "full flush" setting.  It's arguably worse
> > > for performance but perhaps unavoidable?
> >
> > Probably unavoidable in some cases, but nothing that should get in the
> > way for the discussion at hand: there's no need to migrate away from
> > implicit sync when there's no implicit sync in the first place :-)
> >
> > I think we need to analyse the use cases here, and figure out at least
> > guidelines for userspace, otherwise applications will wonder what
> > behaviour to implement, and we'll end up with a wide variety of them.
> > Even just on the kernel side, some V4L2 capture drivers will pass
> > erroneous frames to userspace (thus guaranteeing ordering, but without
> > early notification of errors), some will requeue the frame
> > automatically, and at least one (uvcvideo) has a module parameter to
> > pick the desired behaviour.
>
> Also, from a userspace point of view, the synchronization with the
> "next frame" in V4L2 isn't implicit. We can poll() the device, just
> like we'd do with a fence FD. What the explicit fence gives is a
> unified object we can pass to another driver, or other userspace, so we
> can delegate the wait.
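
Right, the wait itself is already explicit from userspace's point of view,
roughly like this untested sketch (the usual REQBUFS/QBUF setup omitted);
it is just not an object that can be handed to another driver:

  #include <poll.h>
  #include <string.h>
  #include <sys/ioctl.h>
  #include <linux/videodev2.h>

  /* Wait for the next captured frame, then dequeue it -- the poll() here
   * plays the same role a wait on a fence fd would. */
  static int dequeue_next_frame(int v4l2_fd, struct v4l2_buffer *buf)
  {
      struct pollfd pfd = { .fd = v4l2_fd, .events = POLLIN };

      if (poll(&pfd, 1, -1) <= 0)
          return -1;

      memset(buf, 0, sizeof(*buf));
      buf->type = V4L2_BUF_TYPE_VIDEO_CAPTURE;
      buf->memory = V4L2_MEMORY_MMAP;
      return ioctl(v4l2_fd, VIDIOC_DQBUF, buf);
  }
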
>
> You refer to performance in a few places. In streaming, this is often
> measured as real-time throughput. Implicit/explicit fences don't really
> play any role for us in this regard. V4L2 drivers, like m2m drivers,
> work with buffer queues. So you can queue in advance many buffers on
> the OUTPUT device side (which is the input of the m2m), and userspace
> will queue in advance pretty much all free buffers available on the
> CAPTURE side. The driver is never starved in that model, at the cost of
> very large memory consumption of course. Maybe a more visual
> representation would be:
>
>   [pending job] -> [M2M Worker] -> [pending results]
>
> So as long as userspace keeps the pending job queue non-empty, and
> consumes and gives back buffers to write the results into, the
> driver will keep running uninterrupted. Performance remains optimal.
> What isn't optimal is the latency. And what is buggy right now is when a
> DMAbuf with an implicit out-fence is put back into the pending results queue,
> since the fence is ignored.

Yes, that makes sense.  In 3D land, we're very concerned about
latency.  Any time anyone has to stall for anything, it's a potential
hitch in someone's game.  Being delayed by a single extra frame can be
problematic; 2-3 frames puts the gamer at a significant disadvantage.
In video, as long as audio and video are in sync and you aren't
dropping frames, no one really cares about latency as long as hitting
the pause button doesn't take too long.

What concerns me the most, I think, is actually the interop issues.
You mentioned issues with the Raspberry Pi.  Right now, if someone is
rendering frames using a Vulkan driver and trying to pass those on to
V4L for encode or to some other API such as VA-API, we don't really
have a plan for synchronization.  Thanks to dma-buf extensions we at
least have most of a plan for sharing the memory and negotiating image
layouts (strides, tiling, etc.) but no plan for synchronization at
all.  The only thing you can do today is to use a VkFence to CPU wait
for the 3D rendering to be 100% done and then pass the image on to the
encoder.
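
Concretely, that CPU-wait path looks roughly like this today (a sketch only;
command-buffer recording and the dma-buf export are assumed to happen
elsewhere):

  #include <stdint.h>
  #include <vulkan/vulkan.h>

  /* Submit the rendering, then block on the CPU until it is 100% done.
   * Only after the wait is it safe to hand the exported dma-buf to the
   * encoder (V4L2 QBUF, a VA-API encode call, ...). */
  static void render_then_handoff(VkDevice dev, VkQueue queue,
                                  VkCommandBuffer cmd, VkFence fence)
  {
      VkSubmitInfo submit = {
          .sType = VK_STRUCTURE_TYPE_SUBMIT_INFO,
          .commandBufferCount = 1,
          .pCommandBuffers = &cmd,
      };

      vkQueueSubmit(queue, 1, &submit, fence);
      vkWaitForFences(dev, 1, &fence, VK_TRUE, UINT64_MAX);
  }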

The more I look over the various hacks we've done over the course of
the last 4 years to make window systems work, the less confident I am
that I want to expose ANY of them as an official Vulkan extension that
we support long-term.  The one we do have which I'm reasonably happy
to be stuck with is sync_file import/export.  That said, it's sounding
like V4L doesn't support dma-buf implicit sync at all so maybe CPU
waiting with a VkFence is the current state-of-the-art?
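
(For reference, the import/export path I mean is VK_KHR_external_semaphore_fd
with the SYNC_FD handle type; a rough sketch, assuming the extension is
supported and the entry points have been resolved with vkGetDeviceProcAddr:)

  #include <vulkan/vulkan.h>

  /* Entry points from VK_KHR_external_semaphore_fd, resolved elsewhere. */
  static PFN_vkGetSemaphoreFdKHR    pfn_GetSemaphoreFd;
  static PFN_vkImportSemaphoreFdKHR pfn_ImportSemaphoreFd;

  /* Export: a sync_file fd that signals when 'sem' signals. */
  static int semaphore_to_sync_file(VkDevice dev, VkSemaphore sem)
  {
      VkSemaphoreGetFdInfoKHR info = {
          .sType = VK_STRUCTURE_TYPE_SEMAPHORE_GET_FD_INFO_KHR,
          .semaphore = sem,
          .handleType = VK_EXTERNAL_SEMAPHORE_HANDLE_TYPE_SYNC_FD_BIT,
      };
      int fd = -1;

      pfn_GetSemaphoreFd(dev, &info, &fd);
      return fd;
  }

  /* Temporary import: the next wait on 'sem' waits for 'fd' instead. */
  static void sync_file_to_semaphore(VkDevice dev, VkSemaphore sem, int fd)
  {
      VkImportSemaphoreFdInfoKHR info = {
          .sType = VK_STRUCTURE_TYPE_IMPORT_SEMAPHORE_FD_INFO_KHR,
          .semaphore = sem,
          .handleType = VK_EXTERNAL_SEMAPHORE_HANDLE_TYPE_SYNC_FD_BIT,
          .flags = VK_SEMAPHORE_IMPORT_TEMPORARY_BIT,
          .fd = fd,
      };

      pfn_ImportSemaphoreFd(dev, &info);
  }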

--Jason


> >
> > > Trying to understand. :-)
> >
> > So am I :-)
>
> Hehe, same here.
>
> >
> > > > > There are other elements that would implement fencing, notably kmssink, but no
> > > > > one actually dared porting it to atomic KMS, so clearly there is very little
> > > > > community interest. glimagesink could clearly benefit. Right now if we import a
> > > > > DMABuf, and this DMABuf is used for rendering, an implicit fence is attached,
> > > > > which we are unaware of. Philipp Zabel is working on a patch, so V4L2 QBUF would
> > > > > wait, but waiting in QBUF is not allowed if O_NONBLOCK was set (which GStreamer
> > > > > uses), so then the operation will just fail where it worked before (breaking
> > > > > userspace). If it was an explicit fence, we could handle that in GStreamer
> > > > > cleanly as we do for new APIs.
> > > > >
> > > > > > > ## Chicken and egg problems
> > > > > > >
> > > > > > > Ok, this is where it starts getting depressing.  I made the claim
> > > > > > > above that Wayland has an explicit synchronization protocol that's of
> > > > > > > questionable usefulness.  I would claim that basically any bit of
> > > > > > > plumbing we do through window systems is currently of questionable
> > > > > > > usefulness.  Why?
> > > > > > >
> > > > > > > From my perspective, as a Vulkan driver developer, I have to deal with
> > > > > > > the fact that Vulkan is an explicit sync API but Wayland and X11
> > > > > > > aren't.  Unfortunately, the Wayland extension solves zero problems for
> > > > > > > me because I can't really use it unless it's implemented in all of the
> > > > > > > compositors.  Until every Wayland compositor I care about my users
> > > > > > > being able to use (which is basically all of them) supports the
> > > > > > > extension, I have to continue carrying around my pile of hacks to keep
> > > > > > > implicit sync and Vulkan working nicely together.
> > > > > > >
> > > > > > > From the perspective of a Wayland compositor (I used to play in this
> > > > > > > space), they'd love to implement the new explicit sync extension but
> > > > > > > can't.  Sure, they could wire up the extension, but the moment they go
> > > > > > > to flip a client buffer to the screen directly, they discover that KMS
> > > > > > > doesn't support any explicit sync APIs.
> > > > > >
> > > > > > As per the above correction, Wayland compositors aren't nearly as bad
> > > > > > off as I initially thought.  There may still be weird screen capture
> > > > > > cases but the normal cases of compositing and displaying via
> > > > > > KMS/atomic should be in reasonably good shape.
> > > > > >
> > > > > > > So, yes, they can technically
> > > > > > > implement the extension assuming the EGL stack they're running on has
> > > > > > > the sync_file extensions but any client buffers which come in using
> > > > > > > the explicit sync Wayland extension have to be composited and can't be
> > > > > > > scanned out directly.  As a 3D driver developer, I absolutely don't
> > > > > > > want compositors doing that because my users will complain about
> > > > > > > performance issues due to the extra blit.
> > > > > > >
> > > > > > > Ok, so let's say we get KMS wired up with explicit sync.  That solves
> > > > > > > all our problems, right?  It does, right up until someone decides that
> > > > > > > they want to screen capture their Wayland session via some hardware
> > > > > > > media encoder that doesn't support explicit sync.  Now we have to
> > > > > > > plumb it all the way through the media stack, gstreamer, etc.  Great,
> > > > > > > so let's do that!  Oh, but gstreamer won't want to plumb it through
> > > > > > > until they're guaranteed that they can use explicit sync when
> > > > > > > displaying on X11 or Wayland.  Are you seeing the problem?
> > > > > > >
> > > > > > > To make matters worse, since most things are doing implicit
> > > > > > > synchronization today, it's really easy to get your explicit
> > > > > > > synchronization wrong and never notice.  If you forget to pass a
> > > > > > > sync_file into one place (say you never notice KMS doesn't support
> > > > > > > them), it will probably work anyway thanks to all the implicit sync
> > > > > > > that's going on elsewhere.
> > > > > > >
> > > > > > > So, clearly, we all need to go write piles of code that we can't
> > > > > > > actually properly test until everyone else has written their piece and
> > > > > > > then we use explicit sync if and only if all components support it.
> > > > > > > Really?  We're going to do multiple years of development and then just
> > > > > > > hope it works when we finally flip the switch?  That doesn't sound
> > > > > > > like a good plan to me.
> > > > > > >
> > > > > > >
> > > > > > > ## A proposal: Implicit and explicit sync together
> > > > > > >
> > > > > > > How to solve all these chicken-and-egg problems is something I've been
> > > > > > > giving quite a bit of thought (and talking with many others about) in
> > > > > > > the last couple of years.  One motivation for this is that we have to
> > > > > > > deal with a mismatch in Vulkan.  Another motivation is that I'm
> > > > > > > becoming increasingly unhappy with the way that synchronization,
> > > > > > > memory residency, and command submission are inherently intertwined in
> > > > > > > i915 and would like to break things apart.  Towards that end, I have
> > > > > > > an actual proposal.
> > > > > > >
> > > > > > > A couple weeks ago, I sent a series of patches to the dri-devel
> > > > > > > mailing list which adds a pair of new ioctls to dma-buf which allow
> > > > > > > userspace to manually import or export a sync_file from a dma-buf.
> > > > > > > The idea is that something like a Wayland compositor can switch to
> > > > > > > 100% explicit sync internally once the ioctl is available.  If it gets
> > > > > > > buffers in from a client that doesn't use the explicit sync extension,
> > > > > > > it can pull a sync_file from the dma-buf and use that exactly as it
> > > > > > > would a sync_file passed via the explicit sync extension.  When it
> > > > > > > goes to scan out a user buffer and discovers that KMS doesn't accept
> > > > > > > sync_files (or if it tries to use that pesky media encoder no one has
> > > > > > > converted), it can take its sync_file for display and stuff it into
> > > > > > > the dma-buf before handing it to KMS.
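
To make that concrete, userspace usage would look roughly like this; the
struct and ioctl names below are placeholders along the lines of the RFC and
may not match it exactly, so check the actual patches for the real UAPI:

  #include <sys/ioctl.h>
  #include <linux/dma-buf.h>

  /* Export: pull a sync_file out of the dma-buf; it signals once all
   * currently pending (implicit) access has completed. */
  static int dmabuf_export_sync_file(int dmabuf_fd)
  {
      struct dma_buf_export_sync_file arg = { .flags = DMA_BUF_SYNC_RW };

      if (ioctl(dmabuf_fd, DMA_BUF_IOCTL_EXPORT_SYNC_FILE, &arg) < 0)
          return -1;
      return arg.fd;
  }

  /* Import: stuff an explicit sync_file back in, so that purely implicit
   * consumers (legacy KMS, an unconverted media encoder) will wait on it. */
  static int dmabuf_import_sync_file(int dmabuf_fd, int sync_file_fd)
  {
      struct dma_buf_import_sync_file arg = {
          .flags = DMA_BUF_SYNC_WRITE,
          .fd = sync_file_fd,
      };

      return ioctl(dmabuf_fd, DMA_BUF_IOCTL_IMPORT_SYNC_FILE, &arg);
  }
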
> > > > > > >
> > > > > > > Along with the kernel patches, I've also implemented support for this
> > > > > > > in the Vulkan WSI code used by ANV and RADV.  With those patches, the
> > > > > > > only requirement on the Vulkan drivers is that you be able to export
> > > > > > > any VkSemaphore as a sync_file and temporarily import a sync_file into
> > > > > > > any VkFence or VkSemaphore.  As long as that works, the core Vulkan
> > > > > > > driver only ever sees explicit synchronization via sync_file.  The WSI
> > > > > > > code uses these new ioctls to translate the implicit sync of X11 and
> > > > > > > Wayland to the explicit sync the Vulkan driver wants.
> > > > > > >
> > > > > > > I'm hoping (and here's where I want a sanity check) that a simple API
> > > > > > > like this will allow us to finally start moving the Linux ecosystem
> > > > > > > over to explicit synchronization one piece at a time in a way that's
> > > > > > > actually correct.  (No Wayland explicit sync with compositors hoping
> > > > > > > KMS magically works even though it doesn't have a sync_file API.)
> > > > > > > Once some pieces in the ecosystem start moving, there will be
> > > > > > > motivation to start moving others and maybe we can actually build the
> > > > > > > momentum to get most everything converted.
> > > > > > >
> > > > > > > For reference, you can find the kernel RFC patches and mesa MR here:
> > > > > > >
> > > > > > > https://lists.freedesktop.org/archives/dri-devel/2020-March/258833.html
> > > > > > >
> > > > > > > https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/4037
> > > > > > >
> > > > > > > At this point, I welcome your thoughts, comments, objections, and
> > > > > > > maybe even help/review. :-)
>

^ permalink raw reply	[flat|nested] 101+ messages in thread


* Re: [Mesa-dev] Plumbing explicit synchronization through the Linux ecosystem
  2020-03-17 16:27               ` Jason Ekstrand
@ 2020-03-17 17:12                 ` Jacob Lifshay
  -1 siblings, 0 replies; 101+ messages in thread
From: Jacob Lifshay @ 2020-03-17 17:12 UTC (permalink / raw)
  To: Jason Ekstrand
  Cc: Nicolas Dufresne, Daniel Vetter, xorg-devel,
	Maling list - DRI developers,
	wayland-devel @ lists . freedesktop . org, Laurent Pinchart,
	ML mesa-dev, linux-media,
	Discussion of the development of and with GStreamer

One related issue with explicit sync using sync_file: for combined
CPUs/GPUs (where the CPU cores *are* the GPU cores) that do all the
rendering in userspace (like llvmpipe but for Vulkan and with extra
instructions for GPU tasks) but need to synchronize with other
drivers/processes, there should be some way to create an
explicit fence/semaphore from userspace and later signal it. This
seems to conflict with the requirement for a sync_file to complete in
finite time, since the user process could be stopped or killed.

Any ideas?

Jacob Lifshay

^ permalink raw reply	[flat|nested] 101+ messages in thread


* Re: Plumbing explicit synchronization through the Linux ecosystem
  2020-03-17 10:01             ` Michel Dänzer
  (?)
@ 2020-03-17 17:13             ` Marek Olšák
  -1 siblings, 0 replies; 101+ messages in thread
From: Marek Olšák @ 2020-03-17 17:13 UTC (permalink / raw)
  To: Michel Dänzer
  Cc: Daniel Vetter, xorg-devel, Maling list - DRI developers,
	wayland-devel @ lists . freedesktop . org,
	Discussion of the development of and with GStreamer,
	Jason Ekstrand, ML mesa-dev, linux-media


On Tue., Mar. 17, 2020, 06:02 Michel Dänzer, <michel@daenzer.net> wrote:

> On 2020-03-16 7:33 p.m., Marek Olšák wrote:
> > On Mon, Mar 16, 2020 at 5:57 AM Michel Dänzer <michel@daenzer.net> wrote:
> > > On 2020-03-16 4:50 a.m., Marek Olšák wrote:
> > > > The synchronization works because the Mesa driver waits for idle (drains
> > > > the GFX pipeline) at the end of command buffers and there is only 1
> > > > graphics queue, so everything is ordered.
> > > >
> > > > The GFX pipeline runs asynchronously to the command buffer, meaning the
> > > > command buffer only starts draws and doesn't wait for completion. If the
> > > > Mesa driver didn't wait at the end of the command buffer, the command
> > > > buffer would finish and a different process could start execution of its
> > > > own command buffer while shaders of the previous process are still
> > > > running.
> > > >
> > > > If the Mesa driver submits a command buffer internally (because it's full),
> > > > it doesn't wait, so the GFX pipeline doesn't notice that a command buffer
> > > > ended and a new one started.
> > > >
> > > > The waiting at the end of command buffers happens only when the flush is
> > > > external (Swap buffers, glFlush).
> > > >
> > > > It's a performance problem, because the GFX queue is blocked until the GFX
> > > > pipeline is drained at the end of every frame at least.
> > > >
> > > > So explicit fences for SwapBuffers would help.
> > >
> > > Not sure what difference it would make, since the same thing needs to be
> > > done for explicit fences as well, doesn't it?
> >
> > No. Explicit fences don't require userspace to wait for idle in the command
> > buffer. Fences are signalled when the last draw is complete and caches are
> > flushed. Before that happens, any command buffer that is not dependent on
> > the fence can start execution. There is never a need for the GPU to be idle
> > if there is enough independent work to do.
>
> I don't think explicit fences in the context of this discussion imply
> using that different fence signalling mechanism though. My understanding
> is that the API proposed by Jason allows implicit fences to be used as
> explicit ones and vice versa, so presumably they have to use the same
> signalling mechanism.
>
> Anyway, maybe the different fence signalling mechanism you describe
> could be used by the amdgpu kernel driver in general, then Mesa could
> drop the waits for idle and get the benefits with implicit sync as well?
>

Yes. If there is any waiting, it should be done in the GPU scheduler, not
in the command buffer, so that independent command buffers can use the GFX
queue.

Marek


>
> --
> Earthling Michel Dänzer               |               https://redhat.com
> Libre software enthusiast             |             Mesa and X developer
>

[-- Attachment #1.2: Type: text/html, Size: 3970 bytes --]

[-- Attachment #2: Type: text/plain, Size: 160 bytes --]

_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [Mesa-dev] Plumbing explicit synchronization through the Linux ecosystem
  2020-03-17 17:12                 ` Jacob Lifshay
@ 2020-03-17 17:18                   ` Jason Ekstrand
  -1 siblings, 0 replies; 101+ messages in thread
From: Jason Ekstrand @ 2020-03-17 17:18 UTC (permalink / raw)
  To: Jacob Lifshay
  Cc: Nicolas Dufresne, Daniel Vetter, xorg-devel,
	Maling list - DRI developers,
	wayland-devel @ lists . freedesktop . org, Laurent Pinchart,
	ML mesa-dev, open list:DMA BUFFER SHARING FRAMEWORK,
	Discussion of the development of and with GStreamer

On Tue, Mar 17, 2020 at 12:13 PM Jacob Lifshay <programmerjake@gmail.com> wrote:
>
> One related issue with explicit sync using sync_file is that combined
> CPUs/GPUs (the CPU cores *are* the GPU cores) that do all the
> rendering in userspace (like llvmpipe but for Vulkan and with extra
> instructions for GPU tasks) but need to synchronize with other
> drivers/processes is that there should be some way to create an
> explicit fence/semaphore from userspace and later signal it. This
> seems to conflict with the requirement for a sync_file to complete in
> finite time, since the user process could be stopped or killed.

Yeah... That's going to be a problem.  The only way I could see that
working is if you created a sync_file that had a timeout associated
with it.  However, then you run into the issue where you may have
corruption if stuff doesn't complete on time.  Then again, you're not
really dealing with an external unit and so the latency cost of going
across the window system protocol probably isn't massively different
from the latency cost of triggering the sync_file.  Maybe the answer
there is to just do everything in-order and not worry about
synchronization?

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [Mesa-dev] Plumbing explicit synchronization through the Linux ecosystem
  2020-03-17 17:12                 ` Jacob Lifshay
@ 2020-03-17 17:21                   ` Lucas Stach
  -1 siblings, 0 replies; 101+ messages in thread
From: Lucas Stach @ 2020-03-17 17:21 UTC (permalink / raw)
  To: Jacob Lifshay, Jason Ekstrand
  Cc: Daniel Vetter, xorg-devel, Maling list - DRI developers,
	wayland-devel @ lists . freedesktop . org, Laurent Pinchart,
	Discussion of the development of and with GStreamer, ML mesa-dev,
	Nicolas Dufresne, linux-media

Am Dienstag, den 17.03.2020, 10:12 -0700 schrieb Jacob Lifshay:
> One related issue with explicit sync using sync_file is that combined
> CPUs/GPUs (the CPU cores *are* the GPU cores) that do all the
> rendering in userspace (like llvmpipe but for Vulkan and with extra
> instructions for GPU tasks) but need to synchronize with other
> drivers/processes is that there should be some way to create an
> explicit fence/semaphore from userspace and later signal it. This
> seems to conflict with the requirement for a sync_file to complete in
> finite time, since the user process could be stopped or killed.
> 
> Any ideas?

Finite just means "not infinite". If you stop the process that's doing
part of the pipeline processing you block the pipeline, you get to keep
the pieces in that case. That's one of the issues with implicit sync
that explicit may solve: a single client taking way too much time to
render something can block the whole pipeline up until the display
flip. With explicit sync the compositor can just decide to use the last
client buffer if the latest buffer isn't ready by some deadline.
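
As a rough illustration of that fallback (a minimal sketch, not any
particular compositor's code), the compositor can poll() the client's
sync_file with a deadline and reuse the previous buffer if the new one
isn't ready in time; "struct buffer" here is just a placeholder:

#include <poll.h>

struct buffer;  /* placeholder for the compositor's buffer type */

/* Pick which buffer to use for the upcoming flip: the new one if its
 * acquire fence signalled before the deadline, the previous one
 * otherwise. */
static struct buffer *pick_buffer_for_flip(int acquire_fence_fd,
                                           struct buffer *new_buf,
                                           struct buffer *prev_buf,
                                           int deadline_ms)
{
    struct pollfd pfd = {
        .fd = acquire_fence_fd,
        .events = POLLIN,  /* a sync_file becomes readable when it signals */
    };

    if (poll(&pfd, 1, deadline_ms) == 1 && (pfd.revents & POLLIN))
        return new_buf;    /* rendering finished in time */

    return prev_buf;       /* missed the deadline: show the old frame */
}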

With regard to the process getting killed: whatever your sync primitive
is, you need to make sure to signal the fence (possibly with an error
condition set) when you are not going to make progress anymore. So
whatever your means of creating the sync_fd from your software renderer
is, it needs to signal any outstanding fences on the sync_fd when the
fd is closed.

Regards,
Lucas


^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [Mesa-dev] Plumbing explicit synchronization through the Linux ecosystem
  2020-03-17 15:33             ` Nicolas Dufresne
@ 2020-03-17 17:34               ` Lucas Stach
  -1 siblings, 0 replies; 101+ messages in thread
From: Lucas Stach @ 2020-03-17 17:34 UTC (permalink / raw)
  To: Nicolas Dufresne, Laurent Pinchart, Jason Ekstrand
  Cc: Daniel Vetter, xorg-devel, Maling list - DRI developers,
	wayland-devel @ lists . freedesktop . org,
	Discussion of the development of and with GStreamer, ML mesa-dev,
	linux-media

Am Dienstag, den 17.03.2020, 11:33 -0400 schrieb Nicolas Dufresne:
> Le lundi 16 mars 2020 à 23:15 +0200, Laurent Pinchart a écrit :
> > Hi Jason,
> > 
> > On Mon, Mar 16, 2020 at 10:06:07AM -0500, Jason Ekstrand wrote:
> > > On Mon, Mar 16, 2020 at 5:20 AM Laurent Pinchart wrote:
> > > > On Wed, Mar 11, 2020 at 04:18:55PM -0400, Nicolas Dufresne wrote:
> > > > > (I know I'm going to be spammed by so many mailing list ...)
> > > > > 
> > > > > Le mercredi 11 mars 2020 à 14:21 -0500, Jason Ekstrand a écrit :
> > > > > > On Wed, Mar 11, 2020 at 12:31 PM Jason Ekstrand <jason@jlekstrand.net> wrote:
> > > > > > > All,
> > > > > > > 
> > > > > > > Sorry for casting such a broad net with this one. I'm sure most people
> > > > > > > who reply will get at least one mailing list rejection.  However, this
> > > > > > > is an issue that affects a LOT of components and that's why it's
> > > > > > > thorny to begin with.  Please pardon the length of this e-mail as
> > > > > > > well; I promise there's a concrete point/proposal at the end.
> > > > > > > 
> > > > > > > 
> > > > > > > Explicit synchronization is the future of graphics and media.  At
> > > > > > > least, that seems to be the consensus among all the graphics people
> > > > > > > I've talked to.  I had a chat with one of the lead Android graphics
> > > > > > > engineers recently who told me that doing explicit sync from the start
> > > > > > > was one of the best engineering decisions Android ever made.  It's
> > > > > > > also the direction being taken by more modern APIs such as Vulkan.
> > > > > > > 
> > > > > > > 
> > > > > > > ## What are implicit and explicit synchronization?
> > > > > > > 
> > > > > > > For those that aren't familiar with this space, GPUs, media encoders,
> > > > > > > etc. are massively parallel and synchronization of some form is
> > > > > > > required to ensure that everything happens in the right order and
> > > > > > > avoid data races.  Implicit synchronization is when bits of work (3D,
> > > > > > > compute, video encode, etc.) are implicitly based on the absolute
> > > > > > > CPU-time order in which API calls occur.  Explicit synchronization is
> > > > > > > when the client (whatever that means in any given context) provides
> > > > > > > the dependency graph explicitly via some sort of synchronization
> > > > > > > primitives.  If you're still confused, consider the following
> > > > > > > examples:
> > > > > > > 
> > > > > > > With OpenGL and EGL, almost everything is implicit sync.  Say you have
> > > > > > > two OpenGL contexts sharing an image where one writes to it and the
> > > > > > > other textures from it.  The way the OpenGL spec works, the client has
> > > > > > > to make the API calls to render to the image before (in CPU time) it
> > > > > > > makes the API calls which texture from the image.  As long as it does
> > > > > > > this (and maybe inserts a glFlush?), the driver will ensure that the
> > > > > > > rendering completes before the texturing happens and you get correct
> > > > > > > contents.
> > > > > > > 
> > > > > > > Implicit synchronization can also happen across processes.  Wayland,
> > > > > > > for instance, is currently built on implicit sync where the client
> > > > > > > does their rendering and then does a hand-off (via wl_surface::commit)
> > > > > > > to tell the compositor it's done at which point the compositor can now
> > > > > > > texture from the surface.  The hand-off ensures that the client's
> > > > > > > OpenGL API calls happen before the server's OpenGL API calls.
> > > > > > > 
> > > > > > > A good example of explicit synchronization is the Vulkan API.  There,
> > > > > > > a client (or multiple clients) can simultaneously build command
> > > > > > > buffers in different threads where one of those command buffers
> > > > > > > renders to an image and the other textures from it and then submit
> > > > > > > both of them at the same time with instructions to the driver for
> > > > > > > which order to execute them in.  The execution order is described via
> > > > > > > the VkSemaphore primitive.  With the new VK_KHR_timeline_semaphore
> > > > > > > extension, you can even submit the work which does the texturing
> > > > > > > BEFORE the work which does the rendering and the driver will sort it
> > > > > > > out.
> > > > > > > 
> > > > > > > The #1 problem with implicit synchronization (which explicit solves)
> > > > > > > is that it leads to a lot of over-synchronization both in client space
> > > > > > > and in driver/device space.  The client has to synchronize a lot more
> > > > > > > because it has to ensure that the API calls happen in a particular
> > > > > > > order.  The driver/device have to synchronize a lot more because they
> > > > > > > never know what is going to end up being a synchronization point as an
> > > > > > > API call on another thread/process may occur at any time.  As we move
> > > > > > > to more and more multi-threaded programming this synchronization (on
> > > > > > > the client-side especially) becomes more and more painful.
> > > > > > > 
> > > > > > > 
> > > > > > > ## Current status in Linux
> > > > > > > 
> > > > > > > Implicit synchronization in Linux works via a the kernel's internal
> > > > > > > dma_buf and dma_fence data structures.  A dma_fence is a tiny object
> > > > > > > which represents the "done" status for some bit of work.  Typically,
> > > > > > > dma_fences are created as a by-product of someone submitting some bit
> > > > > > > of work (say, 3D rendering) to the kernel.  The dma_buf object has a
> > > > > > > set of dma_fences on it representing shared (read) and exclusive
> > > > > > > (write) access to the object.  When work is submitted which, for
> > > > > > > instance renders to the dma_buf, it's queued waiting on all the fences
> > > > > > > on the dma_buf and and a dma_fence is created representing the end of
> > > > > > > said rendering work and it's installed as the dma_buf's exclusive
> > > > > > > fence.  This way, the kernel can manage all its internal queues (3D
> > > > > > > rendering, display, video encode, etc.) and know which things to
> > > > > > > submit in what order.
> > > > > > > 
> > > > > > > For the last few years, we've had sync_file in the kernel and it's
> > > > > > > plumbed into some drivers.  A sync_file is just a wrapper around a
> > > > > > > single dma_fence.  A sync_file is typically created as a by-product of
> > > > > > > submitting work (3D, compute, etc.) to the kernel and is signaled when
> > > > > > > that work completes.  When a sync_file is created, it is guaranteed by
> > > > > > > the kernel that it will become signaled in finite time and, once it's
> > > > > > > signaled, it remains signaled for the rest of time.  A sync_file is
> > > > > > > represented in UAPIs as a file descriptor and can be used with normal
> > > > > > > file APIs such as dup().  It can be passed into another UAPI which
> > > > > > > does some bit of queue'd work and the submitted work will wait for the
> > > > > > > sync_file to be triggered before executing.  A sync_file also supports
> > > > > > > poll() if  you want to wait on it manually.
> > > > > > > 
> > > > > > > Unfortunately, sync_file is not broadly used and not all kernel GPU
> > > > > > > drivers support it.  Here's a very quick overview of my understanding
> > > > > > > of the status of various components (I don't know the status of
> > > > > > > anything in the media world):
> > > > > > > 
> > > > > > >  - Vulkan: Explicit synchronization all the way but we have to go
> > > > > > > implicit as soon as we interact with a window-system.  Vulkan has APIs
> > > > > > > to import/export sync_files to/from it's VkSemaphore and VkFence
> > > > > > > synchronization primitives.
> > > > > > >  - OpenGL: Implicit all the way.  There are some EGL extensions to
> > > > > > > enable some forms of explicit sync via sync_file but OpenGL itself is
> > > > > > > still implicit.
> > > > > > >  - Wayland: Currently depends on implicit sync in the kernel (accessed
> > > > > > > via EGL/OpenGL).  There is an unstable extension to allow passing
> > > > > > > sync_files around but it's questionable how useful it is right now
> > > > > > > (more on that later).
> > > > > > >  - X11: With present, it has these "explicit" fence objects but
> > > > > > > they're always a shmfence which lets the X server and client do a
> > > > > > > userspace CPU-side hand-off without going over the socket (and
> > > > > > > round-tripping through the kernel).  However, the only thing that
> > > > > > > fence does is order the OpenGL API calls in the client and server and
> > > > > > > the real synchronization is still implicit.
> > > > > > >  - linux/i915/gem: Fully supports using sync_file or syncobj for explicit
> > > > > > > sync.
> > > > > > >  - linux/amdgpu: Supports sync_file and syncobj but it still
> > > > > > > implicitly syncs sometimes due to it's internal memory residency
> > > > > > > handling which can lead to over-synchronization.
> > > > > > >  - KMS: Implicit sync all the way.  There are no KMS APIs which take
> > > > > > > explicit sync primitives.
> > > > > > 
> > > > > > Correction:  Apparently, I missed some things.  If you use atomic, KMS
> > > > > > does have explicit in- and out-fences.  Non-atomic users (e.g. X11)
> > > > > > are still in trouble but most Wayland compositors use atomic these
> > > > > > days
> > > > > > 
> > > > > > >  - v4l: ???
> > > > > > >  - gstreamer: ???
> > > > > > >  - Media APIs such as vaapi etc.:  ???
> > > > > 
> > > > > GStreamer is consumer for V4L2, VAAPI and other stuff. Using asynchronous buffer
> > > > > synchronisation is something we do already with GL (even if limited). We place
> > > > > GLSync object in the pipeline and attach that on related GstBuffer. We wait on
> > > > > these GLSync as late as possible (or superseed the sync if we queue more work
> > > > > into the same GL context). That requires a special mode of operation of course.
> > > > > We don't usually like making lazy blocking call implicit, as it tends to cause
> > > > > random issues. If we need to wait, we think it's better to wait int he module
> > > > > that is responsible, so in general, we try to negotiate and fallback locally
> > > > > (it's plugin base, so this can be really messy otherwise).
> > > > > 
> > > > > So basically this problem needs to be solved in V4L2, VAAPI and other lower
> > > > > level APIs first. We need API that provides us these fence (in or out), and then
> > > > > we can consider using them. For V4L2, there was an attempt, but it was a bit of
> > > > > a miss-fit. Your proposal could work, need to be tested I guess, but it does not
> > > > > solve some of other issues that was discussed. Notably for camera capture, were
> > > > > the HW timestamp is capture about at the same time the frame is ready. But the
> > > > > timestamp is not part of the paylaod, so you need an entire API asynchronously
> > > > > deliver that metadata. It's the biggest pain point I've found, such an API would
> > > > > be quite invasive or if made really generic, might just never be adopted widely
> > > > > enough.
> > > > 
> > > > Another issue is that V4L2 doesn't offer any guarantee on job ordering.
> > > > When you queue multiple buffers for camera capture for instance, you
> > > > don't know until capture complete in which buffer the frame has been
> > > > captured.
> > > 
> > > Is this a Kernel UAPI issue?  Surely the kernel driver knows at the
> > > start of frame capture which buffer it's getting written into.  I
> > > would think that the kernel APIs could be adjusted (if we find good
> > > reason to do so!) such that they return earlier and return a (buffer,
> > > fence) pair.  Am I missing something fundamental about video here?
> > 
> > For cameras I believe we could do that, yes. I was pointing out the
> > issues caused by the current API. For video decoders I'll let Nicolas
> > answer the question, he's way more knowledgeable that I am on that
> > topic.
> 
> Right now, there is simply no uAPI for supporting asynchronous errors
> reporting when fences are invovled. That is true for both camera's and
> CODEC. It's likely what all the attempt was missing, I don't know
> enough myself to suggest something.
> 
> Now, why Stateless video decoders are special is another subject. In
> CODECs, the decoding and the presentation order may differ. For
> Stateless kind of CODEC, a bitstream is passed to the HW. We don't know
> if this bitstream is fully valid, since the it is being parsed and
> validated by the firmware. It's also firmware job to decide which
> buffer should be presented first.
> 
> In most firmware interface, that information is communicated back all
> at once when the frame is ready to be presented (which may be quite
> some time after it was decoded). So indeed, a fence model is not really
> easy to add, unless the firmware was designed with that model in mind.
> 
> Nothing of course would prevent V4L2 framework to generically handle
> out_fence from other producers. It does not even handle implicit fences
> at the moment, which is already quite problematic (I've seen glitches
> on i.MX6/8 and Raspberry Pi 4).
> 
> In that specific case, if the fences from etnaviv, vc graphic drivers
> was exposed, we could solve this issue in userspace. Right now it's
> implicit, so we rely on all DMABuf driver to have proper support, which
> is not the case. There is V4L2 support for that coming, but the wait is
> done synchronously in userspace call that was normally non-blocking. So
> that is unlikely to fly.

If it helps to settle this part of the discussion I happily volunteer
to fix the V4L2 side to wait for the fences without the need for a
synchronous wait in qbuf.

> Small note, stateless video decoders don't have this issue. The
> bitstream is validated by userspace, and userspace controls the
> "decode" operation. This one would be a good case for bidirectional
> fencing.
> 
> > > I must admit that V4L is a bit of an odd case since the kernel driver
> > > is the producer and not the consumer.
> > 
> > Note that V4L2 can be a consumer too. Video output with V4L2 is less
> > frequent than video capture (but it still exists), and codecs and other
> > memory-to-memory processing devices (colorspace converters, scalers,
> > ...) are both consumers and producers.
> > 
> > > > In the normal case buffers are processed in sequence, but if
> > > > an error occurs during capture, they can be recycled internally and put
> > > > to the back of the queue.
> > > 
> > > Are those errors something that can happen at any time in the middle
> > > of a frame capture?  If so, that does make things stickier.
> > 
> > Yes it can. Think of packet loss when capturing from a USB webcam for
> > instance. 
> > 
> > > > Unless I'm mistaken, this problem also exists
> > > > with stateful codecs. And if you don't know in advance which buffer you
> > > > will receive from the device, the usefulness of fences becomes very
> > > > questionable :-)
> > > 
> > > Yeah, if you really are in a situation where there's no way to know
> > > until the full frame capture has been completed which buffer is next,
> > > then fences are useless.  You aren't in an implicit synchronization
> > > setting either; you're in a "full flush" setting.  It's arguably worse
> > > for performance but perhaps unavoidable?
> > 
> > Probably unavoidable in some cases, but nothing that should get in the
> > way for the discussion at hand: there's no need to migrate away from
> > implicit sync when there's implicit sync in the first place :-)
> > 
> > I think we need to analyse the use cases here, and figure out at least
> > guidelines for userspace, otherwise applications will wonder what
> > behaviour to implement, and we'll end up with a wide variety of them.
> > Even just on the kernel side, some V4L2 capture driver will pass
> > erroneous frames to userspace (thus guaranteeing ordering, but without
> > early notification of errors), some will require the frame
> > automatically, and at least one (uvcvideo) has a module parameter to
> > pick the desired behaviour.
> 
> Also, from a userspace point of view, the synchronization with the
> "next frame" in V4L2 isn't implicit. We can poll() the device, just
> like we'd do with a fence FD. What the explicit fence gives, is a
> unified object we can pass to another driver, or other userspace, so we
> can delegate the wait.
> 
> You refer to performance in few places. In streaming, this is often
> measure as real-time throughput. Implicit/explicit fences don't really
> play any role for us in this regard. V4L2 drivers, like m2m drivers,
> works with buffer queues. So you can queue in advance many buffers on
> the OUTPUT device side (which is the input of the m2m), and userspace
> will queue in advance pretty much all free buffers available on the
> CAPTURE side. The driver is never starved in that model, at the cost of
> very large memory consumption of course. Maybe a more visual
> representation would be:
> 
>   [pending job] -> [M2M Worker] -> [pending results]
> 
> So as long as userspace keep the pending job queue non-empty, and that 
> it consumes and give back buffers back to write the results into, the
> driver will keep running un-interrupted. Performance remains optimal.
> What isn't optimal is the latency. And what bugs right now is when a
> DMAbuf implicit out fence is put back into the pending results queue,
> since the fence is ignored.

> > > Trying to understand. :-)
> > 
> > So am I :-)
> 
> Hehe, same here.

V4L2 just has no notion of something being done asynchronously, which
would require a fence. The current protocol is that you only queue
buffers into the kernel when they are idle and can be consumed by the
HW, so there is no need to wait for anything. This requirement is hard
to meet with buffers that are shared with DRM today, as all DRM
userspace relies on the kernel-attached fences being respected until
explicitly told otherwise.

Also, V4L2 only allows dequeueing buffers from the kernel into
userspace once they are done from the HW perspective. So the V4L2
userspace interface already has an implicit CPU sync on the buffer.
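
For reference, a rough sketch of that contract for a CAPTURE queue
importing DMABUFs (not taken from any real application, error handling
trimmed): the buffer has to be idle at QBUF time, poll() on the video
device plays the role a fence would play, and a dequeued buffer is done
from the HW's point of view:

#include <linux/videodev2.h>
#include <sys/ioctl.h>
#include <poll.h>

/* Queue one idle DMABUF on a V4L2 capture queue and wait until the HW
 * has filled it.  There is no fence anywhere on this path; the implicit
 * CPU sync is the DQBUF itself. */
static int capture_one_frame(int video_fd, int dmabuf_fd, unsigned int index)
{
    struct v4l2_buffer buf = {
        .type   = V4L2_BUF_TYPE_VIDEO_CAPTURE,
        .memory = V4L2_MEMORY_DMABUF,
        .index  = index,
    };
    buf.m.fd = dmabuf_fd;                        /* must be idle right now */

    if (ioctl(video_fd, VIDIOC_QBUF, &buf) < 0)
        return -1;

    struct pollfd pfd = { .fd = video_fd, .events = POLLIN };
    if (poll(&pfd, 1, -1) != 1)                  /* wait for a ready buffer */
        return -1;

    return ioctl(video_fd, VIDIOC_DQBUF, &buf);  /* HW is done with it */
}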

Regards,
Lucas



^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [Mesa-dev] Plumbing explicit synchronization through the Linux ecosystem
  2020-03-17 17:21                   ` Lucas Stach
@ 2020-03-17 17:59                     ` Jacob Lifshay
  -1 siblings, 0 replies; 101+ messages in thread
From: Jacob Lifshay @ 2020-03-17 17:59 UTC (permalink / raw)
  To: Lucas Stach
  Cc: Jason Ekstrand, Daniel Vetter, xorg-devel,
	Maling list - DRI developers,
	wayland-devel @ lists . freedesktop . org, Laurent Pinchart,
	Discussion of the development of and with GStreamer, ML mesa-dev,
	Nicolas Dufresne, linux-media

On Tue, Mar 17, 2020 at 10:21 AM Lucas Stach <dev@lynxeye.de> wrote:
>
> Am Dienstag, den 17.03.2020, 10:12 -0700 schrieb Jacob Lifshay:
> > One related issue with explicit sync using sync_file is that combined
> > CPUs/GPUs (the CPU cores *are* the GPU cores) that do all the
> > rendering in userspace (like llvmpipe but for Vulkan and with extra
> > instructions for GPU tasks) but need to synchronize with other
> > drivers/processes is that there should be some way to create an
> > explicit fence/semaphore from userspace and later signal it. This
> > seems to conflict with the requirement for a sync_file to complete in
> > finite time, since the user process could be stopped or killed.
> >
> > Any ideas?
>
> Finite just means "not infinite". If you stop the process that's doing
> part of the pipeline processing you block the pipeline, you get to keep
> the pieces in that case.

Seems reasonable.

> That's one of the issues with implicit sync
> that explicit may solve: a single client taking way too much time to
> render something can block the whole pipeline up until the display
> flip. With explicit sync the compositor can just decide to use the last
> client buffer if the latest buffer isn't ready by some deadline.
>
> With regard to the process getting killed: whatever you sync primitive
> is, you need to make sure to signal the fence (possibly with an error
> condition set) when you are not going to make progress anymore. So
> whatever your means to creating the sync_fd from your software renderer
> is, it needs to signal any outstanding fences on the sync_fd when the
> fd is closed.

I think I found a userspace-accessible way to create sync_files and
dma_fences that would fulfill the requirements:
https://github.com/torvalds/linux/blob/master/drivers/dma-buf/sw_sync.c

I'm just not sure if that's a good interface to use, since it appears
to be designed only for debugging. I'll still have to check whether it
meets the additional requirement of signalling an error when the
process that created the fence is killed.
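
For anyone who hasn't looked at it, this is roughly how that debugfs
interface is used today (the ioctl definitions below live privately in
sw_sync.c rather than in a UAPI header, which is part of why it's
debug-only; a sketch assuming debugfs is mounted at /sys/kernel/debug):

#include <fcntl.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/ioctl.h>
#include <unistd.h>

/* Mirrors the definitions in drivers/dma-buf/sw_sync.c. */
struct sw_sync_create_fence_data {
    uint32_t value;     /* timeline point at which the fence signals */
    char     name[32];
    int32_t  fence;     /* out: sync_file fd */
};

#define SW_SYNC_IOC_MAGIC        'W'
#define SW_SYNC_IOC_CREATE_FENCE _IOWR(SW_SYNC_IOC_MAGIC, 0, \
                                       struct sw_sync_create_fence_data)
#define SW_SYNC_IOC_INC          _IOW(SW_SYNC_IOC_MAGIC, 1, uint32_t)

int main(void)
{
    int timeline = open("/sys/kernel/debug/sync/sw_sync", O_RDWR);
    struct sw_sync_create_fence_data data = { .value = 1 };
    uint32_t inc = 1;

    strcpy(data.name, "swrast-frame");
    ioctl(timeline, SW_SYNC_IOC_CREATE_FENCE, &data);
    /* data.fence is now a sync_file fd that other processes/drivers can
     * wait on; it signals once the timeline reaches 1. */

    /* ... do the software rendering ... */

    ioctl(timeline, SW_SYNC_IOC_INC, &inc);   /* signal the fence */
    close(data.fence);
    close(timeline);  /* the point after which no more progress is made */
    return 0;
}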

Jacob

>
> Regards,
> Lucas
>

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [Mesa-dev] Plumbing explicit synchronization through the Linux ecosystem
  2020-03-17 17:59                     ` Jacob Lifshay
@ 2020-03-17 18:14                       ` Lucas Stach
  -1 siblings, 0 replies; 101+ messages in thread
From: Lucas Stach @ 2020-03-17 18:14 UTC (permalink / raw)
  To: Jacob Lifshay
  Cc: Jason Ekstrand, Daniel Vetter, xorg-devel,
	Maling list - DRI developers,
	wayland-devel @ lists . freedesktop . org, Laurent Pinchart,
	Discussion of the development of and with GStreamer, ML mesa-dev,
	Nicolas Dufresne, linux-media

Am Dienstag, den 17.03.2020, 10:59 -0700 schrieb Jacob Lifshay:
> On Tue, Mar 17, 2020 at 10:21 AM Lucas Stach <dev@lynxeye.de> wrote:
> > Am Dienstag, den 17.03.2020, 10:12 -0700 schrieb Jacob Lifshay:
> > > One related issue with explicit sync using sync_file is that combined
> > > CPUs/GPUs (the CPU cores *are* the GPU cores) that do all the
> > > rendering in userspace (like llvmpipe but for Vulkan and with extra
> > > instructions for GPU tasks) but need to synchronize with other
> > > drivers/processes is that there should be some way to create an
> > > explicit fence/semaphore from userspace and later signal it. This
> > > seems to conflict with the requirement for a sync_file to complete in
> > > finite time, since the user process could be stopped or killed.
> > > 
> > > Any ideas?
> > 
> > Finite just means "not infinite". If you stop the process that's doing
> > part of the pipeline processing you block the pipeline, you get to keep
> > the pieces in that case.
> 
> Seems reasonable.
> 
> > That's one of the issues with implicit sync
> > that explicit may solve: a single client taking way too much time to
> > render something can block the whole pipeline up until the display
> > flip. With explicit sync the compositor can just decide to use the last
> > client buffer if the latest buffer isn't ready by some deadline.
> > 
> > With regard to the process getting killed: whatever you sync primitive
> > is, you need to make sure to signal the fence (possibly with an error
> > condition set) when you are not going to make progress anymore. So
> > whatever your means to creating the sync_fd from your software renderer
> > is, it needs to signal any outstanding fences on the sync_fd when the
> > fd is closed.
> 
> I think I found a userspace-accessible way to create sync_files and
> dma_fences that would fulfill the requirements:
> https://github.com/torvalds/linux/blob/master/drivers/dma-buf/sw_sync.c
> 
> I'm just not sure if that's a good interface to use, since it appears
> to be designed only for debugging. Will have to check for additional
> requirements of signalling an error when the process that created the
> fence is killed.

Something like that can certainly be lifted for general use if it makes
sense. But then with a software renderer I don't really see how fences
help you at all. With a software renderer you know exactly when the
frame is finished and you can just defer pushing it over to the next
pipeline element until that time. You won't gain any parallelism by
using fences as the CPU is busy doing the rendering and will not run
other stuff concurrently, right?

Regards,
Lucas


^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: Plumbing explicit synchronization through the Linux ecosystem
  2020-03-17 16:27               ` Jason Ekstrand
@ 2020-03-17 18:21                 ` Nicolas Dufresne
  -1 siblings, 0 replies; 101+ messages in thread
From: Nicolas Dufresne @ 2020-03-17 18:21 UTC (permalink / raw)
  To: Jason Ekstrand
  Cc: Laurent Pinchart, ML mesa-dev,
	Discussion of the development of and with GStreamer,
	wayland-devel @ lists . freedesktop . org, xorg-devel,
	Maling list - DRI developers, linux-media, Dave Airlie,
	Daniel Vetter, Bas Nieuwenhuizen, Daniel Stone

Le mardi 17 mars 2020 à 11:27 -0500, Jason Ekstrand a écrit :
> On Tue, Mar 17, 2020 at 10:33 AM Nicolas Dufresne <nicolas@ndufresne.ca> wrote:
> > Le lundi 16 mars 2020 à 23:15 +0200, Laurent Pinchart a écrit :
> > > Hi Jason,
> > > 
> > > On Mon, Mar 16, 2020 at 10:06:07AM -0500, Jason Ekstrand wrote:
> > > > On Mon, Mar 16, 2020 at 5:20 AM Laurent Pinchart wrote:
> > > > > On Wed, Mar 11, 2020 at 04:18:55PM -0400, Nicolas Dufresne wrote:
> > > > > > (I know I'm going to be spammed by so many mailing list ...)
> > > > > > 
> > > > > > Le mercredi 11 mars 2020 à 14:21 -0500, Jason Ekstrand a écrit :
> > > > > > > On Wed, Mar 11, 2020 at 12:31 PM Jason Ekstrand <jason@jlekstrand.net> wrote:
> > > > > > > > All,
> > > > > > > > 
> > > > > > > > Sorry for casting such a broad net with this one. I'm sure most people
> > > > > > > > who reply will get at least one mailing list rejection.  However, this
> > > > > > > > is an issue that affects a LOT of components and that's why it's
> > > > > > > > thorny to begin with.  Please pardon the length of this e-mail as
> > > > > > > > well; I promise there's a concrete point/proposal at the end.
> > > > > > > > 
> > > > > > > > 
> > > > > > > > Explicit synchronization is the future of graphics and media.  At
> > > > > > > > least, that seems to be the consensus among all the graphics people
> > > > > > > > I've talked to.  I had a chat with one of the lead Android graphics
> > > > > > > > engineers recently who told me that doing explicit sync from the start
> > > > > > > > was one of the best engineering decisions Android ever made.  It's
> > > > > > > > also the direction being taken by more modern APIs such as Vulkan.
> > > > > > > > 
> > > > > > > > 
> > > > > > > > ## What are implicit and explicit synchronization?
> > > > > > > > 
> > > > > > > > For those that aren't familiar with this space, GPUs, media encoders,
> > > > > > > > etc. are massively parallel and synchronization of some form is
> > > > > > > > required to ensure that everything happens in the right order and
> > > > > > > > avoid data races.  Implicit synchronization is when bits of work (3D,
> > > > > > > > compute, video encode, etc.) are implicitly based on the absolute
> > > > > > > > CPU-time order in which API calls occur.  Explicit synchronization is
> > > > > > > > when the client (whatever that means in any given context) provides
> > > > > > > > the dependency graph explicitly via some sort of synchronization
> > > > > > > > primitives.  If you're still confused, consider the following
> > > > > > > > examples:
> > > > > > > > 
> > > > > > > > With OpenGL and EGL, almost everything is implicit sync.  Say you have
> > > > > > > > two OpenGL contexts sharing an image where one writes to it and the
> > > > > > > > other textures from it.  The way the OpenGL spec works, the client has
> > > > > > > > to make the API calls to render to the image before (in CPU time) it
> > > > > > > > makes the API calls which texture from the image.  As long as it does
> > > > > > > > this (and maybe inserts a glFlush?), the driver will ensure that the
> > > > > > > > rendering completes before the texturing happens and you get correct
> > > > > > > > contents.
> > > > > > > > 
> > > > > > > > Implicit synchronization can also happen across processes.  Wayland,
> > > > > > > > for instance, is currently built on implicit sync where the client
> > > > > > > > does their rendering and then does a hand-off (via wl_surface::commit)
> > > > > > > > to tell the compositor it's done at which point the compositor can now
> > > > > > > > texture from the surface.  The hand-off ensures that the client's
> > > > > > > > OpenGL API calls happen before the server's OpenGL API calls.
> > > > > > > > 
> > > > > > > > A good example of explicit synchronization is the Vulkan API.  There,
> > > > > > > > a client (or multiple clients) can simultaneously build command
> > > > > > > > buffers in different threads where one of those command buffers
> > > > > > > > renders to an image and the other textures from it and then submit
> > > > > > > > both of them at the same time with instructions to the driver for
> > > > > > > > which order to execute them in.  The execution order is described via
> > > > > > > > the VkSemaphore primitive.  With the new VK_KHR_timeline_semaphore
> > > > > > > > extension, you can even submit the work which does the texturing
> > > > > > > > BEFORE the work which does the rendering and the driver will sort it
> > > > > > > > out.
> > > > > > > > 
> > > > > > > > The #1 problem with implicit synchronization (which explicit solves)
> > > > > > > > is that it leads to a lot of over-synchronization both in client space
> > > > > > > > and in driver/device space.  The client has to synchronize a lot more
> > > > > > > > because it has to ensure that the API calls happen in a particular
> > > > > > > > order.  The driver/device have to synchronize a lot more because they
> > > > > > > > never know what is going to end up being a synchronization point as an
> > > > > > > > API call on another thread/process may occur at any time.  As we move
> > > > > > > > to more and more multi-threaded programming this synchronization (on
> > > > > > > > the client-side especially) becomes more and more painful.
> > > > > > > > 
> > > > > > > > 
> > > > > > > > ## Current status in Linux
> > > > > > > > 
> > > > > > > > Implicit synchronization in Linux works via a the kernel's internal
> > > > > > > > dma_buf and dma_fence data structures.  A dma_fence is a tiny object
> > > > > > > > which represents the "done" status for some bit of work.  Typically,
> > > > > > > > dma_fences are created as a by-product of someone submitting some bit
> > > > > > > > of work (say, 3D rendering) to the kernel.  The dma_buf object has a
> > > > > > > > set of dma_fences on it representing shared (read) and exclusive
> > > > > > > > (write) access to the object.  When work is submitted which, for
> > > > > > > > instance renders to the dma_buf, it's queued waiting on all the fences
> > > > > > > > on the dma_buf and and a dma_fence is created representing the end of
> > > > > > > > said rendering work and it's installed as the dma_buf's exclusive
> > > > > > > > fence.  This way, the kernel can manage all its internal queues (3D
> > > > > > > > rendering, display, video encode, etc.) and know which things to
> > > > > > > > submit in what order.
> > > > > > > > 
> > > > > > > > For the last few years, we've had sync_file in the kernel and it's
> > > > > > > > plumbed into some drivers.  A sync_file is just a wrapper around a
> > > > > > > > single dma_fence.  A sync_file is typically created as a by-product of
> > > > > > > > submitting work (3D, compute, etc.) to the kernel and is signaled when
> > > > > > > > that work completes.  When a sync_file is created, it is guaranteed by
> > > > > > > > the kernel that it will become signaled in finite time and, once it's
> > > > > > > > signaled, it remains signaled for the rest of time.  A sync_file is
> > > > > > > > represented in UAPIs as a file descriptor and can be used with normal
> > > > > > > > file APIs such as dup().  It can be passed into another UAPI which
> > > > > > > > does some bit of queue'd work and the submitted work will wait for the
> > > > > > > > sync_file to be triggered before executing.  A sync_file also supports
> > > > > > > > poll() if  you want to wait on it manually.
> > > > > > > > 
> > > > > > > > Unfortunately, sync_file is not broadly used and not all kernel GPU
> > > > > > > > drivers support it.  Here's a very quick overview of my understanding
> > > > > > > > of the status of various components (I don't know the status of
> > > > > > > > anything in the media world):
> > > > > > > > 
> > > > > > > >  - Vulkan: Explicit synchronization all the way but we have to go
> > > > > > > > implicit as soon as we interact with a window-system.  Vulkan has APIs
> > > > > > > > to import/export sync_files to/from it's VkSemaphore and VkFence
> > > > > > > > synchronization primitives.
> > > > > > > >  - OpenGL: Implicit all the way.  There are some EGL extensions to
> > > > > > > > enable some forms of explicit sync via sync_file but OpenGL itself is
> > > > > > > > still implicit.
> > > > > > > >  - Wayland: Currently depends on implicit sync in the kernel (accessed
> > > > > > > > via EGL/OpenGL).  There is an unstable extension to allow passing
> > > > > > > > sync_files around but it's questionable how useful it is right now
> > > > > > > > (more on that later).
> > > > > > > >  - X11: With present, it has these "explicit" fence objects but
> > > > > > > > they're always a shmfence which lets the X server and client do a
> > > > > > > > userspace CPU-side hand-off without going over the socket (and
> > > > > > > > round-tripping through the kernel).  However, the only thing that
> > > > > > > > fence does is order the OpenGL API calls in the client and server and
> > > > > > > > the real synchronization is still implicit.
> > > > > > > >  - linux/i915/gem: Fully supports using sync_file or syncobj for explicit
> > > > > > > > sync.
> > > > > > > >  - linux/amdgpu: Supports sync_file and syncobj but it still
> > > > > > > > implicitly syncs sometimes due to it's internal memory residency
> > > > > > > > handling which can lead to over-synchronization.
> > > > > > > >  - KMS: Implicit sync all the way.  There are no KMS APIs which take
> > > > > > > > explicit sync primitives.
> > > > > > > 
> > > > > > > Correction:  Apparently, I missed some things.  If you use atomic, KMS
> > > > > > > does have explicit in- and out-fences.  Non-atomic users (e.g. X11)
> > > > > > > are still in trouble but most Wayland compositors use atomic these
> > > > > > > days
> > > > > > > 
> > > > > > > >  - v4l: ???
> > > > > > > >  - gstreamer: ???
> > > > > > > >  - Media APIs such as vaapi etc.:  ???
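
(To make the atomic KMS correction above concrete, a rough sketch of how the
explicit fences are wired up through properties -- untested, error handling
omitted; "IN_FENCE_FD" is a per-plane property and "OUT_FENCE_PTR" a per-CRTC
property in the atomic uAPI:

    #include <stdint.h>
    #include <string.h>
    #include <xf86drm.h>
    #include <xf86drmMode.h>

    /* Look up a property id by name on a KMS object. */
    static uint32_t prop_id(int fd, uint32_t obj, uint32_t type, const char *name)
    {
            drmModeObjectProperties *props =
                    drmModeObjectGetProperties(fd, obj, type);
            uint32_t id = 0;
            for (uint32_t i = 0; props && i < props->count_props; i++) {
                    drmModePropertyRes *p = drmModeGetProperty(fd, props->props[i]);
                    if (p && !strcmp(p->name, name))
                            id = p->prop_id;
                    drmModeFreeProperty(p);
            }
            drmModeFreeObjectProperties(props);
            return id;
    }

    int flip_with_fences(int fd, uint32_t crtc, uint32_t plane, uint32_t fb,
                         int in_fence_fd, int *out_fence_fd)
    {
            drmModeAtomicReq *req = drmModeAtomicAlloc();

            /* KMS waits on this sync_file before scanning out the new FB. */
            drmModeAtomicAddProperty(req, plane,
                    prop_id(fd, plane, DRM_MODE_OBJECT_PLANE, "IN_FENCE_FD"),
                    in_fence_fd);
            drmModeAtomicAddProperty(req, plane,
                    prop_id(fd, plane, DRM_MODE_OBJECT_PLANE, "FB_ID"), fb);

            /* The kernel writes back a sync_file fd that signals on flip done. */
            drmModeAtomicAddProperty(req, crtc,
                    prop_id(fd, crtc, DRM_MODE_OBJECT_CRTC, "OUT_FENCE_PTR"),
                    (uint64_t)(uintptr_t)out_fence_fd);

            int ret = drmModeAtomicCommit(fd, req, DRM_MODE_ATOMIC_NONBLOCK, NULL);
            drmModeAtomicFree(req);
            return ret;
    }

A real flip would also set CRTC_ID and the SRC_*/CRTC_* coordinates on the
plane; this only shows the fence plumbing.)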
> > > > > > 
> > > > > > GStreamer is a consumer of V4L2, VAAPI and other stuff. Asynchronous buffer
> > > > > > synchronisation is something we already do with GL (even if limited). We place a
> > > > > > GLSync object in the pipeline and attach it to the related GstBuffer. We wait on
> > > > > > these GLSync objects as late as possible (or supersede the sync if we queue more
> > > > > > work into the same GL context). That requires a special mode of operation of
> > > > > > course. We don't usually like making lazy blocking calls implicit, as that tends
> > > > > > to cause random issues. If we need to wait, we think it's better to wait in the
> > > > > > module that is responsible, so in general we try to negotiate and fall back
> > > > > > locally (it's plugin based, so this can get really messy otherwise).
> > > > > > 
> > > > > > So basically this problem needs to be solved in V4L2, VAAPI and the other lower
> > > > > > level APIs first. We need an API that provides us these fences (in or out), and
> > > > > > then we can consider using them. For V4L2 there was an attempt, but it was a bit
> > > > > > of a misfit. Your proposal could work (it needs to be tested, I guess), but it
> > > > > > does not solve some of the other issues that were discussed. Notably, for camera
> > > > > > capture, the HW timestamp is captured at about the same time the frame is ready,
> > > > > > but the timestamp is not part of the payload, so you need an entire API to
> > > > > > asynchronously deliver that metadata. That's the biggest pain point I've found;
> > > > > > such an API would either be quite invasive or, if made really generic, might
> > > > > > just never be adopted widely enough.
> > > > > 
> > > > > Another issue is that V4L2 doesn't offer any guarantee on job ordering.
> > > > > When you queue multiple buffers for camera capture, for instance, you
> > > > > don't know until the capture completes in which buffer the frame has
> > > > > been captured.
> > > > 
> > > > Is this a Kernel UAPI issue?  Surely the kernel driver knows at the
> > > > start of frame capture which buffer it's getting written into.  I
> > > > would think that the kernel APIs could be adjusted (if we find good
> > > > reason to do so!) such that they return earlier and return a (buffer,
> > > > fence) pair.  Am I missing something fundamental about video here?
> > > 
> > > For cameras I believe we could do that, yes. I was pointing out the
> > > issues caused by the current API. For video decoders I'll let Nicolas
> > > answer the question, he's way more knowledgeable than I am on that
> > > topic.
> > 
> > Right now, there is simply no uAPI for supporting asynchronous error
> > reporting when fences are involved. That is true for both cameras and
> > CODECs. It's likely what all the previous attempts were missing; I don't
> > know enough myself to suggest something.
> > 
> > Now, why stateful video decoders are special is another subject. In
> > CODECs, the decoding and the presentation order may differ. For the
> > stateful kind of CODEC, a bitstream is passed to the HW. We don't know
> > if this bitstream is fully valid, since it is being parsed and
> > validated by the firmware. It's also the firmware's job to decide which
> > buffer should be presented first.
> > 
> > In most firmware interfaces, that information is communicated back all
> > at once when the frame is ready to be presented (which may be quite
> > some time after it was decoded). So indeed, a fence model is not really
> > easy to add, unless the firmware was designed with that model in mind.
> 
> Just to be clear, I think we should do whatever makes sense here and
> not try to slam sync_file in when it doesn't make sense just because
> we have it.  The more I read on this thread, the less out-fences from
> video decode sound like they make sense unless we have a really solid
> plan for async error reporting.  It's possible, depending on how many
> processes are involved in the pipeline, that async error reporting
> could help reduce latency a bit if it let the kernel report the error
> directly to the last process in the chain.  However, I'm not convinced
> the potential for userspace programmer error is worth it.  That said,
> I'm happy to leave that up to the actual video experts. (I just do 3D)
> 
> > Nothing of course would prevent the V4L2 framework from generically
> > handling out_fences from other producers. It does not even handle
> > implicit fences at the moment, which is already quite problematic (I've
> > seen glitches on i.MX6/8 and Raspberry Pi 4).
> > 
> > In that specific case, if the fences from the etnaviv and vc graphics
> > drivers were exposed, we could solve this issue in userspace. Right now
> > it's implicit, so we rely on all DMABuf drivers to have proper support,
> > which is not the case. There is V4L2 support for that coming, but the
> > wait is done synchronously in a userspace call that was normally
> > non-blocking. So that is unlikely to fly.
> 
> Yeah... waits in userspace aren't what anyone wants.
> 
> > Small note, stateless video decoders don't have this issue. The
> > bitstream is validated by userspace, and userspace controls the
> > "decode" operation. This one would be a good case for bidirectional
> > fencing.
> 
> Good to know.
> 
> > > > I must admit that V4L is a bit of an odd case since the kernel driver
> > > > is the producer and not the consumer.
> > > 
> > > Note that V4L2 can be a consumer too. Video output with V4L2 is less
> > > frequent than video capture (but it still exists), and codecs and other
> > > memory-to-memory processing devices (colorspace converters, scalers,
> > > ...) are both consumers and producers.
> > > 
> > > > > In the normal case buffers are processed in sequence, but if
> > > > > an error occurs during capture, they can be recycled internally and put
> > > > > to the back of the queue.
> > > > 
> > > > Are those errors something that can happen at any time in the middle
> > > > of a frame capture?  If so, that does make things stickier.
> > > 
> > > Yes it can. Think of packet loss when capturing from a USB webcam for
> > > instance.
> > > 
> > > > > Unless I'm mistaken, this problem also exists
> > > > > with stateful codecs. And if you don't know in advance which buffer you
> > > > > will receive from the device, the usefulness of fences becomes very
> > > > > questionable :-)
> > > > 
> > > > Yeah, if you really are in a situation where there's no way to know
> > > > until the full frame capture has been completed which buffer is next,
> > > > then fences are useless.  You aren't in an implicit synchronization
> > > > setting either; you're in a "full flush" setting.  It's arguably worse
> > > > for performance but perhaps unavoidable?
> > > 
> > > Probably unavoidable in some cases, but nothing that should get in the
> > > way of the discussion at hand: there's no need to migrate away from
> > > implicit sync when there's no implicit sync in the first place :-)
> > > 
> > > I think we need to analyse the use cases here, and figure out at least
> > > guidelines for userspace, otherwise applications will wonder what
> > > behaviour to implement, and we'll end up with a wide variety of them.
> > > Even just on the kernel side, some V4L2 capture drivers will pass
> > > erroneous frames to userspace (thus guaranteeing ordering, but without
> > > early notification of errors), some will requeue the frame
> > > automatically, and at least one (uvcvideo) has a module parameter to
> > > pick the desired behaviour.
> > 
> > Also, from a userspace point of view, the synchronization with the
> > "next frame" in V4L2 isn't implicit. We can poll() the device, just
> > like we'd do with a fence FD. What the explicit fence gives is a
> > unified object we can pass to another driver, or to other userspace,
> > so we can delegate the wait.
> > 
> > You refer to performance in a few places. In streaming, this is often
> > measured as real-time throughput. Implicit/explicit fences don't really
> > play any role for us in this regard. V4L2 drivers, like m2m drivers,
> > work with buffer queues. So you can queue many buffers in advance on
> > the OUTPUT device side (which is the input of the m2m), and userspace
> > will queue in advance pretty much all free buffers available on the
> > CAPTURE side. The driver is never starved in that model, at the cost of
> > very large memory consumption of course. Maybe a more visual
> > representation would be:
> > 
> >   [pending job] -> [M2M Worker] -> [pending results]
> > 
> > So as long as userspace keeps the pending job queue non-empty, and
> > consumes and gives buffers back to write the results into, the driver
> > will keep running uninterrupted. Performance remains optimal. What
> > isn't optimal is the latency. And what is buggy right now is when a
> > buffer with an implicit DMAbuf out-fence is put back into the pending
> > results queue, since the fence is ignored.
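
(For completeness, the "delegate the wait" part is what sync_file makes cheap:
whoever receives the fd just poll()s it, the same way we poll() the V4L2
device today. A minimal sketch, nothing driver-specific assumed:

    #include <poll.h>

    /* Returns 1 when the fence has signaled, 0 on timeout, -1 on error. */
    int wait_sync_file(int sync_fd, int timeout_ms)
    {
            struct pollfd pfd = { .fd = sync_fd, .events = POLLIN };
            int ret = poll(&pfd, 1, timeout_ms);
            if (ret <= 0)
                    return ret;
            /* POLLIN: signaled; whether it signaled with an error can be
             * queried via the SYNC_IOC_FILE_INFO ioctl if needed. */
            return 1;
    }

A deadline-driven compositor's "use the last buffer if the new one isn't ready
in time" logic would hook in around a wait like this.)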
> 
> Yes, that makes sense.  In 3D land, we're very concerned about
> latency.  Any time anyone has to stall for anything, it's a potential
> hitch in someone's game.  Being delayed by a single extra frame can be
> problematic; 2-3 frames puts the gamer at a significant disadvantage.
> In video, as long as audio and video are in sync and you aren't
> dropping frames, no one really cares about latency as long as hitting
> the pause button doesn't take too long.

Just a note, there exist low latency use cases for streaming too (sub-
frame latency between two devices). But everything I'm aware of is
downstream. The one I have in mind uses a special AXI feature to
synchronize between two HW components, but the implementation uses
neither implicit nor explicit fences; in fact they didn't bother adding
a specific kernel object, you just have to know when you are using
these downstream drivers. We are a bit far from being able to build
generic software on top of that.

That use case is also less prone to capture errors, since instead of a
camera they have an SDI or HDMI receiver.

> 
> What concerns me the most, I think, is actually the interop issues.
> You mentioned issues with the Raspberry Pi.  Right now, if someone is
> rendering frames using a Vulkan driver and trying to pass those on to
> V4L for encode or to some other API such as VA-API, we don't really
> have a plan for synchronization.  Thanks to dma-buf extensions we at
> least have most of a plan for sharing the memory and negotiating image
> layouts (strides, tiling, etc.) but no plan for synchronization at

I didn't know there was a plan for that; this is nice. Right now every
userspace carries this information in a slightly different and
incompatible way, translating, extrapolating, etc. It's all very error
prone.

> all.  The only thing you can do today is to use a VkFence to CPU wait
> for the 3D rendering to be 100% done and then pass the image on to the
> encoder.
> 
> The more I look over the various hacks we've done over the course of
> the last 4 years to make window systems work, the less confident I am
> that I want to expose ANY of them as an official Vulkan extension that
> we support long-term.  The one we do have which I'm reasonably happy
> to be stuck with is sync_file import/export.  That said, it's sounding
> like V4L doesn't support dma-buf implicit sync at all so maybe CPU
> waiting with a VkFence is the current state-of-the-art?
> 
> --Jason
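
(For the record, the sync_file import/export path on the Vulkan side is
VK_KHR_external_semaphore_fd / VK_KHR_external_fence_fd. A rough sketch of the
export direction -- error handling trimmed, and it assumes the semaphore was
created, and has a pending signal operation, with SYNC_FD export in mind:

    #include <vulkan/vulkan.h>

    /* Returns a sync_file fd that signals when the semaphore does, or -1. */
    int export_semaphore_sync_fd(VkDevice dev, VkSemaphore sem)
    {
            VkSemaphoreGetFdInfoKHR info = {
                    .sType      = VK_STRUCTURE_TYPE_SEMAPHORE_GET_FD_INFO_KHR,
                    .semaphore  = sem,
                    .handleType = VK_EXTERNAL_SEMAPHORE_HANDLE_TYPE_SYNC_FD_BIT,
            };
            PFN_vkGetSemaphoreFdKHR get_fd = (PFN_vkGetSemaphoreFdKHR)
                    vkGetDeviceProcAddr(dev, "vkGetSemaphoreFdKHR");
            int fd = -1;
            if (!get_fd || get_fd(dev, &info, &fd) != VK_SUCCESS)
                    return -1;
            return fd;
    }

The fd can then be passed over the wire (Wayland explicit sync) or, with the
proposal quoted below, stuffed back into the dma-buf for implicit-sync
consumers.)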
> 
> 
> > > > Trying to understand. :-)
> > > 
> > > So am I :-)
> > 
> > Hehe, same here.
> > 
> > > > > > There is other elements that would implement fencing, notably kmssink, but no
> > > > > > one actually dared porting it to atomic KMS, so clearly there is very little
> > > > > > comunity interest. glimagsink could clearly benifit. Right now if we import a
> > > > > > DMABuf, and that this DMAbuf is used for render, a implicit fence is attached,
> > > > > > which we are unaware. Philippe Zabbel is working on a patch, so V4L2 QBUF would
> > > > > > wait, but waiting in QBUF is not allowed if O_NONBLOCK was set (which GStreamer
> > > > > > uses), so then the operation will just fail where it worked before (breaking
> > > > > > userspace). If it was an explcit fence, we could handle that in GStreamer
> > > > > > cleanly as we do for new APIs.
> > > > > > 
> > > > > > > > ## Chicken and egg problems
> > > > > > > > 
> > > > > > > > Ok, this is where it starts getting depressing.  I made the claim
> > > > > > > > above that Wayland has an explicit synchronization protocol that's of
> > > > > > > > questionable usefulness.  I would claim that basically any bit of
> > > > > > > > plumbing we do through window systems is currently of questionable
> > > > > > > > usefulness.  Why?
> > > > > > > > 
> > > > > > > > From my perspective, as a Vulkan driver developer, I have to deal with
> > > > > > > > the fact that Vulkan is an explicit sync API but Wayland and X11
> > > > > > > > aren't.  Unfortunately, the Wayland extension solves zero problems for
> > > > > > > > me because I can't really use it unless it's implemented in all of the
> > > > > > > > compositors.  Until every Wayland compositor I care about my users
> > > > > > > > being able to use (which is basically all of them) supports the
> > > > > > > > extension, I have to continue carry around my pile of hacks to keep
> > > > > > > > implicit sync and Vulkan working nicely together.
> > > > > > > > 
> > > > > > > > From the perspective of a Wayland compositor (I used to play in this
> > > > > > > > space), they'd love to implement the new explicit sync extension but
> > > > > > > > can't.  Sure, they could wire up the extension, but the moment they go
> > > > > > > > to flip a client buffer to the screen directly, they discover that KMS
> > > > > > > > doesn't support any explicit sync APIs.
> > > > > > > 
> > > > > > > As per the above correction, Wayland compositors aren't nearly as bad
> > > > > > > off as I initially thought.  There may still be weird screen capture
> > > > > > > cases but the normal cases of compositing and displaying via
> > > > > > > KMS/atomic should be in reasonably good shape.
> > > > > > > 
> > > > > > > > So, yes, they can technically
> > > > > > > > implement the extension assuming the EGL stack they're running on has
> > > > > > > > the sync_file extensions but any client buffers which come in using
> > > > > > > > the explicit sync Wayland extension have to be composited and can't be
> > > > > > > > scanned out directly.  As a 3D driver developer, I absolutely don't
> > > > > > > > want compositors doing that because my users will complain about
> > > > > > > > performance issues due to the extra blit.
> > > > > > > > 
> > > > > > > > Ok, so let's say we get KMS wired up with implicit sync.  That solves
> > > > > > > > all our problems, right?  It does, right up until someone decides that
> > > > > > > > they wan to screen capture their Wayland session via some hardware
> > > > > > > > media encoder that doesn't support explicit sync.  Now we have to
> > > > > > > > plumb it all the way through the media stack, gstreamer, etc.  Great,
> > > > > > > > so let's do that!  Oh, but gstreamer won't want to plumb it through
> > > > > > > > until they're guaranteed that they can use explicit sync when
> > > > > > > > displaying on X11 or Wayland.  Are you seeing the problem?
> > > > > > > > 
> > > > > > > > To make matters worse, since most things are doing implicit
> > > > > > > > synchronization today, it's really easy to get your explicit
> > > > > > > > synchronization wrong and never notice.  If you forget to pass a
> > > > > > > > sync_file into one place (say you never notice KMS doesn't support
> > > > > > > > them), it will probably work anyway thanks to all the implicit sync
> > > > > > > > that's going on elsewhere.
> > > > > > > > 
> > > > > > > > So, clearly, we all need to go write piles of code that we can't
> > > > > > > > actually properly test until everyone else has written their piece and
> > > > > > > > then we use explicit sync if and only if all components support it.
> > > > > > > > Really?  We're going to do multiple years of development and then just
> > > > > > > > hope it works when we finally flip the switch?  That doesn't sound
> > > > > > > > like a good plan to me.
> > > > > > > > 
> > > > > > > > 
> > > > > > > > ## A proposal: Implicit and explicit sync together
> > > > > > > > 
> > > > > > > > How to solve all these chicken-and-egg problems is something I've been
> > > > > > > > giving quite a bit of thought (and talking with many others about) in
> > > > > > > > the last couple of years.  One motivation for this is that we have to
> > > > > > > > deal with a mismatch in Vulkan.  Another motivation is that I'm
> > > > > > > > becoming increasingly unhappy with the way that synchronization,
> > > > > > > > memory residency, and command submission are inherently intertwined in
> > > > > > > > i915 and would like to break things apart.  Towards that end, I have
> > > > > > > > an actual proposal.
> > > > > > > > 
> > > > > > > > A couple weeks ago, I sent a series of patches to the dri-devel
> > > > > > > > mailing list which adds a pair of new ioctls to dma-buf which allow
> > > > > > > > userspace to manually import or export a sync_file from a dma-buf.
> > > > > > > > The idea is that something like a Wayland compositor can switch to
> > > > > > > > 100% explicit sync internally once the ioctl is available.  If it gets
> > > > > > > > buffers in from a client that doesn't use the explicit sync extension,
> > > > > > > > it can pull a sync_file from the dma-buf and use that exactly as it
> > > > > > > > would a sync_file passed via the explicit sync extension.  When it
> > > > > > > > goes to scan out a user buffer and discovers that KMS doesn't accept
> > > > > > > > sync_files (or if it tries to use that pesky media encoder no one has
> > > > > > > > converted), it can take it's sync_file for display and stuff it into
> > > > > > > > the dma-buf before handing it to KMS.
> > > > > > > > 
> > > > > > > > Along with the kernel patches, I've also implemented support for this
> > > > > > > > in the Vulkan WSI code used by ANV and RADV.  With those patches, the
> > > > > > > > only requirement on the Vulkan drivers is that you be able to export
> > > > > > > > any VkSemaphore as a sync_file and temporarily import a sync_file into
> > > > > > > > any VkFence or VkSemaphore.  As long as that works, the core Vulkan
> > > > > > > > driver only ever sees explicit synchronization via sync_file.  The WSI
> > > > > > > > code uses these new ioctls to translate the implicit sync of X11 and
> > > > > > > > Wayland to the explicit sync the Vulkan driver wants.
> > > > > > > > 
> > > > > > > > I'm hoping (and here's where I want a sanity check) that a simple API
> > > > > > > > like this will allow us to finally start moving the Linux ecosystem
> > > > > > > > over to explicit synchronization one piece at a time in a way that's
> > > > > > > > actually correct.  (No Wayland explicit sync with compositors hoping
> > > > > > > > KMS magically works even though it doesn't have a sync_file API.)
> > > > > > > > Once some pieces in the ecosystem start moving, there will be
> > > > > > > > motivation to start moving others and maybe we can actually build the
> > > > > > > > momentum to get most everything converted.
> > > > > > > > 
> > > > > > > > For reference, you can find the kernel RFC patches and mesa MR here:
> > > > > > > > 
> > > > > > > > https://lists.freedesktop.org/archives/dri-devel/2020-March/258833.html
> > > > > > > > 
> > > > > > > > https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/4037
> > > > > > > > 
> > > > > > > > At this point, I welcome your thoughts, comments, objections, and
> > > > > > > > maybe even help/review. :-)
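
(To make the quoted proposal concrete, a sketch of what using the new ioctls
could look like from a compositor, going by the names in the RFC linked above;
the struct layout and flag semantics here are assumptions about that RFC, not
a settled uAPI:

    #include <linux/dma-buf.h>   /* assumed to provide the new ioctls */
    #include <sys/ioctl.h>

    /* Pull the buffer's implicit fences out as a sync_file fd, so an
     * explicit-sync compositor can treat this client like any other. */
    int dmabuf_export_sync_file(int dmabuf_fd)
    {
            struct dma_buf_export_sync_file args = { .flags = DMA_BUF_SYNC_READ };
            if (ioctl(dmabuf_fd, DMA_BUF_IOCTL_EXPORT_SYNC_FILE, &args) < 0)
                    return -1;
            return args.fd;   /* wait on this before reading the buffer */
    }

    /* Stuff a sync_file back in, so implicit-sync consumers (legacy KMS, an
     * unconverted media encoder) will wait on it without knowing any better. */
    int dmabuf_import_sync_file(int dmabuf_fd, int sync_fd)
    {
            struct dma_buf_import_sync_file args = {
                    .flags = DMA_BUF_SYNC_WRITE,
                    .fd    = sync_fd,
            };
            return ioctl(dmabuf_fd, DMA_BUF_IOCTL_IMPORT_SYNC_FILE, &args);
    }

Either direction is meant to degrade gracefully when the component on the
other side is still purely implicit, which is the whole point of the
proposal.)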


^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: Plumbing explicit synchronization through the Linux ecosystem
@ 2020-03-17 18:21                 ` Nicolas Dufresne
  0 siblings, 0 replies; 101+ messages in thread
From: Nicolas Dufresne @ 2020-03-17 18:21 UTC (permalink / raw)
  To: Jason Ekstrand
  Cc: Daniel Vetter, xorg-devel, Maling list - DRI developers,
	wayland-devel @ lists . freedesktop . org, Laurent Pinchart,
	ML mesa-dev, linux-media,
	Discussion of the development of and with GStreamer

Le mardi 17 mars 2020 à 11:27 -0500, Jason Ekstrand a écrit :
> On Tue, Mar 17, 2020 at 10:33 AM Nicolas Dufresne <nicolas@ndufresne.ca> wrote:
> > Le lundi 16 mars 2020 à 23:15 +0200, Laurent Pinchart a écrit :
> > > Hi Jason,
> > > 
> > > On Mon, Mar 16, 2020 at 10:06:07AM -0500, Jason Ekstrand wrote:
> > > > On Mon, Mar 16, 2020 at 5:20 AM Laurent Pinchart wrote:
> > > > > On Wed, Mar 11, 2020 at 04:18:55PM -0400, Nicolas Dufresne wrote:
> > > > > > (I know I'm going to be spammed by so many mailing list ...)
> > > > > > 
> > > > > > Le mercredi 11 mars 2020 à 14:21 -0500, Jason Ekstrand a écrit :
> > > > > > > On Wed, Mar 11, 2020 at 12:31 PM Jason Ekstrand <jason@jlekstrand.net> wrote:
> > > > > > > > All,
> > > > > > > > 
> > > > > > > > Sorry for casting such a broad net with this one. I'm sure most people
> > > > > > > > who reply will get at least one mailing list rejection.  However, this
> > > > > > > > is an issue that affects a LOT of components and that's why it's
> > > > > > > > thorny to begin with.  Please pardon the length of this e-mail as
> > > > > > > > well; I promise there's a concrete point/proposal at the end.
> > > > > > > > 
> > > > > > > > 
> > > > > > > > Explicit synchronization is the future of graphics and media.  At
> > > > > > > > least, that seems to be the consensus among all the graphics people
> > > > > > > > I've talked to.  I had a chat with one of the lead Android graphics
> > > > > > > > engineers recently who told me that doing explicit sync from the start
> > > > > > > > was one of the best engineering decisions Android ever made.  It's
> > > > > > > > also the direction being taken by more modern APIs such as Vulkan.
> > > > > > > > 
> > > > > > > > 
> > > > > > > > ## What are implicit and explicit synchronization?
> > > > > > > > 
> > > > > > > > For those that aren't familiar with this space, GPUs, media encoders,
> > > > > > > > etc. are massively parallel and synchronization of some form is
> > > > > > > > required to ensure that everything happens in the right order and
> > > > > > > > avoid data races.  Implicit synchronization is when bits of work (3D,
> > > > > > > > compute, video encode, etc.) are implicitly based on the absolute
> > > > > > > > CPU-time order in which API calls occur.  Explicit synchronization is
> > > > > > > > when the client (whatever that means in any given context) provides
> > > > > > > > the dependency graph explicitly via some sort of synchronization
> > > > > > > > primitives.  If you're still confused, consider the following
> > > > > > > > examples:
> > > > > > > > 
> > > > > > > > With OpenGL and EGL, almost everything is implicit sync.  Say you have
> > > > > > > > two OpenGL contexts sharing an image where one writes to it and the
> > > > > > > > other textures from it.  The way the OpenGL spec works, the client has
> > > > > > > > to make the API calls to render to the image before (in CPU time) it
> > > > > > > > makes the API calls which texture from the image.  As long as it does
> > > > > > > > this (and maybe inserts a glFlush?), the driver will ensure that the
> > > > > > > > rendering completes before the texturing happens and you get correct
> > > > > > > > contents.
> > > > > > > > 
> > > > > > > > Implicit synchronization can also happen across processes.  Wayland,
> > > > > > > > for instance, is currently built on implicit sync where the client
> > > > > > > > does their rendering and then does a hand-off (via wl_surface::commit)
> > > > > > > > to tell the compositor it's done at which point the compositor can now
> > > > > > > > texture from the surface.  The hand-off ensures that the client's
> > > > > > > > OpenGL API calls happen before the server's OpenGL API calls.
> > > > > > > > 
> > > > > > > > A good example of explicit synchronization is the Vulkan API.  There,
> > > > > > > > a client (or multiple clients) can simultaneously build command
> > > > > > > > buffers in different threads where one of those command buffers
> > > > > > > > renders to an image and the other textures from it and then submit
> > > > > > > > both of them at the same time with instructions to the driver for
> > > > > > > > which order to execute them in.  The execution order is described via
> > > > > > > > the VkSemaphore primitive.  With the new VK_KHR_timeline_semaphore
> > > > > > > > extension, you can even submit the work which does the texturing
> > > > > > > > BEFORE the work which does the rendering and the driver will sort it
> > > > > > > > out.
> > > > > > > > 
> > > > > > > > The #1 problem with implicit synchronization (which explicit solves)
> > > > > > > > is that it leads to a lot of over-synchronization both in client space
> > > > > > > > and in driver/device space.  The client has to synchronize a lot more
> > > > > > > > because it has to ensure that the API calls happen in a particular
> > > > > > > > order.  The driver/device have to synchronize a lot more because they
> > > > > > > > never know what is going to end up being a synchronization point as an
> > > > > > > > API call on another thread/process may occur at any time.  As we move
> > > > > > > > to more and more multi-threaded programming this synchronization (on
> > > > > > > > the client-side especially) becomes more and more painful.
> > > > > > > > 
> > > > > > > > 
> > > > > > > > ## Current status in Linux
> > > > > > > > 
> > > > > > > > Implicit synchronization in Linux works via a the kernel's internal
> > > > > > > > dma_buf and dma_fence data structures.  A dma_fence is a tiny object
> > > > > > > > which represents the "done" status for some bit of work.  Typically,
> > > > > > > > dma_fences are created as a by-product of someone submitting some bit
> > > > > > > > of work (say, 3D rendering) to the kernel.  The dma_buf object has a
> > > > > > > > set of dma_fences on it representing shared (read) and exclusive
> > > > > > > > (write) access to the object.  When work is submitted which, for
> > > > > > > > instance renders to the dma_buf, it's queued waiting on all the fences
> > > > > > > > on the dma_buf and and a dma_fence is created representing the end of
> > > > > > > > said rendering work and it's installed as the dma_buf's exclusive
> > > > > > > > fence.  This way, the kernel can manage all its internal queues (3D
> > > > > > > > rendering, display, video encode, etc.) and know which things to
> > > > > > > > submit in what order.
> > > > > > > > 
> > > > > > > > For the last few years, we've had sync_file in the kernel and it's
> > > > > > > > plumbed into some drivers.  A sync_file is just a wrapper around a
> > > > > > > > single dma_fence.  A sync_file is typically created as a by-product of
> > > > > > > > submitting work (3D, compute, etc.) to the kernel and is signaled when
> > > > > > > > that work completes.  When a sync_file is created, it is guaranteed by
> > > > > > > > the kernel that it will become signaled in finite time and, once it's
> > > > > > > > signaled, it remains signaled for the rest of time.  A sync_file is
> > > > > > > > represented in UAPIs as a file descriptor and can be used with normal
> > > > > > > > file APIs such as dup().  It can be passed into another UAPI which
> > > > > > > > does some bit of queue'd work and the submitted work will wait for the
> > > > > > > > sync_file to be triggered before executing.  A sync_file also supports
> > > > > > > > poll() if  you want to wait on it manually.
> > > > > > > > 
> > > > > > > > Unfortunately, sync_file is not broadly used and not all kernel GPU
> > > > > > > > drivers support it.  Here's a very quick overview of my understanding
> > > > > > > > of the status of various components (I don't know the status of
> > > > > > > > anything in the media world):
> > > > > > > > 
> > > > > > > >  - Vulkan: Explicit synchronization all the way but we have to go
> > > > > > > > implicit as soon as we interact with a window-system.  Vulkan has APIs
> > > > > > > > to import/export sync_files to/from it's VkSemaphore and VkFence
> > > > > > > > synchronization primitives.
> > > > > > > >  - OpenGL: Implicit all the way.  There are some EGL extensions to
> > > > > > > > enable some forms of explicit sync via sync_file but OpenGL itself is
> > > > > > > > still implicit.
> > > > > > > >  - Wayland: Currently depends on implicit sync in the kernel (accessed
> > > > > > > > via EGL/OpenGL).  There is an unstable extension to allow passing
> > > > > > > > sync_files around but it's questionable how useful it is right now
> > > > > > > > (more on that later).
> > > > > > > >  - X11: With present, it has these "explicit" fence objects but
> > > > > > > > they're always a shmfence which lets the X server and client do a
> > > > > > > > userspace CPU-side hand-off without going over the socket (and
> > > > > > > > round-tripping through the kernel).  However, the only thing that
> > > > > > > > fence does is order the OpenGL API calls in the client and server and
> > > > > > > > the real synchronization is still implicit.
> > > > > > > >  - linux/i915/gem: Fully supports using sync_file or syncobj for explicit
> > > > > > > > sync.
> > > > > > > >  - linux/amdgpu: Supports sync_file and syncobj but it still
> > > > > > > > implicitly syncs sometimes due to it's internal memory residency
> > > > > > > > handling which can lead to over-synchronization.
> > > > > > > >  - KMS: Implicit sync all the way.  There are no KMS APIs which take
> > > > > > > > explicit sync primitives.
> > > > > > > 
> > > > > > > Correction:  Apparently, I missed some things.  If you use atomic, KMS
> > > > > > > does have explicit in- and out-fences.  Non-atomic users (e.g. X11)
> > > > > > > are still in trouble but most Wayland compositors use atomic these
> > > > > > > days
> > > > > > > 
> > > > > > > >  - v4l: ???
> > > > > > > >  - gstreamer: ???
> > > > > > > >  - Media APIs such as vaapi etc.:  ???
> > > > > > 
> > > > > > GStreamer is a consumer of V4L2, VAAPI and other stuff. Asynchronous buffer
> > > > > > synchronisation is something we already do with GL (even if limited). We place a
> > > > > > GLSync object in the pipeline and attach it to the related GstBuffer. We wait on
> > > > > > these GLSync objects as late as possible (or supersede the sync if we queue more
> > > > > > work into the same GL context). That requires a special mode of operation of
> > > > > > course. We don't usually like making lazy blocking calls implicit, as that tends
> > > > > > to cause random issues. If we need to wait, we think it's better to wait in the
> > > > > > module that is responsible, so in general we try to negotiate and fall back
> > > > > > locally (it's plugin based, so this can get really messy otherwise).
> > > > > > 
> > > > > > So basically this problem needs to be solved in V4L2, VAAPI and the other lower
> > > > > > level APIs first. We need an API that provides us these fences (in or out), and
> > > > > > then we can consider using them. For V4L2 there was an attempt, but it was a bit
> > > > > > of a misfit. Your proposal could work (it needs to be tested, I guess), but it
> > > > > > does not solve some of the other issues that were discussed. Notably, for camera
> > > > > > capture, the HW timestamp is captured at about the same time the frame is ready,
> > > > > > but the timestamp is not part of the payload, so you need an entire API to
> > > > > > asynchronously deliver that metadata. That's the biggest pain point I've found;
> > > > > > such an API would either be quite invasive or, if made really generic, might
> > > > > > just never be adopted widely enough.
> > > > > 
> > > > > Another issue is that V4L2 doesn't offer any guarantee on job ordering.
> > > > > When you queue multiple buffers for camera capture, for instance, you
> > > > > don't know until the capture completes in which buffer the frame has
> > > > > been captured.
> > > > 
> > > > Is this a Kernel UAPI issue?  Surely the kernel driver knows at the
> > > > start of frame capture which buffer it's getting written into.  I
> > > > would think that the kernel APIs could be adjusted (if we find good
> > > > reason to do so!) such that they return earlier and return a (buffer,
> > > > fence) pair.  Am I missing something fundamental about video here?
> > > 
> > > For cameras I believe we could do that, yes. I was pointing out the
> > > issues caused by the current API. For video decoders I'll let Nicolas
> > > answer the question, he's way more knowledgeable than I am on that
> > > topic.
> > 
> > Right now, there is simply no uAPI for supporting asynchronous error
> > reporting when fences are involved. That is true for both cameras and
> > CODECs. It's likely what all the previous attempts were missing; I don't
> > know enough myself to suggest something.
> > 
> > Now, why stateful video decoders are special is another subject. In
> > CODECs, the decoding and the presentation order may differ. For the
> > stateful kind of CODEC, a bitstream is passed to the HW. We don't know
> > if this bitstream is fully valid, since it is being parsed and
> > validated by the firmware. It's also the firmware's job to decide which
> > buffer should be presented first.
> > 
> > In most firmware interfaces, that information is communicated back all
> > at once when the frame is ready to be presented (which may be quite
> > some time after it was decoded). So indeed, a fence model is not really
> > easy to add, unless the firmware was designed with that model in mind.
> 
> Just to be clear, I think we should do whatever makes sense here and
> not try to slam sync_file in when it doesn't make sense just because
> we have it.  The more I read on this thread, the less out-fences from
> video decode sound like they make sense unless we have a really solid
> plan for async error reporting.  It's possible, depending on how many
> processes are involved in the pipeline, that async error reporting
> could help reduce latency a bit if it let the kernel report the error
> directly to the last process in the chain.  However, I'm not convinced
> the potential for userspace programmer error is worth it.  That said,
> I'm happy to leave that up to the actual video experts. (I just do 3D)
> 
> > Nothing of course would prevent the V4L2 framework from generically
> > handling out_fences from other producers. It does not even handle
> > implicit fences at the moment, which is already quite problematic (I've
> > seen glitches on i.MX6/8 and Raspberry Pi 4).
> > 
> > In that specific case, if the fences from the etnaviv and vc graphics
> > drivers were exposed, we could solve this issue in userspace. Right now
> > it's implicit, so we rely on all DMABuf drivers to have proper support,
> > which is not the case. There is V4L2 support for that coming, but the
> > wait is done synchronously in a userspace call that was normally
> > non-blocking. So that is unlikely to fly.
> 
> Yeah... waits in userspace aren't what anyone wants.
> 
> > Small note, stateless video decoders don't have this issue. The
> > bitstream is validated by userspace, and userspace controls the
> > "decode" operation. This one would be a good case for bidirectional
> > fencing.
> 
> Good to know.
> 
> > > > I must admit that V4L is a bit of an odd case since the kernel driver
> > > > is the producer and not the consumer.
> > > 
> > > Note that V4L2 can be a consumer too. Video output with V4L2 is less
> > > frequent than video capture (but it still exists), and codecs and other
> > > memory-to-memory processing devices (colorspace converters, scalers,
> > > ...) are both consumers and producers.
> > > 
> > > > > In the normal case buffers are processed in sequence, but if
> > > > > an error occurs during capture, they can be recycled internally and put
> > > > > to the back of the queue.
> > > > 
> > > > Are those errors something that can happen at any time in the middle
> > > > of a frame capture?  If so, that does make things stickier.
> > > 
> > > Yes it can. Think of packet loss when capturing from a USB webcam for
> > > instance.
> > > 
> > > > > Unless I'm mistaken, this problem also exists
> > > > > with stateful codecs. And if you don't know in advance which buffer you
> > > > > will receive from the device, the usefulness of fences becomes very
> > > > > questionable :-)
> > > > 
> > > > Yeah, if you really are in a situation where there's no way to know
> > > > until the full frame capture has been completed which buffer is next,
> > > > then fences are useless.  You aren't in an implicit synchronization
> > > > setting either; you're in a "full flush" setting.  It's arguably worse
> > > > for performance but perhaps unavoidable?
> > > 
> > > Probably unavoidable in some cases, but nothing that should get in the
> > > way of the discussion at hand: there's no need to migrate away from
> > > implicit sync when there's no implicit sync in the first place :-)
> > > 
> > > I think we need to analyse the use cases here, and figure out at least
> > > guidelines for userspace, otherwise applications will wonder what
> > > behaviour to implement, and we'll end up with a wide variety of them.
> > > Even just on the kernel side, some V4L2 capture drivers will pass
> > > erroneous frames to userspace (thus guaranteeing ordering, but without
> > > early notification of errors), some will requeue the frame
> > > automatically, and at least one (uvcvideo) has a module parameter to
> > > pick the desired behaviour.
> > 
> > Also, from a userspace point of view, the synchronization with the
> > "next frame" in V4L2 isn't implicit. We can poll() the device, just
> > like we'd do with a fence FD. What the explicit fence gives is a
> > unified object we can pass to another driver, or to other userspace,
> > so we can delegate the wait.
> > 
> > You refer to performance in a few places. In streaming, this is often
> > measured as real-time throughput. Implicit/explicit fences don't really
> > play any role for us in this regard. V4L2 drivers, like m2m drivers,
> > work with buffer queues. So you can queue many buffers in advance on
> > the OUTPUT device side (which is the input of the m2m), and userspace
> > will queue in advance pretty much all free buffers available on the
> > CAPTURE side. The driver is never starved in that model, at the cost of
> > very large memory consumption of course. Maybe a more visual
> > representation would be:
> > 
> >   [pending job] -> [M2M Worker] -> [pending results]
> > 
> > So as long as userspace keeps the pending job queue non-empty, and
> > consumes and gives buffers back to write the results into, the driver
> > will keep running uninterrupted. Performance remains optimal. What
> > isn't optimal is the latency. And what is buggy right now is when a
> > buffer with an implicit DMAbuf out-fence is put back into the pending
> > results queue, since the fence is ignored.
> 
> Yes, that makes sense.  In 3D land, we're very concerned about
> latency.  Any time anyone has to stall for anything, it's a potential
> hitch in someone's game.  Being delayed by a single extra frame can be
> problematic; 2-3 frames puts the gamer at a significant disadvantage.
> In video, as long as audio and video are in sync and you aren't
> dropping frames, no one really cares about latency as long as hitting
> the pause button doesn't take too long.

Just a note, there exist low latency use cases for streaming too (sub-
frame latency between two devices). But everything I'm aware of is
downstream. The one I have in mind uses a special AXI feature to
synchronize between two HW components, but the implementation uses
neither implicit nor explicit fences; in fact they didn't bother adding
a specific kernel object, you just have to know when you are using
these downstream drivers. We are a bit far from being able to build
generic software on top of that.

That use case is also less prone to capture errors, since instead of a
camera they have an SDI or HDMI receiver.

> 
> What concerns me the most, I think, is actually the interop issues.
> You mentioned issues with the Raspberry Pi.  Right now, if someone is
> rendering frames using a Vulkan driver and trying to pass those on to
> V4L for encode or to some other API such as VA-API, we don't really
> have a plan for synchronization.  Thanks to dma-buf extensions we at
> least have most of a plan for sharing the memory and negotiating image
> layouts (strides, tiling, etc.) but no plan for synchronization at

I didn't know there was a plan for that; this is nice. Right now every
userspace carries this information in a slightly different and
incompatible way, translating, extrapolating, etc. It's all very error
prone.

> all.  The only thing you can do today is to use a VkFence to CPU wait
> for the 3D rendering to be 100% done and then pass the image on to the
> encoder.
> 
> The more I look over the various hacks we've done over the course of
> the last 4 years to make window systems work, the less confident I am
> that I want to expose ANY of them as an official Vulkan extension that
> we support long-term.  The one we do have which I'm reasonably happy
> to be stuck with is sync_file import/export.  That said, it's sounding
> like V4L doesn't support dma-buf implicit sync at all so maybe CPU
> waiting with a VkFence is the current state-of-the-art?
> 
> --Jason
> 
> 
> > > > Trying to understand. :-)
> > > 
> > > So am I :-)
> > 
> > Hehe, same here.
> > 
> > > > > > There is other elements that would implement fencing, notably kmssink, but no
> > > > > > one actually dared porting it to atomic KMS, so clearly there is very little
> > > > > > comunity interest. glimagsink could clearly benifit. Right now if we import a
> > > > > > DMABuf, and that this DMAbuf is used for render, a implicit fence is attached,
> > > > > > which we are unaware. Philippe Zabbel is working on a patch, so V4L2 QBUF would
> > > > > > wait, but waiting in QBUF is not allowed if O_NONBLOCK was set (which GStreamer
> > > > > > uses), so then the operation will just fail where it worked before (breaking
> > > > > > userspace). If it was an explcit fence, we could handle that in GStreamer
> > > > > > cleanly as we do for new APIs.
> > > > > > 
> > > > > > > > ## Chicken and egg problems
> > > > > > > > 
> > > > > > > > Ok, this is where it starts getting depressing.  I made the claim
> > > > > > > > above that Wayland has an explicit synchronization protocol that's of
> > > > > > > > questionable usefulness.  I would claim that basically any bit of
> > > > > > > > plumbing we do through window systems is currently of questionable
> > > > > > > > usefulness.  Why?
> > > > > > > > 
> > > > > > > > From my perspective, as a Vulkan driver developer, I have to deal with
> > > > > > > > the fact that Vulkan is an explicit sync API but Wayland and X11
> > > > > > > > aren't.  Unfortunately, the Wayland extension solves zero problems for
> > > > > > > > me because I can't really use it unless it's implemented in all of the
> > > > > > > > compositors.  Until every Wayland compositor I care about my users
> > > > > > > > being able to use (which is basically all of them) supports the
> > > > > > > > extension, I have to continue carry around my pile of hacks to keep
> > > > > > > > implicit sync and Vulkan working nicely together.
> > > > > > > > 
> > > > > > > > From the perspective of a Wayland compositor (I used to play in this
> > > > > > > > space), they'd love to implement the new explicit sync extension but
> > > > > > > > can't.  Sure, they could wire up the extension, but the moment they go
> > > > > > > > to flip a client buffer to the screen directly, they discover that KMS
> > > > > > > > doesn't support any explicit sync APIs.
> > > > > > > 
> > > > > > > As per the above correction, Wayland compositors aren't nearly as bad
> > > > > > > off as I initially thought.  There may still be weird screen capture
> > > > > > > cases but the normal cases of compositing and displaying via
> > > > > > > KMS/atomic should be in reasonably good shape.
> > > > > > > 
> > > > > > > > So, yes, they can technically
> > > > > > > > implement the extension assuming the EGL stack they're running on has
> > > > > > > > the sync_file extensions but any client buffers which come in using
> > > > > > > > the explicit sync Wayland extension have to be composited and can't be
> > > > > > > > scanned out directly.  As a 3D driver developer, I absolutely don't
> > > > > > > > want compositors doing that because my users will complain about
> > > > > > > > performance issues due to the extra blit.
> > > > > > > > 
> > > > > > > > Ok, so let's say we get KMS wired up with explicit sync.  That solves
> > > > > > > > all our problems, right?  It does, right up until someone decides that
> > > > > > > > they want to screen capture their Wayland session via some hardware
> > > > > > > > media encoder that doesn't support explicit sync.  Now we have to
> > > > > > > > plumb it all the way through the media stack, gstreamer, etc.  Great,
> > > > > > > > so let's do that!  Oh, but gstreamer won't want to plumb it through
> > > > > > > > until they're guaranteed that they can use explicit sync when
> > > > > > > > displaying on X11 or Wayland.  Are you seeing the problem?
> > > > > > > > 
> > > > > > > > To make matters worse, since most things are doing implicit
> > > > > > > > synchronization today, it's really easy to get your explicit
> > > > > > > > synchronization wrong and never notice.  If you forget to pass a
> > > > > > > > sync_file into one place (say you never notice KMS doesn't support
> > > > > > > > them), it will probably work anyway thanks to all the implicit sync
> > > > > > > > that's going on elsewhere.
> > > > > > > > 
> > > > > > > > So, clearly, we all need to go write piles of code that we can't
> > > > > > > > actually properly test until everyone else has written their piece and
> > > > > > > > then we use explicit sync if and only if all components support it.
> > > > > > > > Really?  We're going to do multiple years of development and then just
> > > > > > > > hope it works when we finally flip the switch?  That doesn't sound
> > > > > > > > like a good plan to me.
> > > > > > > > 
> > > > > > > > 
> > > > > > > > ## A proposal: Implicit and explicit sync together
> > > > > > > > 
> > > > > > > > How to solve all these chicken-and-egg problems is something I've been
> > > > > > > > giving quite a bit of thought (and talking with many others about) in
> > > > > > > > the last couple of years.  One motivation for this is that we have to
> > > > > > > > deal with a mismatch in Vulkan.  Another motivation is that I'm
> > > > > > > > becoming increasingly unhappy with the way that synchronization,
> > > > > > > > memory residency, and command submission are inherently intertwined in
> > > > > > > > i915 and would like to break things apart.  Towards that end, I have
> > > > > > > > an actual proposal.
> > > > > > > > 
> > > > > > > > A couple weeks ago, I sent a series of patches to the dri-devel
> > > > > > > > mailing list which adds a pair of new ioctls to dma-buf which allow
> > > > > > > > userspace to manually import or export a sync_file from a dma-buf.
> > > > > > > > The idea is that something like a Wayland compositor can switch to
> > > > > > > > 100% explicit sync internally once the ioctl is available.  If it gets
> > > > > > > > buffers in from a client that doesn't use the explicit sync extension,
> > > > > > > > it can pull a sync_file from the dma-buf and use that exactly as it
> > > > > > > > would a sync_file passed via the explicit sync extension.  When it
> > > > > > > > goes to scan out a user buffer and discovers that KMS doesn't accept
> > > > > > > > sync_files (or if it tries to use that pesky media encoder no one has
> > > > > > > > converted), it can take its sync_file for display and stuff it into
> > > > > > > > the dma-buf before handing it to KMS.
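
So, to check my understanding, the compositor side would then look
roughly like this.  I'm going from memory of the RFC here, so the ioctl
and struct names below are illustrative and may not match the actual
patches:

    #include <linux/dma-buf.h>
    #include <sys/ioctl.h>

    /* Pull a fence out of a dma-buf we got from a client that doesn't
     * speak the explicit sync protocol. */
    static int dmabuf_export_fence(int dmabuf_fd)
    {
        struct dma_buf_export_sync_file arg = { .flags = DMA_BUF_SYNC_READ };

        if (ioctl(dmabuf_fd, DMA_BUF_IOCTL_EXPORT_SYNC_FILE, &arg))
            return -1;
        return arg.fd; /* behaves like any other sync_file fd */
    }

    /* And the other direction: before handing the buffer to KMS (or to
     * that unconverted media encoder), stuff our display fence back
     * into the dma-buf so implicit sync keeps working. */
    static int dmabuf_import_fence(int dmabuf_fd, int sync_file_fd)
    {
        struct dma_buf_import_sync_file arg = {
            .flags = DMA_BUF_SYNC_WRITE,
            .fd    = sync_file_fd,
        };

        return ioctl(dmabuf_fd, DMA_BUF_IOCTL_IMPORT_SYNC_FILE, &arg);
    }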
> > > > > > > > 
> > > > > > > > Along with the kernel patches, I've also implemented support for this
> > > > > > > > in the Vulkan WSI code used by ANV and RADV.  With those patches, the
> > > > > > > > only requirement on the Vulkan drivers is that you be able to export
> > > > > > > > any VkSemaphore as a sync_file and temporarily import a sync_file into
> > > > > > > > any VkFence or VkSemaphore.  As long as that works, the core Vulkan
> > > > > > > > driver only ever sees explicit synchronization via sync_file.  The WSI
> > > > > > > > code uses these new ioctls to translate the implicit sync of X11 and
> > > > > > > > Wayland to the explicit sync the Vulkan driver wants.
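
And if I follow, the requirement on the driver side is basically just
the standard external fd extensions.  My own sketch of what I presume
the WSI code does (not the actual code; render_done, acquire_fence,
fd_from_dmabuf and device are placeholders, and the KHR entry points
are obtained via vkGetDeviceProcAddr in real code):

    #include <vulkan/vulkan.h>

    /* Export any VkSemaphore as a sync_file
     * (VK_KHR_external_semaphore_fd, SYNC_FD handle type). */
    VkSemaphoreGetFdInfoKHR get_info = {
        .sType = VK_STRUCTURE_TYPE_SEMAPHORE_GET_FD_INFO_KHR,
        .semaphore = render_done,
        .handleType = VK_EXTERNAL_SEMAPHORE_HANDLE_TYPE_SYNC_FD_BIT,
    };
    int sync_fd = -1;
    vkGetSemaphoreFdKHR(device, &get_info, &sync_fd);

    /* Temporarily import a sync_file into a VkFence
     * (VK_KHR_external_fence_fd, temporary import). */
    VkImportFenceFdInfoKHR import_info = {
        .sType = VK_STRUCTURE_TYPE_IMPORT_FENCE_FD_INFO_KHR,
        .fence = acquire_fence,
        .flags = VK_FENCE_IMPORT_TEMPORARY_BIT,
        .handleType = VK_EXTERNAL_FENCE_HANDLE_TYPE_SYNC_FD_BIT,
        .fd = fd_from_dmabuf,
    };
    vkImportFenceFdKHR(device, &import_info);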
> > > > > > > > 
> > > > > > > > I'm hoping (and here's where I want a sanity check) that a simple API
> > > > > > > > like this will allow us to finally start moving the Linux ecosystem
> > > > > > > > over to explicit synchronization one piece at a time in a way that's
> > > > > > > > actually correct.  (No Wayland explicit sync with compositors hoping
> > > > > > > > KMS magically works even though it doesn't have a sync_file API.)
> > > > > > > > Once some pieces in the ecosystem start moving, there will be
> > > > > > > > motivation to start moving others and maybe we can actually build the
> > > > > > > > momentum to get most everything converted.
> > > > > > > > 
> > > > > > > > For reference, you can find the kernel RFC patches and mesa MR here:
> > > > > > > > 
> > > > > > > > https://lists.freedesktop.org/archives/dri-devel/2020-March/258833.html
> > > > > > > > 
> > > > > > > > https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/4037
> > > > > > > > 
> > > > > > > > At this point, I welcome your thoughts, comments, objections, and
> > > > > > > > maybe even help/review. :-)

_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [Mesa-dev] Plumbing explicit synchronization through the Linux ecosystem
  2020-03-17 18:14                       ` Lucas Stach
@ 2020-03-18  0:16                         ` Jacob Lifshay
  -1 siblings, 0 replies; 101+ messages in thread
From: Jacob Lifshay @ 2020-03-18  0:16 UTC (permalink / raw)
  To: Lucas Stach
  Cc: Jason Ekstrand, Daniel Vetter, xorg-devel,
	Maling list - DRI developers,
	wayland-devel @ lists . freedesktop . org, Laurent Pinchart,
	Discussion of the development of and with GStreamer, ML mesa-dev,
	Nicolas Dufresne, linux-media

On Tue, Mar 17, 2020 at 11:14 AM Lucas Stach <dev@lynxeye.de> wrote:
>
> Am Dienstag, den 17.03.2020, 10:59 -0700 schrieb Jacob Lifshay:
> > I think I found a userspace-accessible way to create sync_files and
> > dma_fences that would fulfill the requirements:
> > https://github.com/torvalds/linux/blob/master/drivers/dma-buf/sw_sync.c
> >
> > I'm just not sure if that's a good interface to use, since it appears
> > to be designed only for debugging. Will have to check for additional
> > requirements of signalling an error when the process that created the
> > fence is killed.
>
> Something like that can certainly be lifted for general use if it makes
> sense. But then with a software renderer I don't really see how fences
> help you at all. With a software renderer you know exactly when the
> frame is finished and you can just defer pushing it over to the next
> pipeline element until that time. You won't gain any parallelism by
> using fences as the CPU is busy doing the rendering and will not run
> other stuff concurrently, right?

There definitely may be other hardware and/or processes that can
process some stuff concurrently with the main application, such as the
compositor and/or video encoding processes (for video capture).
Additionally, from what I understand, sync_file is the standard way to
export and import explicit synchronization between processes and
between drivers on Linux, so it seems like a good idea to support it
from an interoperability standpoint even if it turns out that there
aren't any scheduling/timing benefits.

Jacob

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [Mesa-dev] Plumbing explicit synchronization through the Linux ecosystem
  2020-03-18  0:16                         ` Jacob Lifshay
@ 2020-03-18  2:08                           ` Jason Ekstrand
  -1 siblings, 0 replies; 101+ messages in thread
From: Jason Ekstrand @ 2020-03-18  2:08 UTC (permalink / raw)
  To: Jacob Lifshay
  Cc: Lucas Stach, Daniel Vetter, xorg-devel,
	Maling list - DRI developers,
	wayland-devel @ lists . freedesktop . org, Laurent Pinchart,
	Discussion of the development of and with GStreamer, ML mesa-dev,
	Nicolas Dufresne, open list:DMA BUFFER SHARING FRAMEWORK

On Tue, Mar 17, 2020 at 7:16 PM Jacob Lifshay <programmerjake@gmail.com> wrote:
>
> On Tue, Mar 17, 2020 at 11:14 AM Lucas Stach <dev@lynxeye.de> wrote:
> >
> > Am Dienstag, den 17.03.2020, 10:59 -0700 schrieb Jacob Lifshay:
> > > I think I found a userspace-accessible way to create sync_files and
> > > dma_fences that would fulfill the requirements:
> > > https://github.com/torvalds/linux/blob/master/drivers/dma-buf/sw_sync.c
> > >
> > > I'm just not sure if that's a good interface to use, since it appears
> > > to be designed only for debugging. Will have to check for additional
> > > requirements of signalling an error when the process that created the
> > > fence is killed.

It is expressly only for debugging and testing.  Exposing such an API
to userspace would break the finite time guarantees that are relied
upon to keep sync_file a secure API.

> > Something like that can certainly be lifted for general use if it makes
> > sense. But then with a software renderer I don't really see how fences
> > help you at all. With a software renderer you know exactly when the
> > frame is finished and you can just defer pushing it over to the next
> > pipeline element until that time. You won't gain any parallelism by
> > using fences as the CPU is busy doing the rendering and will not run
> > other stuff concurrently, right?
>
> There definitely may be other hardware and/or processes that can
> process some stuff concurrently with the main application, such as the
> compositor and/or video encoding processes (for video capture).
> Additionally, from what I understand, sync_file is the standard way to
> export and import explicit synchronization between processes and
> between drivers on Linux, so it seems like a good idea to support it
> from an interoperability standpoint even if it turns out that there
> aren't any scheduling/timing benefits.

There are different ways that one can handle interoperability,
however.  One way is to try and make the software rasterizer look as
much like a GPU as possible:  lots of threads to make things as
asynchronous as possible, "real" implementations of semaphores and
fences, etc.  Another is to let a SW rasterizer be a SW rasterizer: do
everything immediately, thread only so you can exercise all the CPU
cores, and minimally implement semaphores and fences well enough to
maintain compatibility.  If you take the first approach, then we have
to solve all these problems with letting userspace create unsignaled
sync_files which it will signal later and figure out how to make it
safe.  If you take the second approach, you'll only ever have to
return already signaled sync_files and there's no problem with the
sync_file finite time guarantees.

--Jason

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [Mesa-dev] Plumbing explicit synchronization through the Linux ecosystem
  2020-03-18  2:08                           ` Jason Ekstrand
@ 2020-03-18  5:20                             ` Jacob Lifshay
  -1 siblings, 0 replies; 101+ messages in thread
From: Jacob Lifshay @ 2020-03-18  5:20 UTC (permalink / raw)
  To: Jason Ekstrand
  Cc: Lucas Stach, Daniel Vetter, xorg-devel,
	Maling list - DRI developers,
	wayland-devel @ lists . freedesktop . org, Laurent Pinchart,
	Discussion of the development of and with GStreamer, ML mesa-dev,
	Nicolas Dufresne, open list:DMA BUFFER SHARING FRAMEWORK

On Tue, Mar 17, 2020 at 7:08 PM Jason Ekstrand <jason@jlekstrand.net> wrote:
>
> On Tue, Mar 17, 2020 at 7:16 PM Jacob Lifshay <programmerjake@gmail.com> wrote:
> >
> > On Tue, Mar 17, 2020 at 11:14 AM Lucas Stach <dev@lynxeye.de> wrote:
> > >
> > > Am Dienstag, den 17.03.2020, 10:59 -0700 schrieb Jacob Lifshay:
> > > > I think I found a userspace-accessible way to create sync_files and
> > > > dma_fences that would fulfill the requirements:
> > > > https://github.com/torvalds/linux/blob/master/drivers/dma-buf/sw_sync.c
> > > >
> > > > I'm just not sure if that's a good interface to use, since it appears
> > > > to be designed only for debugging. Will have to check for additional
> > > > requirements of signalling an error when the process that created the
> > > > fence is killed.
>
> It is expressly only for debugging and testing.  Exposing such an API
> to userspace would break the finite time guarantees that are relied
> upon to keep sync_file a secure API.

Ok, I was figuring that was probably the case.

> > > Something like that can certainly be lifted for general use if it makes
> > > sense. But then with a software renderer I don't really see how fences
> > > help you at all. With a software renderer you know exactly when the
> > > frame is finished and you can just defer pushing it over to the next
> > > pipeline element until that time. You won't gain any parallelism by
> > > using fences as the CPU is busy doing the rendering and will not run
> > > other stuff concurrently, right?
> >
> > There definitely may be other hardware and/or processes that can
> > process some stuff concurrently with the main application, such as the
> > compositor and/or video encoding processes (for video capture).
> > Additionally, from what I understand, sync_file is the standard way to
> > export and import explicit synchronization between processes and
> > between drivers on Linux, so it seems like a good idea to support it
> > from an interoperability standpoint even if it turns out that there
> > aren't any scheduling/timing benefits.
>
> There are different ways that one can handle interoperability,
> however.  One way is to try and make the software rasterizer look as
> much like a GPU as possible:  lots of threads to make things as
> asynchronous as possible, "real" implementations of semaphores and
> fences, etc.

This is basically the route I've picked, though rather than making
lots of native threads, I'm planning on having just one thread per
core and letting a work-stealing scheduler (inspired by Rust's rayon
crate) schedule all the individual render/compute jobs, because that
allows creating many more jobs and therefore finer load balancing.

> Another is to let a SW rasterizer be a SW rasterizer: do
> everything immediately, thread only so you can exercise all the CPU
> cores, and minimally implement semaphores and fences well enough to
> maintain compatibility.  If you take the first approach, then we have
> to solve all these problems with letting userspace create unsignaled
> sync_files which it will signal later and figure out how to make it
> safe.  If you take the second approach, you'll only ever have to
> return already signaled sync_files and there's no problem with the
> sync_file finite time guarantees.

The main issue with doing everything immediately is that a lot of the
function calls that games expect to take a very short time (e.g.
vkQueueSubmit) would instead take a much longer time, potentially
causing problems.

One idea for a safe userspace-backed sync_file is to have a step
counter that counts down until the sync_file is ready, where if
userspace doesn't tell it to count any steps in a certain amount of
time, then the sync_file switches to the error state. This way, it
will error shortly after a process deadlocks for some reason, while
still having the finite-time guarantee.

When the sync_file is created, the step counter would be set to the
number of jobs that the fence is waiting on.

It can also be set to pause the timeout to wait until another
sync_file signals, to handle cases where a sync_file is waiting on a
userspace process that is waiting on another sync_file.

The main issue is that the kernel would have to make sure that the
sync_file graph doesn't have loops, maybe by erroring all sync_files
that it finds in the loop.
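
Roughly, I'm imagining a tiny self-contained kernel driver along these
lines (an untested sketch with made-up names; locking, the ioctl
plumbing, dma_fence_init()/sync_file_create() and the pause/chaining
part are all left out):

    #include <linux/dma-fence.h>
    #include <linux/jiffies.h>
    #include <linux/timer.h>

    #define USERFENCE_TIMEOUT_MS 3000 /* per-step budget, made up */

    struct userfence {
        struct dma_fence base;
        struct timer_list timer;
        unsigned int steps_left; /* jobs userspace still has to finish */
    };

    /* If userspace stops stepping (deadlock, killed, ...), fail the
     * fence instead of letting waiters hang forever. */
    static void userfence_timeout(struct timer_list *t)
    {
        struct userfence *uf = from_timer(uf, t, timer);

        dma_fence_set_error(&uf->base, -ETIMEDOUT);
        dma_fence_signal(&uf->base);
    }

    /* Called from an ioctl each time userspace completes one job. */
    static void userfence_step(struct userfence *uf)
    {
        if (--uf->steps_left == 0) {
            del_timer(&uf->timer);
            dma_fence_signal(&uf->base);
        } else {
            mod_timer(&uf->timer,
                      jiffies + msecs_to_jiffies(USERFENCE_TIMEOUT_MS));
        }
    }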

Does that sound like a good idea?

Jacob

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [Mesa-dev] Plumbing explicit synchronization through the Linux ecosystem
  2020-03-18  5:20                             ` Jacob Lifshay
@ 2020-03-18  6:34                               ` Jason Ekstrand
  -1 siblings, 0 replies; 101+ messages in thread
From: Jason Ekstrand @ 2020-03-18  6:34 UTC (permalink / raw)
  To: Jacob Lifshay
  Cc: Lucas Stach, Daniel Vetter, xorg-devel,
	Maling list - DRI developers,
	wayland-devel @ lists . freedesktop . org, Laurent Pinchart,
	Discussion of the development of and with GStreamer, ML mesa-dev,
	Nicolas Dufresne, open list:DMA BUFFER SHARING FRAMEWORK

On Wed, Mar 18, 2020 at 12:20 AM Jacob Lifshay <programmerjake@gmail.com> wrote:
>
> On Tue, Mar 17, 2020 at 7:08 PM Jason Ekstrand <jason@jlekstrand.net> wrote:
> >
> > On Tue, Mar 17, 2020 at 7:16 PM Jacob Lifshay <programmerjake@gmail.com> wrote:
> > >
> > > On Tue, Mar 17, 2020 at 11:14 AM Lucas Stach <dev@lynxeye.de> wrote:
> > > >
> > > > Am Dienstag, den 17.03.2020, 10:59 -0700 schrieb Jacob Lifshay:
> > > > > I think I found a userspace-accessible way to create sync_files and
> > > > > dma_fences that would fulfill the requirements:
> > > > > https://github.com/torvalds/linux/blob/master/drivers/dma-buf/sw_sync.c
> > > > >
> > > > > I'm just not sure if that's a good interface to use, since it appears
> > > > > to be designed only for debugging. Will have to check for additional
> > > > > requirements of signalling an error when the process that created the
> > > > > fence is killed.
> >
> > It is expressly only for debugging and testing.  Exposing such an API
> > to userspace would break the finite time guarantees that are relied
> > upon to keep sync_file a secure API.
>
> Ok, I was figuring that was probably the case.
>
> > > > Something like that can certainly be lifted for general use if it makes
> > > > sense. But then with a software renderer I don't really see how fences
> > > > help you at all. With a software renderer you know exactly when the
> > > > frame is finished and you can just defer pushing it over to the next
> > > > pipeline element until that time. You won't gain any parallelism by
> > > > using fences as the CPU is busy doing the rendering and will not run
> > > > other stuff concurrently, right?
> > >
> > > There definitely may be other hardware and/or processes that can
> > > process some stuff concurrently with the main application, such as the
> > > compositor and/or video encoding processes (for video capture).
> > > Additionally, from what I understand, sync_file is the standard way to
> > > export and import explicit synchronization between processes and
> > > between drivers on Linux, so it seems like a good idea to support it
> > > from an interoperability standpoint even if it turns out that there
> > > aren't any scheduling/timing benefits.
> >
> > There are different ways that one can handle interoperability,
> > however.  One way is to try and make the software rasterizer look as
> > much like a GPU as possible:  lots of threads to make things as
> > asynchronous as possible, "real" implementations of semaphores and
> > fences, etc.
>
> This is basically the route I've picked, though rather than making
> lots of native threads, I'm planning on having just one thread per
> core and letting a work-stealing scheduler (inspired by Rust's rayon
> crate) schedule all the individual render/compute jobs, because that
> allows creating many more jobs and therefore finer load balancing.
>
> > Another is to let a SW rasterizer be a SW rasterizer: do
> > everything immediately, thread only so you can exercise all the CPU
> > cores, and minimally implement semaphores and fences well enough to
> > maintain compatibility.  If you take the first approach, then we have
> > to solve all these problems with letting userspace create unsignaled
> > sync_files which it will signal later and figure out how to make it
> > safe.  If you take the second approach, you'll only ever have to
> > return already signaled sync_files and there's no problem with the
> > sync_file finite time guarantees.
>
> The main issue with doing everything immediately is that a lot of the
> function calls that games expect to take a very short time (e.g.
> vkQueueSubmit) would instead take a much longer time, potentially
> causing problems.

Do you have any evidence that it will cause problems?  What I said
above is what switfshader is doing and they're running real apps and
I've not heard of it causing any problems.  It's also worth noting
that you would only really have to stall at sync_file export.  You can
async as much as you want internally.

> One idea for a safe userspace-backed sync_file is to have a step
> counter that counts down until the sync_file is ready, where if
> userspace doesn't tell it to count any steps in a certain amount of
> time, then the sync_file switches to the error state. This way, it
> will error shortly after a process deadlocks for some reason, while
> still having the finite-time guarantee.
>
> When the sync_file is created, the step counter would be set to the
> number of jobs that the fence is waiting on.
>
> It can also be set to pause the timeout to wait until another
> sync_file signals, to handle cases where a sync_file is waiting on a
> userspace process that is waiting on another sync_file.
>
> The main issue is that the kernel would have to make sure that the
> sync_file graph doesn't have loops, maybe by erroring all sync_files
> that it finds in the loop.
>
> Does that sound like a good idea?

Honestly, I don't think you'll ever be able to sell that to the kernel
community.  All of the deadlock detection would add massive complexity
to the already non-trivial dma_fence infrastructure and for what
benefit?  So that a software rasterizer can try to pretend to be more
like a GPU?  You're going to need some very serious perf numbers
and/or other proof of necessity if you want to convince the kernel
people to accept that level of complexity/risk.  "I designed my
software to work this way" isn't going to convince anyone of anything
especially when literally every other software rasterizer I'm aware of
is immediate and they work just fine.

--Jason

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [Mesa-dev] Plumbing explicit synchronization through the Linux ecosystem
  2020-03-18  6:34                               ` Jason Ekstrand
@ 2020-03-18  7:27                                 ` Jacob Lifshay
  -1 siblings, 0 replies; 101+ messages in thread
From: Jacob Lifshay @ 2020-03-18  7:27 UTC (permalink / raw)
  To: Jason Ekstrand
  Cc: Lucas Stach, Daniel Vetter, xorg-devel,
	Maling list - DRI developers,
	wayland-devel @ lists . freedesktop . org, Laurent Pinchart,
	Discussion of the development of and with GStreamer, ML mesa-dev,
	Nicolas Dufresne, open list:DMA BUFFER SHARING FRAMEWORK

On Tue, Mar 17, 2020 at 11:35 PM Jason Ekstrand <jason@jlekstrand.net> wrote:
>
> On Wed, Mar 18, 2020 at 12:20 AM Jacob Lifshay <programmerjake@gmail.com> wrote:
> >
> > The main issue with doing everything immediately is that a lot of the
> > function calls that games expect to take a very short time (e.g.
> > vkQueueSubmit) would instead take a much longer time, potentially
> > causing problems.
>
> Do you have any evidence that it will cause problems?  What I said
> above is what SwiftShader is doing and they're running real apps and
> I've not heard of it causing any problems.  It's also worth noting
> that you would only really have to stall at sync_file export.  You can
> async as much as you want internally.

Ok, seems worth trying out.

> > One idea for a safe userspace-backed sync_file is to have a step
> > counter that counts down until the sync_file is ready, where if
> > userspace doesn't tell it to count any steps in a certain amount of
> > time, then the sync_file switches to the error state. This way, it
> > will error shortly after a process deadlocks for some reason, while
> > still having the finite-time guarantee.
> >
> > When the sync_file is created, the step counter would be set to the
> > number of jobs that the fence is waiting on.
> >
> > It can also be set to pause the timeout to wait until another
> > sync_file signals, to handle cases where a sync_file is waiting on a
> > userspace process that is waiting on another sync_file.
> >
> > The main issue is that the kernel would have to make sure that the
> > sync_file graph doesn't have loops, maybe by erroring all sync_files
> > that it finds in the loop.
> >
> > Does that sound like a good idea?
>
> Honestly, I don't think you'll ever be able to sell that to the kernel
> community.  All of the deadlock detection would add massive complexity
> to the already non-trivial dma_fence infrastructure and for what
> benefit?  So that a software rasterizer can try to pretend to be more
> like a GPU?  You're going to need some very serious perf numbers
> and/or other proof of necessity if you want to convince the kernel
> people to accept that level of complexity/risk.  "I designed my
> software to work this way" isn't going to convince anyone of anything
> especially when literally every other software rasterizer I'm aware of
> is immediate and they work just fine.

After some further research, it turns out that it would work to have
all the sync_files that a new sync_file depends on be specified at
creation time. That forces the dependency graph to be a DAG: you can't
depend on a sync_file that doesn't exist yet, so loops are impossible
by design.
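
In uapi terms I'm picturing something like this (names completely made
up, just to show the shape of it):

    /* Hypothetical creation ioctl argument: every dependency has to be
     * an already-existing sync_file, so the graph can only be a DAG. */
    struct userfence_create {
        __u64 deps;       /* userspace pointer to array of sync_file fds */
        __u32 num_deps;   /* number of fds in that array */
        __u32 num_steps;  /* jobs to complete before the fence signals */
        __u32 timeout_ms; /* per-step timeout before erroring the fence */
        __s32 fd;         /* out: fd of the newly created sync_file */
    };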

Since kernel deadlock detection isn't actually required, just timeouts
for the case of halted userspace, does this seem feasible?

I'd guess that it'd require maybe 200-300 lines of code in a
self-contained driver similar to the sync_file debugging driver
mentioned previously but with the additional timeout code for safety.

Jacob

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [Mesa-dev] Plumbing explicit synchronization through the Linux ecosystem
  2020-03-17 17:21                   ` Lucas Stach
@ 2020-03-18 10:05                     ` Michel Dänzer
  -1 siblings, 0 replies; 101+ messages in thread
From: Michel Dänzer @ 2020-03-18 10:05 UTC (permalink / raw)
  To: Lucas Stach, Jacob Lifshay, Jason Ekstrand
  Cc: Daniel Vetter, xorg-devel, linux-media,
	Maling list - DRI developers,
	wayland-devel @ lists . freedesktop . org,
	Discussion of the development of and with GStreamer, ML mesa-dev,
	Nicolas Dufresne, Laurent Pinchart

On 2020-03-17 6:21 p.m., Lucas Stach wrote:
> That's one of the issues with implicit sync that explicit may solve: 
> a single client taking way too much time to render something can 
> block the whole pipeline up until the display flip. With explicit 
> sync the compositor can just decide to use the last client buffer if 
> the latest buffer isn't ready by some deadline.

FWIW, the compositor can do this with implicit sync as well, by polling
a dma-buf fd for the buffer. (Currently, it has to poll for writable,
because waiting for the exclusive fence only isn't enough with amdgpu)
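
i.e. roughly this, untested, with dmabuf_fd / timeout_ms standing in
for whatever the compositor already has at hand:

    #include <poll.h>

    /* Wait (with a deadline) until writing the buffer would be safe,
     * i.e. until all fences attached to the dma-buf have signaled. */
    struct pollfd pfd = { .fd = dmabuf_fd, .events = POLLOUT };

    if (poll(&pfd, 1, timeout_ms) == 0) {
        /* Not ready in time: keep showing the previous client buffer. */
    }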


-- 
Earthling Michel Dänzer               |               https://redhat.com
Libre software enthusiast             |             Mesa and X developer

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [Mesa-dev] Plumbing explicit synchronization through the Linux ecosystem
  2020-03-18 10:05                     ` Michel Dänzer
@ 2020-03-18 13:54                       ` Nicolas Dufresne
  -1 siblings, 0 replies; 101+ messages in thread
From: Nicolas Dufresne @ 2020-03-18 13:54 UTC (permalink / raw)
  To: Michel Dänzer, Lucas Stach, Jacob Lifshay, Jason Ekstrand
  Cc: Daniel Vetter, xorg-devel, linux-media,
	Maling list - DRI developers,
	wayland-devel @ lists . freedesktop . org,
	Discussion of the development of and with GStreamer, ML mesa-dev,
	Laurent Pinchart

Le mercredi 18 mars 2020 à 11:05 +0100, Michel Dänzer a écrit :
> On 2020-03-17 6:21 p.m., Lucas Stach wrote:
> > That's one of the issues with implicit sync that explicit may solve: 
> > a single client taking way too much time to render something can 
> > block the whole pipeline up until the display flip. With explicit 
> > sync the compositor can just decide to use the last client buffer if 
> > the latest buffer isn't ready by some deadline.
> 
> FWIW, the compositor can do this with implicit sync as well, by polling
> a dma-buf fd for the buffer. (Currently, it has to poll for writable,
> because waiting for the exclusive fence only isn't enough with amdgpu)

That is very interesting, thanks for sharing; it could allow fixing
some issues in userspace for backward compatibility.

thanks,
Nicolas


^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [Mesa-dev] Plumbing explicit synchronization through the Linux ecosystem
  2020-03-17 17:18                   ` Jason Ekstrand
@ 2020-03-19 10:34                     ` Daniel Vetter
  -1 siblings, 0 replies; 101+ messages in thread
From: Daniel Vetter @ 2020-03-19 10:34 UTC (permalink / raw)
  To: Jason Ekstrand
  Cc: Jacob Lifshay, Nicolas Dufresne, Daniel Vetter, xorg-devel,
	Maling list - DRI developers,
	wayland-devel @ lists . freedesktop . org, Laurent Pinchart,
	ML mesa-dev, open list:DMA BUFFER SHARING FRAMEWORK,
	Discussion of the development of and with GStreamer

On Tue, Mar 17, 2020 at 12:18:47PM -0500, Jason Ekstrand wrote:
> On Tue, Mar 17, 2020 at 12:13 PM Jacob Lifshay <programmerjake@gmail.com> wrote:
> >
> > One related issue with explicit sync using sync_file is that combined
> > CPUs/GPUs (the CPU cores *are* the GPU cores) that do all the
> > rendering in userspace (like llvmpipe but for Vulkan and with extra
> > instructions for GPU tasks) but need to synchronize with other
> > drivers/processes is that there should be some way to create an
> > explicit fence/semaphore from userspace and later signal it. This
> > seems to conflict with the requirement for a sync_file to complete in
> > finite time, since the user process could be stopped or killed.
> 
> Yeah... That's going to be a problem.  The only way I could see that
> working is if you created a sync_file that had a timeout associated
> with it.  However, then you run into the issue where you may have
> corruption if stuff doesn't complete on time.  Then again, you're not
> really dealing with an external unit and so the latency cost of going
> across the window system protocol probably isn't massively different
> from the latency cost of triggering the sync_file.  Maybe the answer
> there is to just do everything in-order and not worry about
> synchronization?

vgem does that already (fences with timeout). The corruption issue is
also not new: if your shaders take forever, real GPUs will nick your
rendering with a quick reset. IIRC someone (from the CrOS Google team,
maybe) was even looking into making llvmpipe run on top of vgem as a
real dri/drm Mesa driver.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [Mesa-dev] Plumbing explicit synchronization through the Linux ecosystem
  2020-03-18 10:05                     ` Michel Dänzer
@ 2020-03-19 10:37                       ` Daniel Vetter
  0 siblings, 0 replies; 101+ messages in thread
From: Daniel Vetter @ 2020-03-19 10:37 UTC (permalink / raw)
  To: Michel Dänzer
  Cc: Lucas Stach, Jacob Lifshay, Jason Ekstrand, Daniel Vetter,
	xorg-devel, linux-media, Maling list - DRI developers,
	wayland-devel @ lists . freedesktop . org,
	Discussion of the development of and with GStreamer, ML mesa-dev,
	Nicolas Dufresne, Laurent Pinchart

On Wed, Mar 18, 2020 at 11:05:48AM +0100, Michel Dänzer wrote:
> On 2020-03-17 6:21 p.m., Lucas Stach wrote:
> > That's one of the issues with implicit sync that explicit may solve: 
> > a single client taking way too much time to render something can 
> > block the whole pipeline up until the display flip. With explicit 
> > sync the compositor can just decide to use the last client buffer if 
> > the latest buffer isn't ready by some deadline.
> 
> FWIW, the compositor can do this with implicit sync as well, by polling
> a dma-buf fd for the buffer. (Currently, it has to poll for writable,
> because waiting for the exclusive fence only isn't enough with amdgpu)

Would be great if we didn't have to make this the recommended uAPI, just
because amdgpu leaks its trickery into the wider world. Polling for read
really should be enough (and I guess Christian gets to fix up amdgpu some
more, at least for anything that has a dma-buf attached, even if it's not
shared with anything outside amdgpu.ko).
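
For reference, the implicit-sync variant of that check looks much the
same, since a dma-buf fd is pollable too: POLLIN waits on the exclusive
(write) fence only, POLLOUT waits on all fences attached to the buffer.
A minimal sketch of the compositor-side deadline check Michel describes;
the helper name and the poll_writable workaround flag are this example's
assumptions:

#include <errno.h>
#include <poll.h>
#include <stdbool.h>

/*
 * Deadline check on a dma-buf fd (implicit sync).
 * POLLIN  waits on the exclusive (write) fence only.
 * POLLOUT waits on all fences on the buffer; passing poll_writable=true
 * is the "poll for writable" workaround mentioned above for amdgpu.
 * Returns true if the buffer became idle before the deadline.
 */
static bool dmabuf_ready_by_deadline(int dmabuf_fd, bool poll_writable,
                                     int timeout_ms)
{
        struct pollfd pfd = {
                .fd = dmabuf_fd,
                .events = poll_writable ? POLLOUT : POLLIN,
        };
        int ret;

        do {
                ret = poll(&pfd, 1, timeout_ms);
        } while (ret < 0 && errno == EINTR);

        return ret == 1 && (pfd.revents & pfd.events);
}
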
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: Plumbing explicit synchronization through the Linux ecosystem
  2020-03-17 16:27               ` Jason Ekstrand
@ 2020-03-19 10:42                 ` Daniel Vetter
  0 siblings, 0 replies; 101+ messages in thread
From: Daniel Vetter @ 2020-03-19 10:42 UTC (permalink / raw)
  To: Jason Ekstrand
  Cc: Nicolas Dufresne, Laurent Pinchart, ML mesa-dev,
	Discussion of the development of and with GStreamer,
	wayland-devel @ lists . freedesktop . org, xorg-devel,
	Maling list - DRI developers, linux-media, Dave Airlie,
	Daniel Vetter, Bas Nieuwenhuizen, Daniel Stone

On Tue, Mar 17, 2020 at 11:27:28AM -0500, Jason Ekstrand wrote:
> On Tue, Mar 17, 2020 at 10:33 AM Nicolas Dufresne <nicolas@ndufresne.ca> wrote:
> >
> > Le lundi 16 mars 2020 à 23:15 +0200, Laurent Pinchart a écrit :
> > > Hi Jason,
> > >
> > > On Mon, Mar 16, 2020 at 10:06:07AM -0500, Jason Ekstrand wrote:
> > > > On Mon, Mar 16, 2020 at 5:20 AM Laurent Pinchart wrote:
> > > > > Another issue is that V4L2 doesn't offer any guarantee on job ordering.
> > > > > When you queue multiple buffers for camera capture for instance, you
> > > > > don't know until capture complete in which buffer the frame has been
> > > > > captured.
> > > >
> > > > Is this a Kernel UAPI issue?  Surely the kernel driver knows at the
> > > > start of frame capture which buffer it's getting written into.  I
> > > > would think that the kernel APIs could be adjusted (if we find good
> > > > reason to do so!) such that they return earlier and return a (buffer,
> > > > fence) pair.  Am I missing something fundamental about video here?
> > >
> > > For cameras I believe we could do that, yes. I was pointing out the
> > > issues caused by the current API. For video decoders I'll let Nicolas
> > > answer the question; he's way more knowledgeable than I am on that
> > > topic.
> >
> > Right now, there is simply no uAPI for supporting asynchronous error
> > reporting when fences are involved. That is true for both cameras and
> > CODECs. It's likely what all the previous attempts were missing; I
> > don't know enough myself to suggest something.
> >
> > Now, why stateless video decoders are special is another subject. In
> > CODECs, the decoding order and the presentation order may differ. For
> > the stateless kind of CODEC, a bitstream is passed to the HW. We don't
> > know if this bitstream is fully valid, since it is being parsed and
> > validated by the firmware. It's also the firmware's job to decide
> > which buffer should be presented first.
> >
> > In most firmware interfaces, that information is communicated back all
> > at once when the frame is ready to be presented (which may be quite
> > some time after it was decoded). So indeed, a fence model is not really
> > easy to add unless the firmware was designed with that model in mind.
> 
> Just to be clear, I think we should do whatever makes sense here and
> not try to slam sync_file in when it doesn't make sense just because
> we have it.  The more I read on this thread, the less out-fences from
> video decode sound like they make sense unless we have a really solid
> plan for async error reporting.  It's possible, depending on how many
> processes are involved in the pipeline, that async error reporting
> could help reduce latency a bit if it let the kernel report the error
> directly to the last process in the chain.  However, I'm not convinced
> the potential for userspace programmer error is worth it.  That said,
> I'm happy to leave that up to the actual video experts. (I just do 3D)

dma_fence has an error state which you can set when things went south. The
fence still completes (to guarantee forward progress).

Currently that error code isn't really propagated anywhere (well, i915
IIRC does something like that, since it tracks the dependencies
internally in its scheduler). Definitely not at the dma_fence level,
since we don't track the dependency graph there at all. We might want to
add that; it would at least be possible.

If we track the cascading dma_fence error state in the kernel, I do think
this could work. I'm still not sure whether it's actually a good/useful
idea, though.
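
For the kernel side, the primitive is dma_fence_set_error(), which has to
be called before the fence is signalled; waiters can then inspect the
result with dma_fence_get_status(). A minimal sketch of a driver
completing a failed job (the my_job struct, its hw_status field and the
-EIO choice are made up for illustration; fence init/ops are omitted):

#include <linux/dma-fence.h>

struct my_job {
        struct dma_fence out_fence;     /* assumed already initialised */
        int hw_status;                  /* non-zero if the hw reported a fault */
};

static void my_job_complete(struct my_job *job)
{
        /*
         * Record the failure on the fence *before* signalling it.  The
         * fence still signals, so waiters keep making forward progress;
         * they can call dma_fence_get_status() to see the -EIO.
         */
        if (job->hw_status)
                dma_fence_set_error(&job->out_fence, -EIO);

        dma_fence_signal(&job->out_fence);
}
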
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: Plumbing explicit synchronization through the Linux ecosystem
  2020-03-17 10:01             ` Michel Dänzer
@ 2020-03-19 10:51               ` Daniel Vetter
  0 siblings, 0 replies; 101+ messages in thread
From: Daniel Vetter @ 2020-03-19 10:51 UTC (permalink / raw)
  To: Michel Dänzer
  Cc: Marek Olšák, Daniel Vetter, xorg-devel,
	Maling list - DRI developers,
	wayland-devel @ lists . freedesktop . org,
	Discussion of the development of and with GStreamer,
	Jason Ekstrand, ML mesa-dev, linux-media

On Tue, Mar 17, 2020 at 11:01:57AM +0100, Michel Dänzer wrote:
> On 2020-03-16 7:33 p.m., Marek Olšák wrote:
> > On Mon, Mar 16, 2020 at 5:57 AM Michel Dänzer <michel@daenzer.net> wrote:
> >> On 2020-03-16 4:50 a.m., Marek Olšák wrote:
> >>> The synchronization works because the Mesa driver waits for idle (drains
> >>> the GFX pipeline) at the end of command buffers and there is only 1
> >>> graphics queue, so everything is ordered.
> >>>
> >>> The GFX pipeline runs asynchronously to the command buffer, meaning the
> >>> command buffer only starts draws and doesn't wait for completion. If the
> >>> Mesa driver didn't wait at the end of the command buffer, the command
> >>> buffer would finish and a different process could start execution of its
> >>> own command buffer while shaders of the previous process are still
> >> running.
> >>>
> >>> If the Mesa driver submits a command buffer internally (because it's
> >> full),
> >>> it doesn't wait, so the GFX pipeline doesn't notice that a command buffer
> >>> ended and a new one started.
> >>>
> >>> The waiting at the end of command buffers happens only when the flush is
> >>> external (Swap buffers, glFlush).
> >>>
> >>> It's a performance problem, because the GFX queue is blocked until the
> >> GFX
> >>> pipeline is drained at the end of every frame at least.
> >>>
> >>> So explicit fences for SwapBuffers would help.
> >>
> >> Not sure what difference it would make, since the same thing needs to be
> >> done for explicit fences as well, doesn't it?
> > 
> > No. Explicit fences don't require userspace to wait for idle in the command
> > buffer. Fences are signalled when the last draw is complete and caches are
> > flushed. Before that happens, any command buffer that is not dependent on
> > the fence can start execution. There is never a need for the GPU to be idle
> > if there is enough independent work to do.
> 
> I don't think explicit fences in the context of this discussion imply
> using that different fence signalling mechanism though. My understanding
> is that the API proposed by Jason allows implicit fences to be used as
> explicit ones and vice versa, so presumably they have to use the same
> signalling mechanism.
> 
> 
> Anyway, maybe the different fence signalling mechanism you describe
> could be used by the amdgpu kernel driver in general, then Mesa could
> drop the waits for idle and get the benefits with implicit sync as well?

Yeah, this is entirely about the programming model visible to userspace.
There shouldn't be any impact on the driver's choice of a top vs. bottom
of the gpu pipeline used for synchronization, that's entirely up to what
your hw/driver/scheduler can pull off.

Doing a full gfx pipeline flush for shared buffers, when your hw can do
better, sounds like an issue to me that's not related to this here at all. It
might be intertwined with amdgpu's special interpretation of dma_resv
fences though, no idea. We might need to revamp all that. But for a
userspace client that does nothing fancy (no multiple render buffer
targets in one bo, or vk style "I write to everything all the time,
perhaps" stuff) there should be 0 perf difference between implicit sync
through dma_resv and explicit sync through sync_file/syncobj/dma_fence
directly.

If there is, I'd consider that a bit of a driver bug.
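
As an illustration of what the explicit path can look like from the
client side: with VK_KHR_external_semaphore_fd a Vulkan driver can hand
out a sync_file fd for a queue submission, which the app could then pass
to the compositor instead of relying on the dma_resv fences. A rough
sketch, assuming the semaphore was created with VkExportSemaphoreCreateInfo
requesting SYNC_FD handles and has a signal operation pending from a
vkQueueSubmit:

#include <vulkan/vulkan.h>

/*
 * Export the pending signal operation of "sem" as a sync_file fd.
 * SYNC_FD export has copy transference, so the fd represents only the
 * already-submitted signal and has to be re-exported every frame.
 * Returns the fd, or -1 on failure.
 */
static int export_frame_fence(VkDevice dev, VkSemaphore sem)
{
        PFN_vkGetSemaphoreFdKHR get_fd = (PFN_vkGetSemaphoreFdKHR)
                vkGetDeviceProcAddr(dev, "vkGetSemaphoreFdKHR");
        VkSemaphoreGetFdInfoKHR info = {
                .sType = VK_STRUCTURE_TYPE_SEMAPHORE_GET_FD_INFO_KHR,
                .semaphore = sem,
                .handleType = VK_EXTERNAL_SEMAPHORE_HANDLE_TYPE_SYNC_FD_BIT,
        };
        int fd = -1;

        if (!get_fd || get_fd(dev, &info, &fd) != VK_SUCCESS)
                return -1;

        return fd;
}
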
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [Mesa-dev] Plumbing explicit synchronization through the Linux ecosystem
  2020-03-17 17:12                 ` Jacob Lifshay
@ 2020-03-19 15:45                   ` Adam Jackson
  0 siblings, 0 replies; 101+ messages in thread
From: Adam Jackson @ 2020-03-19 15:45 UTC (permalink / raw)
  To: Jacob Lifshay, Jason Ekstrand
  Cc: Daniel Vetter, xorg-devel, Maling list - DRI developers,
	wayland-devel @ lists . freedesktop . org, Laurent Pinchart,
	Discussion of the development of and with GStreamer, ML mesa-dev,
	Nicolas Dufresne, linux-media

On Tue, 2020-03-17 at 10:12 -0700, Jacob Lifshay wrote:
> One related issue with explicit sync using sync_file: combined
> CPUs/GPUs (where the CPU cores *are* the GPU cores) that do all the
> rendering in userspace (like llvmpipe, but for Vulkan and with extra
> instructions for GPU tasks) but still need to synchronize with other
> drivers/processes would need some way to create an explicit
> fence/semaphore from userspace and later signal it. This seems to
> conflict with the requirement for a sync_file to complete in finite
> time, since the user process could be stopped or killed.

DRI3 (okay, libxshmfence specifically) uses futexes for this. Would
that work for you? IIRC the semantics there are that if the process
dies the futex is treated as triggered, which seems like the only
sensible thing to do.
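
For completeness, a bare-bones sketch of the futex mechanism that
libxshmfence wraps: a 32-bit word in memory shared by both processes,
FUTEX_WAIT while it is still 0, FUTEX_WAKE after storing 1. This is the
raw syscall rather than libxshmfence's actual API, and the "treat the
fence as triggered if the owner dies" policy is exactly the part that
has to be layered on top:

#include <linux/futex.h>
#include <stdatomic.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

/* The "fence" is a 32-bit word in memory shared by both processes,
 * e.g. a small shm fd mapped on both sides. */

static long futex(_Atomic uint32_t *addr, int op, uint32_t val)
{
        return syscall(SYS_futex, addr, op, val, NULL, NULL, 0);
}

/* Producer: mark the fence signalled and wake all waiters. */
static void fence_signal(_Atomic uint32_t *fence)
{
        atomic_store(fence, 1);
        futex(fence, FUTEX_WAKE, INT32_MAX);
}

/* Consumer: block until the fence has been signalled. */
static void fence_wait(_Atomic uint32_t *fence)
{
        while (atomic_load(fence) == 0) {
                /* Only sleeps if *fence is still 0, so no lost wakeups. */
                futex(fence, FUTEX_WAIT, 0);
        }
}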

- ajax


^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: Plumbing explicit synchronization through the Linux ecosystem
  2020-03-19 10:51               ` Daniel Vetter
  (?)
@ 2020-03-19 19:54               ` Marek Olšák
  2020-03-20  8:50                   ` Michel Dänzer
  0 siblings, 1 reply; 101+ messages in thread
From: Marek Olšák @ 2020-03-19 19:54 UTC (permalink / raw)
  To: Daniel Vetter
  Cc: Daniel Vetter, Michel Dänzer, xorg-devel,
	Maling list - DRI developers,
	wayland-devel @ lists . freedesktop . org,
	Discussion of the development of and with GStreamer,
	Jason Ekstrand, ML mesa-dev, linux-media


On Thu., Mar. 19, 2020, 06:51 Daniel Vetter, <daniel@ffwll.ch> wrote:

> On Tue, Mar 17, 2020 at 11:01:57AM +0100, Michel Dänzer wrote:
> > On 2020-03-16 7:33 p.m., Marek Olšák wrote:
> > > On Mon, Mar 16, 2020 at 5:57 AM Michel Dänzer <michel@daenzer.net>
> wrote:
> > >> On 2020-03-16 4:50 a.m., Marek Olšák wrote:
> > >>> The synchronization works because the Mesa driver waits for idle
> (drains
> > >>> the GFX pipeline) at the end of command buffers and there is only 1
> > >>> graphics queue, so everything is ordered.
> > >>>
> > >>> The GFX pipeline runs asynchronously to the command buffer, meaning
> the
> > >>> command buffer only starts draws and doesn't wait for completion. If
> the
> > >>> Mesa driver didn't wait at the end of the command buffer, the command
> > >>> buffer would finish and a different process could start execution of
> its
> > >>> own command buffer while shaders of the previous process are still
> > >> running.
> > >>>
> > >>> If the Mesa driver submits a command buffer internally (because it's
> > >> full),
> > >>> it doesn't wait, so the GFX pipeline doesn't notice that a command
> buffer
> > >>> ended and a new one started.
> > >>>
> > >>> The waiting at the end of command buffers happens only when the
> flush is
> > >>> external (Swap buffers, glFlush).
> > >>>
> > >>> It's a performance problem, because the GFX queue is blocked until
> the
> > >> GFX
> > >>> pipeline is drained at the end of every frame at least.
> > >>>
> > >>> So explicit fences for SwapBuffers would help.
> > >>
> > >> Not sure what difference it would make, since the same thing needs to
> be
> > >> done for explicit fences as well, doesn't it?
> > >
> > > No. Explicit fences don't require userspace to wait for idle in the
> command
> > > buffer. Fences are signalled when the last draw is complete and caches
> are
> > > flushed. Before that happens, any command buffer that is not dependent
> on
> > > the fence can start execution. There is never a need for the GPU to be
> idle
> > > if there is enough independent work to do.
> >
> > I don't think explicit fences in the context of this discussion imply
> > using that different fence signalling mechanism though. My understanding
> > is that the API proposed by Jason allows implicit fences to be used as
> > explicit ones and vice versa, so presumably they have to use the same
> > signalling mechanism.
> >
> >
> > Anyway, maybe the different fence signalling mechanism you describe
> > could be used by the amdgpu kernel driver in general, then Mesa could
> > drop the waits for idle and get the benefits with implicit sync as well?
>
> Yeah, this is entirely about the programming model visible to userspace.
> There shouldn't be any impact on the driver's choice of a top vs. bottom
> of the gpu pipeline used for synchronization, that's entirely up to what
> your hw/driver/scheduler can pull off.
>
> Doing a full gfx pipeline flush for shared buffers, when your hw can do
> better, sounds like an issue to me that's not related to this here at all. It
> might be intertwined with amdgpu's special interpretation of dma_resv
> fences though, no idea. We might need to revamp all that. But for a
> userspace client that does nothing fancy (no multiple render buffer
> targets in one bo, or vk style "I write to everything all the time,
> perhaps" stuff) there should be 0 perf difference between implicit sync
> through dma_resv and explicit sync through sync_file/syncobj/dma_fence
> directly.
>
> If there is, I'd consider that a bit of a driver bug.
>

Last time I checked, there was no fence sync in GNOME Shell and Compiz
after an app passes a buffer to them. So drivers have to invent hacks to
work around that, which decreases performance. It's not a driver bug.

Implicit sync really means that apps and compositors don't sync, so the
driver has to guess when it should sync.

Marek


> -Daniel
> --
> Daniel Vetter
> Software Engineer, Intel Corporation
> http://blog.ffwll.ch
>

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: Plumbing explicit synchronization through the Linux ecosystem
  2020-03-19 19:54               ` Marek Olšák
@ 2020-03-20  8:50                   ` Michel Dänzer
  0 siblings, 0 replies; 101+ messages in thread
From: Michel Dänzer @ 2020-03-20  8:50 UTC (permalink / raw)
  To: Marek Olšák, Daniel Vetter
  Cc: Daniel Vetter, xorg-devel, Maling list - DRI developers,
	wayland-devel @ lists . freedesktop . org,
	Discussion of the development of and with GStreamer,
	Jason Ekstrand, ML mesa-dev, linux-media

On 2020-03-19 8:54 p.m., Marek Olšák wrote:
> On Thu., Mar. 19, 2020, 06:51 Daniel Vetter, <daniel@ffwll.ch>
> wrote:
>> 
>> Yeah, this is entirely about the programming model visible to
>> userspace. There shouldn't be any impact on the driver's choice of
>> a top vs. bottom of the gpu pipeline used for synchronization,
>> that's entirely up to what your hw/driver/scheduler can pull
>> off.
>> 
>> Doing a full gfx pipeline flush for shared buffers, when your hw
>> can do better, sounds like an issue to me that's not related to this
>> here at all. It might be intertwined with amdgpu's special
>> interpretation of dma_resv fences though, no idea. We might need to
>> revamp all that. But for a userspace client that does nothing fancy
>> (no multiple render buffer targets in one bo, or vk style "I write
>> to everything all the time, perhaps" stuff) there should be 0 perf
>> difference between implicit sync through dma_resv and explicit sync
>> through sync_file/syncobj/dma_fence directly.
>> 
>> If there is, I'd consider that a bit of a driver bug.
> 
> Last time I checked, there was no fence sync in gnome shell and
> compiz after an app passes a buffer to it.

They are not required (though encouraged) to do that.


> So drivers have to invent hacks to work around it and decrease
> performance. It's not a driver bug.
> 
> Implicit sync really means that apps and compositors don't sync, so
> the driver has to guess when it should sync.

Making implicit sync work correctly is ultimately the kernel driver's
responsibility. It sounds like radeonsi is having to work around the
amdgpu/radeon kernel driver(s) not fully living up to this responsibility.


-- 
Earthling Michel Dänzer               |               https://redhat.com
Libre software enthusiast             |             Mesa and X developer

^ permalink raw reply	[flat|nested] 101+ messages in thread

end of thread, other threads:[~2020-03-20  8:50 UTC | newest]

Thread overview: 101+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-03-11 17:31 Plumbing explicit synchronization through the Linux ecosystem Jason Ekstrand
2020-03-11 17:31 ` Jason Ekstrand
2020-03-11 19:21 ` Jason Ekstrand
2020-03-11 19:21   ` Jason Ekstrand
2020-03-11 20:18   ` Nicolas Dufresne
2020-03-11 20:18     ` Nicolas Dufresne
2020-03-16 10:20     ` Laurent Pinchart
2020-03-16 10:20       ` Laurent Pinchart
2020-03-16 12:55       ` Tomek Bury
2020-03-16 13:01         ` Laurent Pinchart
2020-03-16 13:01           ` Laurent Pinchart
2020-03-16 13:34           ` Tomek Bury
2020-03-16 13:34             ` Tomek Bury
2020-03-16 14:19         ` Daniel Stone
2020-03-16 14:19           ` Daniel Stone
2020-03-16 15:33           ` Tomek Bury
2020-03-16 15:33             ` Tomek Bury
2020-03-16 16:03             ` Tomek Bury
2020-03-16 16:03               ` Tomek Bury
2020-03-16 16:04             ` Jason Ekstrand
2020-03-16 16:04               ` Jason Ekstrand
2020-03-17  8:01               ` Simon Ser
2020-03-17  8:01                 ` Simon Ser
2020-03-17 14:38                 ` Jason Ekstrand
2020-03-17 14:38                   ` Jason Ekstrand
2020-03-16 16:04             ` Daniel Stone
2020-03-16 16:04               ` Daniel Stone
2020-03-16 17:11               ` Tomek Bury
2020-03-16 17:11                 ` Tomek Bury
2020-03-16 15:06       ` Jason Ekstrand
2020-03-16 15:06         ` Jason Ekstrand
2020-03-16 21:15         ` Laurent Pinchart
2020-03-16 21:15           ` Laurent Pinchart
2020-03-16 22:02           ` Jason Ekstrand
2020-03-16 22:02             ` Jason Ekstrand
2020-03-17 15:33           ` Nicolas Dufresne
2020-03-17 15:33             ` Nicolas Dufresne
2020-03-17 16:27             ` Jason Ekstrand
2020-03-17 16:27               ` Jason Ekstrand
2020-03-17 17:12               ` [Mesa-dev] " Jacob Lifshay
2020-03-17 17:12                 ` Jacob Lifshay
2020-03-17 17:18                 ` Jason Ekstrand
2020-03-17 17:18                   ` Jason Ekstrand
2020-03-19 10:34                   ` Daniel Vetter
2020-03-19 10:34                     ` Daniel Vetter
2020-03-17 17:21                 ` Lucas Stach
2020-03-17 17:21                   ` Lucas Stach
2020-03-17 17:59                   ` Jacob Lifshay
2020-03-17 17:59                     ` Jacob Lifshay
2020-03-17 18:14                     ` Lucas Stach
2020-03-17 18:14                       ` Lucas Stach
2020-03-18  0:16                       ` Jacob Lifshay
2020-03-18  0:16                         ` Jacob Lifshay
2020-03-18  2:08                         ` Jason Ekstrand
2020-03-18  2:08                           ` Jason Ekstrand
2020-03-18  5:20                           ` Jacob Lifshay
2020-03-18  5:20                             ` Jacob Lifshay
2020-03-18  6:34                             ` Jason Ekstrand
2020-03-18  6:34                               ` Jason Ekstrand
2020-03-18  7:27                               ` Jacob Lifshay
2020-03-18  7:27                                 ` Jacob Lifshay
2020-03-18 10:05                   ` Michel Dänzer
2020-03-18 10:05                     ` Michel Dänzer
2020-03-18 13:54                     ` Nicolas Dufresne
2020-03-18 13:54                       ` Nicolas Dufresne
2020-03-19 10:37                     ` Daniel Vetter
2020-03-19 10:37                       ` Daniel Vetter
2020-03-19 15:45                 ` Adam Jackson
2020-03-19 15:45                   ` Adam Jackson
2020-03-17 18:21               ` Nicolas Dufresne
2020-03-17 18:21                 ` Nicolas Dufresne
2020-03-19 10:42               ` Daniel Vetter
2020-03-19 10:42                 ` Daniel Vetter
2020-03-17 17:34             ` [Mesa-dev] " Lucas Stach
2020-03-17 17:34               ` Lucas Stach
2020-03-16 23:41   ` Roman Gilg
2020-03-16 23:41     ` Roman Gilg
2020-03-17  3:37     ` Jason Ekstrand
2020-03-17  3:37       ` Jason Ekstrand
2020-03-17  7:53       ` Jonas Ådahl
2020-03-17  7:53         ` Jonas Ådahl
2020-03-11 23:02 ` Adam Jackson
2020-03-11 23:02   ` Adam Jackson
2020-03-12 15:46   ` Jason Ekstrand
2020-03-12 15:46     ` Jason Ekstrand
2020-03-13  1:37 ` Alexander E. Patrakov
2020-03-13  1:37   ` Alexander E. Patrakov
2020-03-14  2:02 ` [Mesa-dev] " Marek Olšák
2020-03-16  2:49   ` Jason Ekstrand
2020-03-16  3:50     ` Marek Olšák
2020-03-16  9:57       ` Michel Dänzer
2020-03-16  9:57         ` Michel Dänzer
2020-03-16 18:33         ` Marek Olšák
2020-03-17 10:01           ` Michel Dänzer
2020-03-17 10:01             ` Michel Dänzer
2020-03-17 17:13             ` Marek Olšák
2020-03-19 10:51             ` Daniel Vetter
2020-03-19 10:51               ` Daniel Vetter
2020-03-19 19:54               ` Marek Olšák
2020-03-20  8:50                 ` Michel Dänzer
2020-03-20  8:50                   ` Michel Dänzer
