* Linux Graphics Next: Userspace submission update
@ 2021-05-27 21:51 Marek Olšák
  2021-05-28 14:41 ` Christian König
  2021-06-01  9:02 ` Michel Dänzer
  0 siblings, 2 replies; 50+ messages in thread
From: Marek Olšák @ 2021-05-27 21:51 UTC (permalink / raw)
  To: dri-devel, ML Mesa-dev, Alex Deucher, Christian König,
	Daniel Vetter, Dave Airlie, Jason Ekstrand, Bas Nieuwenhuizen


Hi,

Since Christian believes that, with some changes there, we can't deadlock
the kernel, we just need to make everything nice for userspace too.
Instead of explaining how it will work, I will explain the cases where
future hardware (and its kernel driver) will break existing userspace in
order to protect everybody from deadlocks. Anything that uses implicit sync
will be spared, so X and Wayland will be fine, assuming they don't
import/export fences. Those use cases that do import/export fences might or
might not work, depending on how the fences are used.

One of the necessities is that all fences will become future fences. The
semantics of imported/exported fences will change completely and will have
new restrictions on the usage. The restrictions are:


1) Android sync files will be impossible to support, so won't be supported.
(they don't allow future fences)


2) Implicit sync and explicit sync will be mutually exclusive between
processes. A process can either use one or the other, but not both. This is
meant to prevent a deadlock condition with future fences where any process
can malevolently deadlock execution of any other process, even execution of
a higher-privileged process. The kernel will impose the following
restrictions to protect against the deadlock:

a) a process with an implicitly-sync'd imported/exported buffer can't
import/export a fence from/to another process
b) a process with an imported/exported fence can't import/export an
implicitly-sync'd buffer from/to another process
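A minimal sketch of how the kernel could track and enforce these two rules
per process (all names here are hypothetical; a real driver would hang this
state off the DRM file descriptor):

```c
#include <stdbool.h>
#include <errno.h>

/* Hypothetical per-process state; none of these names exist today. */
struct drm_proc_state {
    bool uses_implicit_sync;   /* imported/exported an implicitly-sync'd BO */
    bool uses_explicit_fences; /* imported/exported a fence */
};

/* Rule (a): implicit-sync users may not import/export fences. */
static int check_fence_transfer(struct drm_proc_state *p)
{
    if (p->uses_implicit_sync)
        return -EPERM;
    p->uses_explicit_fences = true;
    return 0;
}

/* Rule (b): fence users may not import/export implicitly-sync'd BOs. */
static int check_implicit_bo_transfer(struct drm_proc_state *p)
{
    if (p->uses_explicit_fences)
        return -EPERM;
    p->uses_implicit_sync = true;
    return 0;
}
```

Whichever kind of sharing a process does first locks it out of the other.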

Alternative: A higher-privileged process could enforce both restrictions
instead of the kernel to protect itself from the deadlock, but this would
be a can of worms for existing userspace. It would be better if the kernel
just broke unsafe userspace on future hw, just like sync files.

If both implicit and explicit sync are allowed to occur simultaneously,
sending any process a future fence that will never signal will deadlock
that process after it acquires the implicit sync lock. That lock is a
sequence number that the process is required to write to memory, raising an
interrupt from the GPU, within a finite time. This is how the deadlock can
happen:

* The process gets sequence number N from the kernel for an
implicitly-sync'd buffer.
* The process inserts (into the GPU user-mapped queue) a wait for sequence
number N-1.
* The process inserts a wait for a fence, but it doesn't know that it will
never signal ==> deadlock.
...
* The process inserts a command to write sequence number N to a
predetermined memory location. (which will make the buffer idle and send an
interrupt to the kernel)
...
* The kernel will terminate the process because it has never received the
interrupt. (i.e. a less-privileged process just killed a more-privileged
process)

It's the interrupt for implicit sync that never arrived that caused the
termination, and the only way another process can cause it is by sending a
fence that will never signal. Thus, importing/exporting fences from/to
other processes can't be allowed simultaneously with implicit sync.
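The poisoned queue can be modeled as a toy interpreter (a schematic only;
the opcodes and packet format are invented for illustration):

```c
#include <stdbool.h>

/* Hypothetical stand-ins for real GPU queue packets. */
enum cmd_op { WAIT_SEQNO, WAIT_FENCE, WRITE_SEQNO };
struct cmd { enum cmd_op op; unsigned long long val; };

/* The implicit-sync sequence number the kernel handed out. */
#define SEQNO_N 42ULL

static const struct cmd queue[] = {
    { WAIT_SEQNO,  SEQNO_N - 1 }, /* wait for previous users of the buffer */
    { WAIT_FENCE,  0           }, /* imported fence that never signals */
    { WRITE_SEQNO, SEQNO_N     }, /* never reached, so no idle interrupt */
};

/* Returns the index the GPU stalls at, or -1 if the queue drains. */
static int run_queue(const struct cmd *q, int n, bool fence_signaled,
                     unsigned long long cur_seqno)
{
    for (int i = 0; i < n; i++) {
        switch (q[i].op) {
        case WAIT_SEQNO:
            if (cur_seqno < q[i].val)
                return i;
            break;
        case WAIT_FENCE:
            if (!fence_signaled)
                return i;          /* the deadlock from the example */
            break;
        case WRITE_SEQNO:
            cur_seqno = q[i].val;  /* would also raise the idle interrupt */
            break;
        }
    }
    return -1;
}
```

With the fence unsignaled the queue stalls at the WAIT_FENCE entry, so the
WRITE_SEQNO that the kernel is waiting on never executes.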


3) Compositors (and other privileged processes, and display flipping) can't
trust imported/exported fences. They need a timeout recovery mechanism from
the beginning, and the following are some possible solutions to timeouts:

a) use a CPU wait with a small absolute timeout, and display the previous
content on timeout
b) use a GPU wait with a small absolute timeout, and conditional rendering
will choose between the latest content (if signalled) and previous content
(if timed out)

The result would be that the desktop can run close to 60 fps even if an app
runs at 1 fps.
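Option (a) amounts to a few lines in the compositor. A toy sketch, with the
fence and wait primitive stubbed out since the real objects are driver-side:

```c
#include <stdbool.h>

/* Toy stand-ins; real fences and buffers are driver objects. */
struct fence  { bool signaled; };
struct buffer { int id; };

/* Toy wait: a real implementation would block for up to timeout_ns. */
static bool fence_wait(struct fence *f, unsigned long long timeout_ns)
{
    (void)timeout_ns;
    return f->signaled;
}

/* Option (a): CPU wait with a small absolute timeout; on timeout,
 * keep displaying the previous frame instead of stalling the desktop. */
static struct buffer *pick_frame(struct fence *latest_fence,
                                 struct buffer *latest,
                                 struct buffer *previous)
{
    if (fence_wait(latest_fence, 4000000ull /* ~4 ms */))
        return latest;
    return previous;
}
```

Option (b) is the same decision made on the GPU via conditional rendering,
so the CPU never blocks at all.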

*Redefining imported/exported fences and breaking some users/OSs is the
only way to have userspace GPU command submission, and the deadlock example
here is the counterexample proving that there is no other way.*

So, what are the chances this is going to fly with the ecosystem?

Thanks,
Marek


* Re: Linux Graphics Next: Userspace submission update
  2021-05-27 21:51 Linux Graphics Next: Userspace submission update Marek Olšák
@ 2021-05-28 14:41 ` Christian König
  2021-05-28 22:25   ` Marek Olšák
  2021-06-01  9:02 ` Michel Dänzer
  1 sibling, 1 reply; 50+ messages in thread
From: Christian König @ 2021-05-28 14:41 UTC (permalink / raw)
  To: Marek Olšák, dri-devel, ML Mesa-dev, Alex Deucher,
	Daniel Vetter, Dave Airlie, Jason Ekstrand, Bas Nieuwenhuizen


Hi Marek,

well, I don't think that implicit and explicit synchronization need to
be mutually exclusive.

What we should do is to have the ability to transport a synchronization
object with each BO.

Implicit and explicit synchronization then basically become the same, 
they just transport the synchronization object differently.

The biggest problem are the sync_files for Android, since they are 
really not easy to support at all. If Android wants to support user 
queues we would probably have to do some changes there.

Regards,
Christian.

Am 27.05.21 um 23:51 schrieb Marek Olšák:
> [full quote of the original message snipped]



* Re: Linux Graphics Next: Userspace submission update
  2021-05-28 14:41 ` Christian König
@ 2021-05-28 22:25   ` Marek Olšák
  2021-05-29  3:33     ` Marek Olšák
  0 siblings, 1 reply; 50+ messages in thread
From: Marek Olšák @ 2021-05-28 22:25 UTC (permalink / raw)
  To: Christian König; +Cc: Jason Ekstrand, dri-devel, ML Mesa-dev


If both implicit and explicit synchronization are handled the same, then
the kernel won't be able to identify the process that caused an implicit
sync deadlock. The process that is stuck waiting for a fence can be
innocent, and the kernel can't punish it. Likewise, the GPU reset query
that reports which processes are guilty and which are innocent will only
be able to report unknown. Is that OK?

Marek

On Fri, May 28, 2021 at 10:41 AM Christian König <
ckoenig.leichtzumerken@gmail.com> wrote:

> Hi Marek,
>
> well, I don't think that implicit and explicit synchronization need to be
> mutually exclusive.
>
> What we should do is to have the ability to transport a synchronization
> object with each BO.
>
> Implicit and explicit synchronization then basically become the same, they
> just transport the synchronization object differently.
>
> The biggest problem are the sync_files for Android, since they are really
> not easy to support at all. If Android wants to support user queues we
> would probably have to do some changes there.
>
> Regards,
> Christian.
>
> Am 27.05.21 um 23:51 schrieb Marek Olšák:
> > [full quote of the original message snipped]


* Re: Linux Graphics Next: Userspace submission update
  2021-05-28 22:25   ` Marek Olšák
@ 2021-05-29  3:33     ` Marek Olšák
  2021-05-31  8:25       ` Christian König
  0 siblings, 1 reply; 50+ messages in thread
From: Marek Olšák @ 2021-05-29  3:33 UTC (permalink / raw)
  To: Christian König; +Cc: Jason Ekstrand, dri-devel, ML Mesa-dev


My first email can be ignored except for the sync files. Oh well.

I think I see what you mean, Christian. If we assume that an imported fence
is always read only (the buffer with the sequence number is read only),
only the process that created and exported the fence can signal it. If the
fence is not signaled, the exporting process is guilty. The only thing the
importing process must do when it's about to use the fence as a dependency
is to notify the kernel about it. Thus, the kernel will always know the
dependency graph. Then if the importing process times out, the kernel will
blame any of the processes that passed it a fence that is still unsignaled.
The kernel will blame the process that timed out only if all imported
fences have been signaled. It seems pretty robust.

It's the same with implicit sync except that the buffer with the sequence
number is writable. Any process that has an implicitly-sync'd buffer can
set the sequence number to 0 or UINT64_MAX. 0 will cause a timeout for the
next job, while UINT64_MAX might cause a timeout a little later. The
timeout can be mitigated by the kernel because the kernel knows the
greatest number that should be there, but it's not possible to know which
process is guilty (all processes holding the buffer handle would be
suspects).
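The blame rule described above can be stated as a small decision function
(a sketch with hypothetical names, not kernel code):

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy model of the kernel's dependency bookkeeping for one timed-out job. */
struct dep_fence {
    int exporter_pid;   /* process that created and exported the fence */
    bool signaled;
};

/* On timeout: blame an exporter whose fence is still unsignaled;
 * only if every imported fence signaled is the waiter itself guilty. */
static int blame_on_timeout(int waiter_pid,
                            const struct dep_fence *deps, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (!deps[i].signaled)
            return deps[i].exporter_pid;
    return waiter_pid;
}
```

This only works if importers can merely read the fence memory; once anyone
can overwrite the sequence number, the exporter list is no longer the full
suspect set.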

Marek

On Fri, May 28, 2021 at 6:25 PM Marek Olšák <maraeo@gmail.com> wrote:

> [quoted messages snipped]


* Re: Linux Graphics Next: Userspace submission update
  2021-05-29  3:33     ` Marek Olšák
@ 2021-05-31  8:25       ` Christian König
  0 siblings, 0 replies; 50+ messages in thread
From: Christian König @ 2021-05-31  8:25 UTC (permalink / raw)
  To: Marek Olšák; +Cc: Jason Ekstrand, dri-devel, ML Mesa-dev


Yes, that's exactly my thinking, and it's also the reason why I'm
pondering so hard on the requirement that the memory for shared user
fences should not be directly modifiable by userspace.

Christian.

Am 29.05.21 um 05:33 schrieb Marek Olšák:
> My first email can be ignored except for the sync files. Oh well.
>
> I think I see what you mean, Christian. If we assume that an imported 
> fence is always read only (the buffer with the sequence number is read 
> only), only the process that created and exported the fence can signal 
> it. If the fence is not signaled, the exporting process is guilty. The 
> only thing the importing process must do when it's about to use the 
> fence as a dependency is to notify the kernel about it. Thus, the 
> kernel will always know the dependency graph. Then if the importing 
> process times out, the kernel will blame any of the processes that 
> passed it a fence that is still unsignaled. The kernel will blame the 
> process that timed out only if all imported fences have been signaled. 
> It seems pretty robust.
>
> It's the same with implicit sync except that the buffer with the 
> sequence number is writable. Any process that has an implicitly-sync'd 
> buffer can set the sequence number to 0 or UINT64_MAX. 0 will cause a 
> timeout for the next job, while UINT64_MAX might cause a timeout a 
> little later. The timeout can be mitigated by the kernel because the 
> kernel knows the greatest number that should be there, but it's not 
> possible to know which process is guilty (all processes holding the 
> buffer handle would be suspects).
>
> Marek
>
> [earlier messages snipped]



* Re: Linux Graphics Next: Userspace submission update
  2021-05-27 21:51 Linux Graphics Next: Userspace submission update Marek Olšák
  2021-05-28 14:41 ` Christian König
@ 2021-06-01  9:02 ` Michel Dänzer
  2021-06-01 10:21   ` Christian König
  1 sibling, 1 reply; 50+ messages in thread
From: Michel Dänzer @ 2021-06-01  9:02 UTC (permalink / raw)
  To: Marek Olšák, dri-devel, ML Mesa-dev, Alex Deucher,
	Christian König, Daniel Vetter, Dave Airlie, Jason Ekstrand,
	Bas Nieuwenhuizen

On 2021-05-27 11:51 p.m., Marek Olšák wrote:
> 
> 3) Compositors (and other privileged processes, and display flipping) can't trust imported/exported fences. They need a timeout recovery mechanism from the beginning, and the following are some possible solutions to timeouts:
> 
> a) use a CPU wait with a small absolute timeout, and display the previous content on timeout
> b) use a GPU wait with a small absolute timeout, and conditional rendering will choose between the latest content (if signalled) and previous content (if timed out)
> 
> The result would be that the desktop can run close to 60 fps even if an app runs at 1 fps.

FWIW, this is working with
https://gitlab.gnome.org/GNOME/mutter/-/merge_requests/1880 , even with implicit sync (on current Intel GPUs; amdgpu/radeonsi would need to provide the same dma-buf poll semantics as other drivers and high priority GFX contexts via EGL_IMG_context_priority which can preempt lower priority ones).


-- 
Earthling Michel Dänzer               |               https://redhat.com
Libre software enthusiast             |             Mesa and X developer


* Re: Linux Graphics Next: Userspace submission update
  2021-06-01  9:02 ` Michel Dänzer
@ 2021-06-01 10:21   ` Christian König
  2021-06-01 10:49     ` Michel Dänzer
  0 siblings, 1 reply; 50+ messages in thread
From: Christian König @ 2021-06-01 10:21 UTC (permalink / raw)
  To: Michel Dänzer, Marek Olšák, dri-devel,
	ML Mesa-dev, Alex Deucher, Daniel Vetter, Dave Airlie,
	Jason Ekstrand, Bas Nieuwenhuizen

Am 01.06.21 um 11:02 schrieb Michel Dänzer:
> On 2021-05-27 11:51 p.m., Marek Olšák wrote:
>> 3) Compositors (and other privileged processes, and display flipping) can't trust imported/exported fences. They need a timeout recovery mechanism from the beginning, and the following are some possible solutions to timeouts:
>>
>> a) use a CPU wait with a small absolute timeout, and display the previous content on timeout
>> b) use a GPU wait with a small absolute timeout, and conditional rendering will choose between the latest content (if signalled) and previous content (if timed out)
>>
>> The result would be that the desktop can run close to 60 fps even if an app runs at 1 fps.
> FWIW, this is working with
> https://gitlab.gnome.org/GNOME/mutter/-/merge_requests/1880 , even with implicit sync (on current Intel GPUs; amdgpu/radeonsi would need to provide the same dma-buf poll semantics as other drivers and high priority GFX contexts via EGL_IMG_context_priority which can preempt lower priority ones).

Yeah, that is really nice to have.

One question is if you wait on the CPU or the GPU for the new surface to 
become available? The former is a bit bad for latency and power management.

Another question is if that is sufficient as security for the display 
server or if we need further handling down the road? I mean essentially 
we are moving the reliability problem into the display server.

Regards,
Christian.


* Re: Linux Graphics Next: Userspace submission update
  2021-06-01 10:21   ` Christian König
@ 2021-06-01 10:49     ` Michel Dänzer
  2021-06-01 12:10       ` [Mesa-dev] " Christian König
  2021-06-02  8:09       ` Michel Dänzer
  0 siblings, 2 replies; 50+ messages in thread
From: Michel Dänzer @ 2021-06-01 10:49 UTC (permalink / raw)
  To: Christian König, Marek Olšák, dri-devel,
	ML Mesa-dev, Alex Deucher, Daniel Vetter, Dave Airlie,
	Jason Ekstrand, Bas Nieuwenhuizen

On 2021-06-01 12:21 p.m., Christian König wrote:
> Am 01.06.21 um 11:02 schrieb Michel Dänzer:
>> On 2021-05-27 11:51 p.m., Marek Olšák wrote:
>>> 3) Compositors (and other privileged processes, and display flipping) can't trust imported/exported fences. They need a timeout recovery mechanism from the beginning, and the following are some possible solutions to timeouts:
>>>
>>> a) use a CPU wait with a small absolute timeout, and display the previous content on timeout
>>> b) use a GPU wait with a small absolute timeout, and conditional rendering will choose between the latest content (if signalled) and previous content (if timed out)
>>>
>>> The result would be that the desktop can run close to 60 fps even if an app runs at 1 fps.
>> FWIW, this is working with
>> https://gitlab.gnome.org/GNOME/mutter/-/merge_requests/1880 , even with implicit sync (on current Intel GPUs; amdgpu/radeonsi would need to provide the same dma-buf poll semantics as other drivers and high priority GFX contexts via EGL_IMG_context_priority which can preempt lower priority ones).
> 
> Yeah, that is really nice to have.
> 
> One question is if you wait on the CPU or the GPU for the new surface to become available?

It's based on polling dma-buf fds, i.e. CPU.
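The CPU wait Michel describes can be sketched in a few lines. This is a minimal illustration of timeout-recovery option (a), assuming a dma-buf (or sync-file) fd whose POLLIN readiness roughly tracks the implicit write fence; the helper name is made up and error handling is simplified:

```c
/* Sketch of option (a): a CPU wait with a small timeout on a dma-buf
 * fd.  dma-buf fds support poll(2); waiting for POLLIN corresponds
 * (roughly) to waiting for the implicit write fence, i.e. "rendering
 * finished".  Not real compositor code. */
#include <assert.h>
#include <poll.h>
#include <unistd.h>

/* Returns 1 if the buffer became ready within timeout_ms, 0 on
 * timeout (the caller keeps displaying the previous content),
 * -1 on a poll error. */
static int wait_buffer_ready(int dmabuf_fd, int timeout_ms)
{
    struct pollfd pfd = { .fd = dmabuf_fd, .events = POLLIN };
    int ret = poll(&pfd, 1, timeout_ms);
    if (ret < 0)
        return -1;              /* poll error */
    return ret > 0 ? 1 : 0;     /* 1: fence signalled, 0: timed out */
}
```

On a timeout the compositor would simply re-commit the previous buffer for that refresh cycle, which is exactly the recovery behaviour discussed above.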

> The former is a bit bad for latency and power management.

There isn't a choice for Wayland compositors in general, since there can be arbitrary other state which needs to be applied atomically together with the new buffer. (Though in theory, a compositor might get fancy and special-case surface commits which can be handled by waiting on the GPU)

Latency is largely a matter of scheduling in the compositor. The latency incurred by the compositor shouldn't have to be more than single-digit milliseconds. (I've seen total latency from when the client starts processing a (static) frame to when it starts being scanned out as low as ~6 ms with https://gitlab.gnome.org/GNOME/mutter/-/merge_requests/1620, lower than typical with Xorg)


> Another question is if that is sufficient as security for the display server or if we need further handling down the road? I mean essentially we are moving the reliability problem into the display server.

Good question. This should generally protect the display server from freezing due to client fences never signalling, but there might still be corner cases.


-- 
Earthling Michel Dänzer               |               https://redhat.com
Libre software enthusiast             |             Mesa and X developer


* Re: [Mesa-dev] Linux Graphics Next: Userspace submission update
  2021-06-01 10:49     ` Michel Dänzer
@ 2021-06-01 12:10       ` Christian König
  2021-06-01 12:30         ` Daniel Vetter
  2021-06-01 13:18         ` Michel Dänzer
  2021-06-02  8:09       ` Michel Dänzer
  1 sibling, 2 replies; 50+ messages in thread
From: Christian König @ 2021-06-01 12:10 UTC (permalink / raw)
  To: Michel Dänzer, Marek Olšák, dri-devel,
	ML Mesa-dev, Alex Deucher, Daniel Vetter, Dave Airlie,
	Jason Ekstrand, Bas Nieuwenhuizen

Am 01.06.21 um 12:49 schrieb Michel Dänzer:
> On 2021-06-01 12:21 p.m., Christian König wrote:
>> Am 01.06.21 um 11:02 schrieb Michel Dänzer:
>>> On 2021-05-27 11:51 p.m., Marek Olšák wrote:
>>>> 3) Compositors (and other privileged processes, and display flipping) can't trust imported/exported fences. They need a timeout recovery mechanism from the beginning, and the following are some possible solutions to timeouts:
>>>>
>>>> a) use a CPU wait with a small absolute timeout, and display the previous content on timeout
>>>> b) use a GPU wait with a small absolute timeout, and conditional rendering will choose between the latest content (if signalled) and previous content (if timed out)
>>>>
>>>> The result would be that the desktop can run close to 60 fps even if an app runs at 1 fps.
>>> FWIW, this is working with
>>> https://gitlab.gnome.org/GNOME/mutter/-/merge_requests/1880 , even with implicit sync (on current Intel GPUs; amdgpu/radeonsi would need to provide the same dma-buf poll semantics as other drivers and high priority GFX contexts via EGL_IMG_context_priority which can preempt lower priority ones).
>> Yeah, that is really nice to have.
>>
>> One question is if you wait on the CPU or the GPU for the new surface to become available?
> It's based on polling dma-buf fds, i.e. CPU.
>
>> The former is a bit bad for latency and power management.
> There isn't a choice for Wayland compositors in general, since there can be arbitrary other state which needs to be applied atomically together with the new buffer. (Though in theory, a compositor might get fancy and special-case surface commits which can be handled by waiting on the GPU)
>
> Latency is largely a matter of scheduling in the compositor. The latency incurred by the compositor shouldn't have to be more than single-digit milliseconds. (I've seen total latency from when the client starts processing a (static) frame to when it starts being scanned out as low as ~6 ms with https://gitlab.gnome.org/GNOME/mutter/-/merge_requests/1620, lower than typical with Xorg)

Well, let me describe it like this:

We have use cases that require a guaranteed 144 Hz refresh rate. That 
essentially means the client application needs to be able to spit out 
one frame/window content every ~6.9ms. That's tough, but doable.

When you now add 6ms of latency in the compositor, the client 
application has only 0.9ms left for its frame, which is basically 
impossible.

For the user fence handling, the display engine will learn to read 
sequence numbers from memory and decide on its own whether the old 
frame or the new one is scanned out, to get the latency there as low 
as possible.
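The frame selection the display engine would perform can be modelled in a few lines. This is purely a conceptual sketch of the semantics, with invented names and types; on real hardware the decision would happen inside the display engine at scanout time, not in CPU code:

```c
/* Conceptual model: at vblank, the display engine compares a sequence
 * number in memory (written by the GPU when the new frame completes)
 * against the value expected for the flip, and picks the frame
 * accordingly.  Illustration only. */
#include <assert.h>
#include <stdint.h>

struct frame { int id; };   /* stand-in for a scanout buffer */

static const struct frame *select_scanout(const uint64_t *seqno_mem,
                                          uint64_t wanted_seqno,
                                          const struct frame *new_frame,
                                          const struct frame *old_frame)
{
    /* If the user fence value has reached wanted_seqno, the new frame
     * is complete; otherwise keep scanning out the old one. */
    return (*seqno_mem >= wanted_seqno) ? new_frame : old_frame;
}
```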

>> Another question is if that is sufficient as security for the display server or if we need further handling down the road? I mean essentially we are moving the reliability problem into the display server.
> Good question. This should generally protect the display server from freezing due to client fences never signalling, but there might still be corner cases.
>
>



* Re: [Mesa-dev] Linux Graphics Next: Userspace submission update
  2021-06-01 12:10       ` [Mesa-dev] " Christian König
@ 2021-06-01 12:30         ` Daniel Vetter
  2021-06-01 12:51           ` Christian König
  2021-06-01 13:18         ` Michel Dänzer
  1 sibling, 1 reply; 50+ messages in thread
From: Daniel Vetter @ 2021-06-01 12:30 UTC (permalink / raw)
  To: Christian König
  Cc: Marek Olšák, Michel Dänzer, dri-devel,
	Jason Ekstrand, ML Mesa-dev

On Tue, Jun 1, 2021 at 2:10 PM Christian König
<ckoenig.leichtzumerken@gmail.com> wrote:
>
> Am 01.06.21 um 12:49 schrieb Michel Dänzer:
> > On 2021-06-01 12:21 p.m., Christian König wrote:
> >> Am 01.06.21 um 11:02 schrieb Michel Dänzer:
> >>> On 2021-05-27 11:51 p.m., Marek Olšák wrote:
> >>>> 3) Compositors (and other privileged processes, and display flipping) can't trust imported/exported fences. They need a timeout recovery mechanism from the beginning, and the following are some possible solutions to timeouts:
> >>>>
> >>>> a) use a CPU wait with a small absolute timeout, and display the previous content on timeout
> >>>> b) use a GPU wait with a small absolute timeout, and conditional rendering will choose between the latest content (if signalled) and previous content (if timed out)
> >>>>
> >>>> The result would be that the desktop can run close to 60 fps even if an app runs at 1 fps.
> >>> FWIW, this is working with
> >>> https://gitlab.gnome.org/GNOME/mutter/-/merge_requests/1880 , even with implicit sync (on current Intel GPUs; amdgpu/radeonsi would need to provide the same dma-buf poll semantics as other drivers and high priority GFX contexts via EGL_IMG_context_priority which can preempt lower priority ones).
> >> Yeah, that is really nice to have.
> >>
> >> One question is if you wait on the CPU or the GPU for the new surface to become available?
> > It's based on polling dma-buf fds, i.e. CPU.
> >
> >> The former is a bit bad for latency and power management.
> > There isn't a choice for Wayland compositors in general, since there can be arbitrary other state which needs to be applied atomically together with the new buffer. (Though in theory, a compositor might get fancy and special-case surface commits which can be handled by waiting on the GPU)
> >
> > Latency is largely a matter of scheduling in the compositor. The latency incurred by the compositor shouldn't have to be more than single-digit milliseconds. (I've seen total latency from when the client starts processing a (static) frame to when it starts being scanned out as low as ~6 ms with https://gitlab.gnome.org/GNOME/mutter/-/merge_requests/1620, lower than typical with Xorg)
>
> Well, let me describe it like this:
>
> We have use cases that require a guaranteed 144 Hz refresh rate. That
> essentially means the client application needs to be able to spit out
> one frame/window content every ~6.9ms. That's tough, but doable.
>
> When you now add 6ms of latency in the compositor, the client
> application has only 0.9ms left for its frame, which is basically
> impossible.
>
> For the user fence handling, the display engine will learn to read
> sequence numbers from memory and decide on its own whether the old
> frame or the new one is scanned out, to get the latency there as low
> as possible.

This won't work with implicit sync at all.

If you want to enable this use-case with driver magic and without the
compositor being aware of what's going on, the solution is EGLStreams.
Not sure we want to go there, but it's definitely a lot more feasible
than trying to stuff eglstreams semantics into dma-buf implicit
fencing support in a desperate attempt to not change compositors.

I still think the most reasonable approach here is that we wrap a
dma_fence compat layer/mode over new hw for existing
userspace/compositors. And then enable userspace memory fences and the
fancy new features those allow with a new model that's built for them.
Also even with dma_fence we could implement your model of staying with
the previous buffer (or an older buffer that's already rendered),
but it needs explicit involvement of the compositor. At least without
adding eglstreams fd to the kernel and wiring up all the relevant
extensions.
-Daniel
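The explicit compositor involvement Daniel describes might look roughly like this. The helper names are hypothetical, the fence check is shown as a non-blocking poll(2) on a sync-file/dma-buf fd, and a real compositor would integrate this with its atomic-commit scheduling:

```c
/* Sketch: at its commit deadline, the compositor checks (without
 * blocking) whether the client's fence has signalled, and otherwise
 * falls back to the last buffer known to be fully rendered. */
#include <assert.h>
#include <poll.h>
#include <unistd.h>

/* Non-blocking check (timeout 0) of the fd's "signalled" state. */
static int fence_signalled(int fence_fd)
{
    struct pollfd pfd = { .fd = fence_fd, .events = POLLIN };
    return poll(&pfd, 1, 0) > 0;
}

/* Pick the buffer to commit this refresh cycle: the client's new
 * buffer if its fence already signalled, the previous one otherwise. */
static int choose_commit_buffer(int new_buf, int prev_buf, int fence_fd)
{
    return fence_signalled(fence_fd) ? new_buf : prev_buf;
}
```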

> >> Another question is if that is sufficient as security for the display server or if we need further handling down the road? I mean essentially we are moving the reliability problem into the display server.
> > Good question. This should generally protect the display server from freezing due to client fences never signalling, but there might still be corner cases.
> >
> >
>


-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch


* Re: [Mesa-dev] Linux Graphics Next: Userspace submission update
  2021-06-01 12:30         ` Daniel Vetter
@ 2021-06-01 12:51           ` Christian König
  2021-06-01 13:01             ` Marek Olšák
  2021-06-02  8:57             ` Daniel Stone
  0 siblings, 2 replies; 50+ messages in thread
From: Christian König @ 2021-06-01 12:51 UTC (permalink / raw)
  To: Daniel Vetter
  Cc: Marek Olšák, Michel Dänzer, dri-devel,
	Jason Ekstrand, ML Mesa-dev

Am 01.06.21 um 14:30 schrieb Daniel Vetter:
> On Tue, Jun 1, 2021 at 2:10 PM Christian König
> <ckoenig.leichtzumerken@gmail.com> wrote:
>> Am 01.06.21 um 12:49 schrieb Michel Dänzer:
>>> On 2021-06-01 12:21 p.m., Christian König wrote:
>>>> Am 01.06.21 um 11:02 schrieb Michel Dänzer:
>>>>> On 2021-05-27 11:51 p.m., Marek Olšák wrote:
>>>>>> 3) Compositors (and other privileged processes, and display flipping) can't trust imported/exported fences. They need a timeout recovery mechanism from the beginning, and the following are some possible solutions to timeouts:
>>>>>>
>>>>>> a) use a CPU wait with a small absolute timeout, and display the previous content on timeout
>>>>>> b) use a GPU wait with a small absolute timeout, and conditional rendering will choose between the latest content (if signalled) and previous content (if timed out)
>>>>>>
>>>>>> The result would be that the desktop can run close to 60 fps even if an app runs at 1 fps.
>>>>> FWIW, this is working with
>>>>> https://gitlab.gnome.org/GNOME/mutter/-/merge_requests/1880 , even with implicit sync (on current Intel GPUs; amdgpu/radeonsi would need to provide the same dma-buf poll semantics as other drivers and high priority GFX contexts via EGL_IMG_context_priority which can preempt lower priority ones).
>>>> Yeah, that is really nice to have.
>>>>
>>>> One question is if you wait on the CPU or the GPU for the new surface to become available?
>>> It's based on polling dma-buf fds, i.e. CPU.
>>>
>>>> The former is a bit bad for latency and power management.
>>> There isn't a choice for Wayland compositors in general, since there can be arbitrary other state which needs to be applied atomically together with the new buffer. (Though in theory, a compositor might get fancy and special-case surface commits which can be handled by waiting on the GPU)
>>>
>>> Latency is largely a matter of scheduling in the compositor. The latency incurred by the compositor shouldn't have to be more than single-digit milliseconds. (I've seen total latency from when the client starts processing a (static) frame to when it starts being scanned out as low as ~6 ms with https://gitlab.gnome.org/GNOME/mutter/-/merge_requests/1620, lower than typical with Xorg)
>> Well, let me describe it like this:
>>
>> We have use cases that require a guaranteed 144 Hz refresh rate. That
>> essentially means the client application needs to be able to spit out
>> one frame/window content every ~6.9ms. That's tough, but doable.
>>
>> When you now add 6ms of latency in the compositor, the client
>> application has only 0.9ms left for its frame, which is basically
>> impossible.
>>
>> For the user fence handling, the display engine will learn to read
>> sequence numbers from memory and decide on its own whether the old
>> frame or the new one is scanned out, to get the latency there as low
>> as possible.
> This won't work with implicit sync at all.
>
> If you want to enable this use-case with driver magic and without the
> compositor being aware of what's going on, the solution is EGLStreams.
> Not sure we want to go there, but it's definitely a lot more feasible
> than trying to stuff eglstreams semantics into dma-buf implicit
> fencing support in a desperate attempt to not change compositors.

Well, not changing compositors is certainly not something I would try 
with this use case.

Not changing compositors is more like: ok, we have Ubuntu 20.04 and need 
to support that with the newest hardware generation.

> I still think the most reasonable approach here is that we wrap a
> dma_fence compat layer/mode over new hw for existing
> userspace/compositors. And then enable userspace memory fences and the
> fancy new features those allow with a new model that's built for them.

Yeah, that's basically the same direction I'm heading. Question is how 
to fix all those details.

> Also even with dma_fence we could implement your model of staying with
> the previous buffer (or an older buffer that's already rendered),
> but it needs explicit involvement of the compositor. At least without
> adding eglstreams fd to the kernel and wiring up all the relevant
> extensions.

The question is: do we already have some extension which allows 
different textures to be selected on the fly depending on some state?

E.g. something like: use the new frame if it's available and the old 
frame otherwise.

Whether you then apply this to the standard dma_fence based hardware or 
the new user fence based one is pretty much irrelevant.

Regards,
Christian.

> -Daniel
>
>>>> Another question is if that is sufficient as security for the display server or if we need further handling down the road? I mean essentially we are moving the reliability problem into the display server.
>>> Good question. This should generally protect the display server from freezing due to client fences never signalling, but there might still be corner cases.
>>>
>>>
>



* Re: [Mesa-dev] Linux Graphics Next: Userspace submission update
  2021-06-01 12:51           ` Christian König
@ 2021-06-01 13:01             ` Marek Olšák
  2021-06-01 13:24               ` Michel Dänzer
  2021-06-02  8:57             ` Daniel Stone
  1 sibling, 1 reply; 50+ messages in thread
From: Marek Olšák @ 2021-06-01 13:01 UTC (permalink / raw)
  To: Christian König
  Cc: Jason Ekstrand, Michel Dänzer, dri-devel, ML Mesa-dev

[-- Attachment #1: Type: text/plain, Size: 5545 bytes --]

On Tue., Jun. 1, 2021, 08:51 Christian König, <
ckoenig.leichtzumerken@gmail.com> wrote:

> Am 01.06.21 um 14:30 schrieb Daniel Vetter:
> > On Tue, Jun 1, 2021 at 2:10 PM Christian König
> > <ckoenig.leichtzumerken@gmail.com> wrote:
> >> Am 01.06.21 um 12:49 schrieb Michel Dänzer:
> >>> On 2021-06-01 12:21 p.m., Christian König wrote:
> >>>> Am 01.06.21 um 11:02 schrieb Michel Dänzer:
> >>>>> On 2021-05-27 11:51 p.m., Marek Olšák wrote:
> >>>>>> 3) Compositors (and other privileged processes, and display
> flipping) can't trust imported/exported fences. They need a timeout
> recovery mechanism from the beginning, and the following are some possible
> solutions to timeouts:
> >>>>>>
> >>>>>> a) use a CPU wait with a small absolute timeout, and display the
> previous content on timeout
> >>>>>> b) use a GPU wait with a small absolute timeout, and conditional
> rendering will choose between the latest content (if signalled) and
> previous content (if timed out)
> >>>>>>
> >>>>>> The result would be that the desktop can run close to 60 fps even
> if an app runs at 1 fps.
> >>>>> FWIW, this is working with
> >>>>> https://gitlab.gnome.org/GNOME/mutter/-/merge_requests/1880 , even
> with implicit sync (on current Intel GPUs; amdgpu/radeonsi would need to
> provide the same dma-buf poll semantics as other drivers and high priority
> GFX contexts via EGL_IMG_context_priority which can preempt lower priority
> ones).
> >>>> Yeah, that is really nice to have.
> >>>>
> >>>> One question is if you wait on the CPU or the GPU for the new surface
> to become available?
> >>> It's based on polling dma-buf fds, i.e. CPU.
> >>>
> >>>> The former is a bit bad for latency and power management.
> >>> There isn't a choice for Wayland compositors in general, since there
> can be arbitrary other state which needs to be applied atomically together
> with the new buffer. (Though in theory, a compositor might get fancy and
> special-case surface commits which can be handled by waiting on the GPU)
> >>>
> >>> Latency is largely a matter of scheduling in the compositor. The
> latency incurred by the compositor shouldn't have to be more than
> single-digit milliseconds. (I've seen total latency from when the client
> starts processing a (static) frame to when it starts being scanned out as
> low as ~6 ms with
> https://gitlab.gnome.org/GNOME/mutter/-/merge_requests/1620, lower than
> typical with Xorg)
> >> Well, let me describe it like this:
> >>
> >> We have use cases that require a guaranteed 144 Hz refresh rate. That
> >> essentially means the client application needs to be able to spit out
> >> one frame/window content every ~6.9ms. That's tough, but doable.
> >>
> >> When you now add 6ms of latency in the compositor, the client
> >> application has only 0.9ms left for its frame, which is basically
> >> impossible.
> >>
> >> For the user fence handling, the display engine will learn to read
> >> sequence numbers from memory and decide on its own whether the old
> >> frame or the new one is scanned out, to get the latency there as low
> >> as possible.
> > This won't work with implicit sync at all.
> >
> > If you want to enable this use-case with driver magic and without the
> > compositor being aware of what's going on, the solution is EGLStreams.
> > Not sure we want to go there, but it's definitely a lot more feasible
> > than trying to stuff eglstreams semantics into dma-buf implicit
> > fencing support in a desperate attempt to not change compositors.
>
> Well, not changing compositors is certainly not something I would try
> with this use case.
>
> Not changing compositors is more like: ok, we have Ubuntu 20.04 and need
> to support that with the newest hardware generation.
>
> > I still think the most reasonable approach here is that we wrap a
> > dma_fence compat layer/mode over new hw for existing
> > userspace/compositors. And then enable userspace memory fences and the
> > fancy new features those allow with a new model that's built for them.
>
> Yeah, that's basically the same direction I'm heading. Question is how
> to fix all those details.
>
> > Also even with dma_fence we could implement your model of staying with
> > the previous buffer (or an older buffer that's already rendered),
> > but it needs explicit involvement of the compositor. At least without
> > adding eglstreams fd to the kernel and wiring up all the relevant
> > extensions.
>
> The question is: do we already have some extension which allows
> different textures to be selected on the fly depending on some state?
>

There is no such extension for sync objects, but it can be done with
queries, like occlusion queries. There is also no timeout option, and it
can only do "if" and "if not", but not "if .. else".
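The query route could be sketched with GL conditional rendering. This is an API sketch only, not runnable standalone: it assumes a current GL 4.5 context (for the inverted modes), the draw helpers are hypothetical, and how the query result would be fed by a timed fence wait is left open:

```c
GLuint q;
glGenQueries(1, &q);
/* ... the query result encodes "new content is ready" ... */

/* "if": draw the latest content when the query passed. */
glBeginConditionalRender(q, GL_QUERY_NO_WAIT);
draw_latest_content();               /* hypothetical helper */
glEndConditionalRender();

/* "if not": draw the previous content otherwise, using the GL 4.5
 * inverted modes.  Note there is no atomic "if .. else": these are
 * two separate conditional blocks, which is exactly the limitation
 * described above. */
glBeginConditionalRender(q, GL_QUERY_NO_WAIT_INVERTED);
draw_previous_content();             /* hypothetical helper */
glEndConditionalRender();
```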

Marek



> E.g. something like: use the new frame if it's available and the old
> frame otherwise.
>
> Whether you then apply this to the standard dma_fence based hardware or
> the new user fence based one is pretty much irrelevant.
>
> Regards,
> Christian.
>
> > -Daniel
> >
> >>>> Another question is if that is sufficient as security for the display
> server or if we need further handling down the road? I mean essentially we
> are moving the reliability problem into the display server.
> >>> Good question. This should generally protect the display server from
> freezing due to client fences never signalling, but there might still be
> corner cases.
> >>>
> >>>
> >
>
>



* Re: [Mesa-dev] Linux Graphics Next: Userspace submission update
  2021-06-01 12:10       ` [Mesa-dev] " Christian König
  2021-06-01 12:30         ` Daniel Vetter
@ 2021-06-01 13:18         ` Michel Dänzer
  2021-06-01 17:39           ` Michel Dänzer
  2021-06-01 17:42           ` Daniel Stone
  1 sibling, 2 replies; 50+ messages in thread
From: Michel Dänzer @ 2021-06-01 13:18 UTC (permalink / raw)
  To: Christian König, Marek Olšák, dri-devel,
	ML Mesa-dev, Alex Deucher, Daniel Vetter, Dave Airlie,
	Jason Ekstrand, Bas Nieuwenhuizen

On 2021-06-01 2:10 p.m., Christian König wrote:
> Am 01.06.21 um 12:49 schrieb Michel Dänzer:
>> On 2021-06-01 12:21 p.m., Christian König wrote:
>>> Am 01.06.21 um 11:02 schrieb Michel Dänzer:
>>>> On 2021-05-27 11:51 p.m., Marek Olšák wrote:
>>>>> 3) Compositors (and other privileged processes, and display flipping) can't trust imported/exported fences. They need a timeout recovery mechanism from the beginning, and the following are some possible solutions to timeouts:
>>>>>
>>>>> a) use a CPU wait with a small absolute timeout, and display the previous content on timeout
>>>>> b) use a GPU wait with a small absolute timeout, and conditional rendering will choose between the latest content (if signalled) and previous content (if timed out)
>>>>>
>>>>> The result would be that the desktop can run close to 60 fps even if an app runs at 1 fps.
>>>> FWIW, this is working with
>>>> https://gitlab.gnome.org/GNOME/mutter/-/merge_requests/1880 , even with implicit sync (on current Intel GPUs; amdgpu/radeonsi would need to provide the same dma-buf poll semantics as other drivers and high priority GFX contexts via EGL_IMG_context_priority which can preempt lower priority ones).
>>> Yeah, that is really nice to have.
>>>
>>> One question is if you wait on the CPU or the GPU for the new surface to become available?
>> It's based on polling dma-buf fds, i.e. CPU.
>>
>>> The former is a bit bad for latency and power management.
>> There isn't a choice for Wayland compositors in general, since there can be arbitrary other state which needs to be applied atomically together with the new buffer. (Though in theory, a compositor might get fancy and special-case surface commits which can be handled by waiting on the GPU)
>>
>> Latency is largely a matter of scheduling in the compositor. The latency incurred by the compositor shouldn't have to be more than single-digit milliseconds. (I've seen total latency from when the client starts processing a (static) frame to when it starts being scanned out as low as ~6 ms with https://gitlab.gnome.org/GNOME/mutter/-/merge_requests/1620, lower than typical with Xorg)
> 
> Well let me describe it like this:
> 
> We have use cases that require a guaranteed 144 Hz refresh rate. That essentially means the client application needs to be able to spit out one frame/window content every ~6.9ms. That's tough, but doable.
> 
> When you now add 6ms of latency in the compositor, the client application has only 0.9ms left for its frame, which is basically impossible.

You misunderstood me. 6 ms is the lowest possible end-to-end latency from client to scanout, but the client can start as early as it wants/needs to. It's a trade-off between latency and the risk of missing a scanout cycle.


-- 
Earthling Michel Dänzer               |               https://redhat.com
Libre software enthusiast             |             Mesa and X developer


* Re: [Mesa-dev] Linux Graphics Next: Userspace submission update
  2021-06-01 13:01             ` Marek Olšák
@ 2021-06-01 13:24               ` Michel Dänzer
  0 siblings, 0 replies; 50+ messages in thread
From: Michel Dänzer @ 2021-06-01 13:24 UTC (permalink / raw)
  To: Marek Olšák, Christian König
  Cc: ML Mesa-dev, dri-devel, Jason Ekstrand

On 2021-06-01 3:01 p.m., Marek Olšák wrote:
> 
> 
> On Tue., Jun. 1, 2021, 08:51 Christian König, <ckoenig.leichtzumerken@gmail.com> wrote:
> 
>     Am 01.06.21 um 14:30 schrieb Daniel Vetter:
>     > On Tue, Jun 1, 2021 at 2:10 PM Christian König
>     > <ckoenig.leichtzumerken@gmail.com> wrote:
>     >> Am 01.06.21 um 12:49 schrieb Michel Dänzer:
>     >>> On 2021-06-01 12:21 p.m., Christian König wrote:
>     >>>> Am 01.06.21 um 11:02 schrieb Michel Dänzer:
>     >>>>> On 2021-05-27 11:51 p.m., Marek Olšák wrote:
>     >>>>>> 3) Compositors (and other privileged processes, and display flipping) can't trust imported/exported fences. They need a timeout recovery mechanism from the beginning, and the following are some possible solutions to timeouts:
>     >>>>>>
>     >>>>>> a) use a CPU wait with a small absolute timeout, and display the previous content on timeout
>     >>>>>> b) use a GPU wait with a small absolute timeout, and conditional rendering will choose between the latest content (if signalled) and previous content (if timed out)
>     >>>>>>
>     >>>>>> The result would be that the desktop can run close to 60 fps even if an app runs at 1 fps.
>     >>>>> FWIW, this is working with
>     >>>>> https://gitlab.gnome.org/GNOME/mutter/-/merge_requests/1880 , even with implicit sync (on current Intel GPUs; amdgpu/radeonsi would need to provide the same dma-buf poll semantics as other drivers and high priority GFX contexts via EGL_IMG_context_priority which can preempt lower priority ones).
>     >>>> Yeah, that is really nice to have.
>     >>>>
>     >>>> One question is if you wait on the CPU or the GPU for the new surface to become available?
>     >>> It's based on polling dma-buf fds, i.e. CPU.
>     >>>
>     >>>> The former is a bit bad for latency and power management.
>     >>> There isn't a choice for Wayland compositors in general, since there can be arbitrary other state which needs to be applied atomically together with the new buffer. (Though in theory, a compositor might get fancy and special-case surface commits which can be handled by waiting on the GPU)
>     >>>
>     >>> Latency is largely a matter of scheduling in the compositor. The latency incurred by the compositor shouldn't have to be more than single-digit milliseconds. (I've seen total latency from when the client starts processing a (static) frame to when it starts being scanned out as low as ~6 ms with https://gitlab.gnome.org/GNOME/mutter/-/merge_requests/1620, lower than typical with Xorg)
>     >> Well let me describe it like this:
>     >>
>     >> We have use cases for a guaranteed 144 Hz refresh rate. That
>     >> essentially means that the client application needs to be able to spit
>     >> out one frame/window content every ~6.9 ms. That's tough, but doable.
>     >>
>     >> When you now add 6 ms of latency in the compositor, the client
>     >> application has only 0.9 ms left for its frame, which is basically
>     >> impossible to do.
>     >>
>     >> See, for the user fence handling the display engine will learn to read
>     >> sequence numbers from memory and decide on its own whether the old frame
>     >> or the new one is scanned out, to get the latency there as low as possible.
>     > This won't work with implicit sync at all.
>     >
>     > If you want to enable this use-case with driver magic and without the
>     > compositor being aware of what's going on, the solution is EGLStreams.
>     > Not sure we want to go there, but it's definitely a lot more feasible
>     > than trying to stuff eglstreams semantics into dma-buf implicit
>     > fencing support in a desperate attempt to not change compositors.

EGLStreams are a red herring here. Wayland has atomic state transactions, similar to the atomic KMS API. These semantics could be achieved even with EGLStreams, at least via additional EGL extensions.

Any fancy technique we're discussing here would have to be completely between the Wayland compositor and the kernel, transparent to anything else.


>     > I still think the most reasonable approach here is that we wrap a
>     > dma_fence compat layer/mode over new hw for existing
>     > userspace/compositors. And then enable userspace memory fences and the
>     > fancy new features those allow with a new model that's built for them.
> 
>     Yeah, that's basically the same direction I'm heading. Question is how
>     to fix all those details.
> 
>     > Also even with dma_fence we could implement your model of staying with
>     > the previous buffer (or an older buffer at that's already rendered),
>     > but it needs explicit involvement of the compositor. At least without
>     > adding eglstreams fd to the kernel and wiring up all the relevant
>     > extensions.
> 
>     Question is do we already have some extension which allows different
>     textures to be selected on the fly depending on some state?
> 
> 
> There is no such extension for sync objects, but it can be done with queries, such as occlusion queries. There is also no timeout option, and it can only do "if" and "if not", but not "if .. else".

The Wayland compositor would also need to know which texture the shader ended up sampling from.
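Marek's query-based selection above can be sketched in Python (an illustrative model of the semantics only; the names here are made up, not real GL or Vulkan entry points): since a predicate can only gate a draw on "if" or "if not", the compositor records two conditional draws, and the choice between them is resolved on the GPU.

```python
# Illustrative model: predicated draws can express "if" and "if not",
# but not a single "if .. else", so two conditional draws are issued.
# FenceQuery and composite() are hypothetical names for this sketch.

class FenceQuery:
    """Stands in for a GPU query/predicate tested at draw time."""
    def __init__(self, signalled):
        self.signalled = signalled

def composite(query, latest_tex, previous_tex, framebuffer):
    # Draw 1: "if" -- show the latest content when the fence signalled.
    if query.signalled:
        framebuffer.append(latest_tex)
    # Draw 2: "if not" -- fall back to the previous content on timeout.
    if not query.signalled:
        framebuffer.append(previous_tex)
    # Both draws were recorded; the CPU side never learns which one ran.

fb = []
composite(FenceQuery(signalled=False), "frame_N", "frame_N-1", fb)
print(fb)  # ['frame_N-1']
```

Since the selection happens entirely on the GPU, the CPU side is left not knowing which texture was sampled, which is exactly the remaining problem for the compositor.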


-- 
Earthling Michel Dänzer               |               https://redhat.com
Libre software enthusiast             |             Mesa and X developer

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [Mesa-dev] Linux Graphics Next: Userspace submission update
  2021-06-01 13:18         ` Michel Dänzer
@ 2021-06-01 17:39           ` Michel Dänzer
  2021-06-01 17:42           ` Daniel Stone
  1 sibling, 0 replies; 50+ messages in thread
From: Michel Dänzer @ 2021-06-01 17:39 UTC (permalink / raw)
  To: Christian König, Marek Olšák, dri-devel,
	ML Mesa-dev, Alex Deucher, Daniel Vetter, Dave Airlie,
	Jason Ekstrand, Bas Nieuwenhuizen

On 2021-06-01 3:18 p.m., Michel Dänzer wrote:
> On 2021-06-01 2:10 p.m., Christian König wrote:
>> Am 01.06.21 um 12:49 schrieb Michel Dänzer:
>>> On 2021-06-01 12:21 p.m., Christian König wrote:
>>>> Am 01.06.21 um 11:02 schrieb Michel Dänzer:
>>>>> On 2021-05-27 11:51 p.m., Marek Olšák wrote:
>>>>>> 3) Compositors (and other privileged processes, and display flipping) can't trust imported/exported fences. They need a timeout recovery mechanism from the beginning, and the following are some possible solutions to timeouts:
>>>>>>
>>>>>> a) use a CPU wait with a small absolute timeout, and display the previous content on timeout
>>>>>> b) use a GPU wait with a small absolute timeout, and conditional rendering will choose between the latest content (if signalled) and previous content (if timed out)
>>>>>>
>>>>>> The result would be that the desktop can run close to 60 fps even if an app runs at 1 fps.
>>>>> FWIW, this is working with
>>>>> https://gitlab.gnome.org/GNOME/mutter/-/merge_requests/1880 , even with implicit sync (on current Intel GPUs; amdgpu/radeonsi would need to provide the same dma-buf poll semantics as other drivers and high priority GFX contexts via EGL_IMG_context_priority which can preempt lower priority ones).
>>>> Yeah, that is really nice to have.
>>>>
>>>> One question is if you wait on the CPU or the GPU for the new surface to become available?
>>> It's based on polling dma-buf fds, i.e. CPU.
>>>
>>>> The former is a bit bad for latency and power management.
>>> There isn't a choice for Wayland compositors in general, since there can be arbitrary other state which needs to be applied atomically together with the new buffer. (Though in theory, a compositor might get fancy and special-case surface commits which can be handled by waiting on the GPU)
>>>
>>> Latency is largely a matter of scheduling in the compositor. The latency incurred by the compositor shouldn't have to be more than single-digit milliseconds. (I've seen total latency from when the client starts processing a (static) frame to when it starts being scanned out as low as ~6 ms with https://gitlab.gnome.org/GNOME/mutter/-/merge_requests/1620, lower than typical with Xorg)
>>
>> Well let me describe it like this:
>>
>> We have use cases for a guaranteed 144 Hz refresh rate. That essentially means that the client application needs to be able to spit out one frame/window content every ~6.9 ms. That's tough, but doable.
>>
>> When you now add 6 ms of latency in the compositor, the client application has only 0.9 ms left for its frame, which is basically impossible to do.
> 
> You misunderstood me. 6 ms is the lowest possible end-to-end latency from client to scanout, but the client can start as early as it wants/needs to. It's a trade-off between latency and the risk of missing a scanout cycle.

Note that what I wrote above is about the case where the compositor needs to draw its own frame sampling from the client buffer. If your concern is about a fullscreen application for which the compositor can directly use the application buffers for scanout, it should be possible in theory to get the latency incurred by the compositor down to ~1 ms.

If that's too much[0], it could be improved further by adding atomic KMS API to replace a pending page flip with another one. Then the compositor could just directly submit a flip as soon as a new buffer becomes ready (or even as soon as the client submits it to the compositor, depending on how exactly the new KMS API works). Then the minimum latency should be mostly up to the kernel driver / HW.

Another possibility would be for the application to use KMS directly, e.g. via a DRM lease. That might still require the same new API to get the flip submission latency significantly below 1 ms though.


[0] Though I'm not sure how to reconcile that with "spitting out one frame every ~6.9ms is tough", as that means the theoretical minimum total client→scanout latency is ~7 ms (and missing a scanout cycle ~doubles the latency).
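The arithmetic behind these numbers can be laid out as a small sketch (the 144 Hz rate and the ~6 ms compositor figure are taken from the discussion above; everything else follows from them):

```python
# Latency budget arithmetic for a guaranteed 144 Hz refresh rate.
refresh_hz = 144
period_ms = 1000 / refresh_hz     # ~6.94 ms per scanout cycle
compositor_latency_ms = 6.0       # end-to-end compositor path (from above)
client_budget_ms = period_ms - compositor_latency_ms

print(f"scanout period:       {period_ms:.2f} ms")        # 6.94
print(f"client render budget: {client_budget_ms:.2f} ms") # 0.94

# Missing a scanout cycle roughly doubles the effective latency,
# since the frame waits a full extra period for the next scanout.
missed_cycle_latency_ms = 2 * period_ms
print(f"latency after a miss: {missed_cycle_latency_ms:.2f} ms")  # 13.89
```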

-- 
Earthling Michel Dänzer               |               https://redhat.com
Libre software enthusiast             |             Mesa and X developer


* Re: [Mesa-dev] Linux Graphics Next: Userspace submission update
  2021-06-01 13:18         ` Michel Dänzer
  2021-06-01 17:39           ` Michel Dänzer
@ 2021-06-01 17:42           ` Daniel Stone
  1 sibling, 0 replies; 50+ messages in thread
From: Daniel Stone @ 2021-06-01 17:42 UTC (permalink / raw)
  To: Michel Dänzer
  Cc: Jason Ekstrand, Marek Olšák, Christian König,
	dri-devel, ML Mesa-dev

Hi,

On Tue, 1 Jun 2021 at 14:18, Michel Dänzer <michel@daenzer.net> wrote:
> On 2021-06-01 2:10 p.m., Christian König wrote:
> > Am 01.06.21 um 12:49 schrieb Michel Dänzer:
> >> There isn't a choice for Wayland compositors in general, since there can be arbitrary other state which needs to be applied atomically together with the new buffer. (Though in theory, a compositor might get fancy and special-case surface commits which can be handled by waiting on the GPU)

Yeah, this is pretty crucial.

> >> Latency is largely a matter of scheduling in the compositor. The latency incurred by the compositor shouldn't have to be more than single-digit milliseconds. (I've seen total latency from when the client starts processing a (static) frame to when it starts being scanned out as low as ~6 ms with https://gitlab.gnome.org/GNOME/mutter/-/merge_requests/1620, lower than typical with Xorg)
> >
> > Well let me describe it like this:
> >
> > We have use cases for a guaranteed 144 Hz refresh rate. That essentially means that the client application needs to be able to spit out one frame/window content every ~6.9 ms. That's tough, but doable.
> >
> > When you now add 6 ms of latency in the compositor, the client application has only 0.9 ms left for its frame, which is basically impossible to do.
>
> You misunderstood me. 6 ms is the lowest possible end-to-end latency from client to scanout, but the client can start as early as it wants/needs to. It's a trade-off between latency and the risk of missing a scanout cycle.

Not quite.

What weston-presentation-shm is reporting is a 6 ms delta between when
it started its rendering loop and when the frame was presented to the
screen. How w-p-s was run matters a lot, because you can insert an
arbitrary delay in there to simulate client rendering. It also matters
a lot that the client is SHM, because that will result in Mutter doing
glTexImage2D on whatever size the window is, then doing a full GL
compositing pass, so even if it's run with zero delay, 6ms isn't 'the
amount of time it takes Mutter to get a frame to screen', it's
measuring the overhead of a texture upload and full-screen composition
as well.

I'm assuming the 'guaranteed 144Hz' target is a fullscreen GL client,
for which you definitely avoid TexImage2D, and could hopefully
(assuming the client picks a modifier which can be displayed) also
avoid the composition pass in favour of direct scanout from the client
buffer; that would give you a much lower number.

Each compositor has its own heuristics around timing. They all make
their own tradeoff between low latency and fewest dropped frames.
Obviously, the higher your latency, the lower the chance of missing a
deadline. There's a lot of detail in the MR that Michel linked.

Cheers,
Daniel


* Re: Linux Graphics Next: Userspace submission update
  2021-06-01 10:49     ` Michel Dänzer
  2021-06-01 12:10       ` [Mesa-dev] " Christian König
@ 2021-06-02  8:09       ` Michel Dänzer
  2021-06-02 19:20         ` Daniel Vetter
  1 sibling, 1 reply; 50+ messages in thread
From: Michel Dänzer @ 2021-06-02  8:09 UTC (permalink / raw)
  To: Christian König, Marek Olšák, dri-devel,
	ML Mesa-dev, Alex Deucher, Daniel Vetter, Dave Airlie,
	Jason Ekstrand, Bas Nieuwenhuizen

On 2021-06-01 12:49 p.m., Michel Dänzer wrote:
> On 2021-06-01 12:21 p.m., Christian König wrote:
> 
>> Another question is if that is sufficient as security for the display server or if we need further handling down the road? I mean essentially we are moving the reliability problem into the display server.
> 
> Good question. This should generally protect the display server from freezing due to client fences never signalling, but there might still be corner cases.

E.g. a client might be able to sneak in a fence between when the compositor checks fences and when it submits its drawing to the kernel.
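That window can be modeled in a few lines (hypothetical structure; a real compositor would do the check by polling dma-buf fds and the submit via its rendering API):

```python
# Model of the check-then-submit race: a fence attached between the
# compositor's check and its submission goes unnoticed by the check,
# but the kernel's implicit sync still waits on it at submit time.

class Buffer:
    def __init__(self):
        self.fences = []  # fences implicitly attached to the buffer

def compositor_repaint(buf, race=None):
    # Step 1: check -- proceed only if every attached fence has signalled.
    if any(not f["signalled"] for f in buf.fences):
        return "skip"  # keep showing the previous content
    if race:
        race()  # window in which a client may sneak in a new fence
    # Step 2: submit -- the kernel implicitly waits on ALL current fences,
    # including one attached after the check.
    if any(not f["signalled"] for f in buf.fences):
        return "blocked on late fence"
    return "presented"

buf = Buffer()
buf.fences.append({"signalled": True})
print(compositor_repaint(buf))  # presented
print(compositor_repaint(buf, race=lambda: buf.fences.append(
    {"signalled": False})))     # blocked on late fence
```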


-- 
Earthling Michel Dänzer               |               https://redhat.com
Libre software enthusiast             |             Mesa and X developer


* Re: [Mesa-dev] Linux Graphics Next: Userspace submission update
  2021-06-01 12:51           ` Christian König
  2021-06-01 13:01             ` Marek Olšák
@ 2021-06-02  8:57             ` Daniel Stone
  2021-06-02  9:34               ` Marek Olšák
  2021-06-02  9:44               ` Christian König
  1 sibling, 2 replies; 50+ messages in thread
From: Daniel Stone @ 2021-06-02  8:57 UTC (permalink / raw)
  To: Christian König
  Cc: Marek Olšák, Michel Dänzer, dri-devel,
	Jason Ekstrand, ML Mesa-dev

Hi Christian,

On Tue, 1 Jun 2021 at 13:51, Christian König
<ckoenig.leichtzumerken@gmail.com> wrote:
> Am 01.06.21 um 14:30 schrieb Daniel Vetter:
> > If you want to enable this use-case with driver magic and without the
> > compositor being aware of what's going on, the solution is EGLStreams.
> > Not sure we want to go there, but it's definitely a lot more feasible
> > than trying to stuff eglstreams semantics into dma-buf implicit
> > fencing support in a desperate attempt to not change compositors.
>
> Well not changing compositors is certainly not something I would try
> with this use case.
>
> Not changing compositors is more like ok we have Ubuntu 20.04 and need
> to support that with the newest hardware generation.

Serious question, have you talked to Canonical?

I mean there's a hell of a lot of effort being expended here, but it
seems to all be predicated on the assumption that Ubuntu's LTS
HWE/backport policy is totally immutable, and that we might need to
make the kernel do backflips to work around that. But ... is it? Has
anyone actually asked them how they feel about this?

I mean, my answer to the first email is 'no, absolutely not' from the
technical perspective (the initial proposal totally breaks current and
future userspace), from a design perspective (it breaks a lot of
usecases which aren't single-vendor GPU+display+codec, or aren't just
a simple desktop), and from a sustainability perspective (cutting
Android adrift again isn't acceptable collateral damage to make it
easier to backport things to last year's Ubuntu release).

But then again, I don't even know what I'm NAKing here ... ? The
original email just lists a proposal to break a ton of things, with
proposed replacements which aren't technically viable, and it's not
clear why? Can we please have some more details and some reasoning
behind them?

I don't mind that userspace (compositor, protocols, clients like Mesa
as well as codec APIs) need to do a lot of work to support the new
model. I do really care though that the hard-binary-switch model works
fine enough for AMD but totally breaks heterogeneous systems and makes
it impossible for userspace to do the right thing.

Cheers,
Daniel


* Re: [Mesa-dev] Linux Graphics Next: Userspace submission update
  2021-06-02  8:57             ` Daniel Stone
@ 2021-06-02  9:34               ` Marek Olšák
  2021-06-02  9:38                 ` Marek Olšák
  2021-06-02  9:44               ` Christian König
  1 sibling, 1 reply; 50+ messages in thread
From: Marek Olšák @ 2021-06-02  9:34 UTC (permalink / raw)
  To: Daniel Stone
  Cc: Christian König, Michel Dänzer, dri-devel,
	Jason Ekstrand, ML Mesa-dev


Yes, we can't break anything because we don't want to complicate things for
us. It's pretty much all NAK'd already. We are trying to gather more
knowledge and then make better decisions.

The idea we are considering is that we'll expose memory-based sync objects
to userspace for read only, and the kernel or hw will strictly control the
memory writes to those sync objects. The hole in that idea is that
userspace can decide not to signal a job, so even if userspace can't
overwrite memory-based sync object states arbitrarily, it can still decide
not to signal them, and then a future fence is born.
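A rough model of that idea (heavily simplified, with hypothetical names; not the real uAPI): only the kernel/HW side can advance the sync object's sequence number, userspace maps it read-only and waits on it with a timeout, so a job that is never signalled shows up as a timeout rather than an unbounded wait.

```python
# Sketch of a memory-based sync object that is read-only for userspace:
# the trusted side is the only writer of the monotonically increasing
# sequence number, and userspace waits on it with a timeout.

import threading, time

class MemorySyncObj:
    def __init__(self):
        self._value = 0                  # monotonically increasing seqno
        self._cond = threading.Condition()

    def kernel_signal(self, seqno):
        """Only the kernel/HW side calls this; userspace has no write access."""
        with self._cond:
            self._value = max(self._value, seqno)
            self._cond.notify_all()

    def user_wait(self, seqno, timeout_s):
        """Read-only wait from userspace; returns False on timeout."""
        deadline = time.monotonic() + timeout_s
        with self._cond:
            while self._value < seqno:
                remaining = deadline - time.monotonic()
                if remaining <= 0 or not self._cond.wait(remaining):
                    return self._value >= seqno
            return True

obj = MemorySyncObj()
threading.Timer(0.01, obj.kernel_signal, args=(1,)).start()
print(obj.user_wait(1, timeout_s=1.0))   # True: job signalled in time
print(obj.user_wait(2, timeout_s=0.05))  # False: never signalled
```

The timeout case would then be handled like a GPU hang attributed to the offending context, as discussed later in the thread.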

Marek

On Wed, Jun 2, 2021 at 4:57 AM Daniel Stone <daniel@fooishbar.org> wrote:

> Hi Christian,
>
> On Tue, 1 Jun 2021 at 13:51, Christian König
> <ckoenig.leichtzumerken@gmail.com> wrote:
> > Am 01.06.21 um 14:30 schrieb Daniel Vetter:
> > > If you want to enable this use-case with driver magic and without the
> > > compositor being aware of what's going on, the solution is EGLStreams.
> > > Not sure we want to go there, but it's definitely a lot more feasible
> > > than trying to stuff eglstreams semantics into dma-buf implicit
> > > fencing support in a desperate attempt to not change compositors.
> >
> > Well not changing compositors is certainly not something I would try
> > with this use case.
> >
> > Not changing compositors is more like ok we have Ubuntu 20.04 and need
> > to support that with the newest hardware generation.
>
> Serious question, have you talked to Canonical?
>
> I mean there's a hell of a lot of effort being expended here, but it
> seems to all be predicated on the assumption that Ubuntu's LTS
> HWE/backport policy is totally immutable, and that we might need to
> make the kernel do backflips to work around that. But ... is it? Has
> anyone actually asked them how they feel about this?
>
> I mean, my answer to the first email is 'no, absolutely not' from the
> technical perspective (the initial proposal totally breaks current and
> future userspace), from a design perspective (it breaks a lot of
> usecases which aren't single-vendor GPU+display+codec, or aren't just
> a simple desktop), and from a sustainability perspective (cutting
> Android adrift again isn't acceptable collateral damage to make it
> easier to backport things to last year's Ubuntu release).
>
> But then again, I don't even know what I'm NAKing here ... ? The
> original email just lists a proposal to break a ton of things, with
> proposed replacements which aren't technically viable, and it's not
> clear why? Can we please have some more details and some reasoning
> behind them?
>
> I don't mind that userspace (compositor, protocols, clients like Mesa
> as well as codec APIs) need to do a lot of work to support the new
> model. I do really care though that the hard-binary-switch model works
> fine enough for AMD but totally breaks heterogeneous systems and makes
> it impossible for userspace to do the right thing.
>
> Cheers,
> Daniel
>


* Re: [Mesa-dev] Linux Graphics Next: Userspace submission update
  2021-06-02  9:34               ` Marek Olšák
@ 2021-06-02  9:38                 ` Marek Olšák
  2021-06-02 18:48                   ` Daniel Vetter
  0 siblings, 1 reply; 50+ messages in thread
From: Marek Olšák @ 2021-06-02  9:38 UTC (permalink / raw)
  To: Daniel Stone
  Cc: Christian König, Michel Dänzer, dri-devel,
	Jason Ekstrand, ML Mesa-dev


On Wed, Jun 2, 2021 at 5:34 AM Marek Olšák <maraeo@gmail.com> wrote:

> Yes, we can't break anything because we don't want to complicate things
> for us. It's pretty much all NAK'd already. We are trying to gather more
> knowledge and then make better decisions.
>
> The idea we are considering is that we'll expose memory-based sync objects
> to userspace for read only, and the kernel or hw will strictly control the
> memory writes to those sync objects. The hole in that idea is that
> userspace can decide not to signal a job, so even if userspace can't
> overwrite memory-based sync object states arbitrarily, it can still decide
> not to signal them, and then a future fence is born.
>

This would actually be treated as a GPU hang caused by that context, so it
should be fine.

Marek


* Re: [Mesa-dev] Linux Graphics Next: Userspace submission update
  2021-06-02  8:57             ` Daniel Stone
  2021-06-02  9:34               ` Marek Olšák
@ 2021-06-02  9:44               ` Christian König
  2021-06-02  9:58                 ` Marek Olšák
  1 sibling, 1 reply; 50+ messages in thread
From: Christian König @ 2021-06-02  9:44 UTC (permalink / raw)
  To: Daniel Stone
  Cc: Marek Olšák, Michel Dänzer, dri-devel,
	Jason Ekstrand, ML Mesa-dev

Am 02.06.21 um 10:57 schrieb Daniel Stone:
> Hi Christian,
>
> On Tue, 1 Jun 2021 at 13:51, Christian König
> <ckoenig.leichtzumerken@gmail.com> wrote:
>> Am 01.06.21 um 14:30 schrieb Daniel Vetter:
>>> If you want to enable this use-case with driver magic and without the
>>> compositor being aware of what's going on, the solution is EGLStreams.
>>> Not sure we want to go there, but it's definitely a lot more feasible
>>> than trying to stuff eglstreams semantics into dma-buf implicit
>>> fencing support in a desperate attempt to not change compositors.
>> Well not changing compositors is certainly not something I would try
>> with this use case.
>>
>> Not changing compositors is more like ok we have Ubuntu 20.04 and need
>> to support that with the newest hardware generation.
> Serious question, have you talked to Canonical?
>
> I mean there's a hell of a lot of effort being expended here, but it
> seems to all be predicated on the assumption that Ubuntu's LTS
> HWE/backport policy is totally immutable, and that we might need to
> make the kernel do backflips to work around that. But ... is it? Has
> anyone actually asked them how they feel about this?

This was merely an example. What I wanted to say is that we need to 
support systems already deployed.

In other words our customers won't accept that they need to replace the 
compositor just because they switch to a new hardware generation.

> I mean, my answer to the first email is 'no, absolutely not' from the
> technical perspective (the initial proposal totally breaks current and
> future userspace), from a design perspective (it breaks a lot of
> usecases which aren't single-vendor GPU+display+codec, or aren't just
> a simple desktop), and from a sustainability perspective (cutting
> Android adrift again isn't acceptable collateral damage to make it
> easier to backport things to last year's Ubuntu release).
>
> But then again, I don't even know what I'm NAKing here ... ? The
> original email just lists a proposal to break a ton of things, with
> proposed replacements which aren't technically viable, and it's not
> clear why? Can we please have some more details and some reasoning
> behind them?
>
> I don't mind that userspace (compositor, protocols, clients like Mesa
> as well as codec APIs) need to do a lot of work to support the new
> model. I do really care though that the hard-binary-switch model works
> fine enough for AMD but totally breaks heterogeneous systems and makes
> it impossible for userspace to do the right thing.

Well, how the handling for new Android, distributions etc. is going to 
look is a completely different story.

And I completely agree with both Daniel Vetter and you that we need to 
keep this in mind when designing the compatibility with older software.

For Android I'm really not sure what to do. In general Android is 
already trying to do the right thing by using explicit sync; the problem 
is that this is built around the idea that this explicit sync is 
kernel-based sync files.

Either we need to change Android and come up with something that works 
with user fences as well, or we somehow invent a compatibility layer for 
sync files as well.

Regards,
Christian.

>
> Cheers,
> Daniel



* Re: [Mesa-dev] Linux Graphics Next: Userspace submission update
  2021-06-02  9:44               ` Christian König
@ 2021-06-02  9:58                 ` Marek Olšák
  2021-06-02 10:06                   ` Christian König
  0 siblings, 1 reply; 50+ messages in thread
From: Marek Olšák @ 2021-06-02  9:58 UTC (permalink / raw)
  To: Christian König
  Cc: Michel Dänzer, dri-devel, Jason Ekstrand, ML Mesa-dev


On Wed, Jun 2, 2021 at 5:44 AM Christian König <
ckoenig.leichtzumerken@gmail.com> wrote:

> Am 02.06.21 um 10:57 schrieb Daniel Stone:
> > Hi Christian,
> >
> > On Tue, 1 Jun 2021 at 13:51, Christian König
> > <ckoenig.leichtzumerken@gmail.com> wrote:
> >> Am 01.06.21 um 14:30 schrieb Daniel Vetter:
> >>> If you want to enable this use-case with driver magic and without the
> >>> compositor being aware of what's going on, the solution is EGLStreams.
> >>> Not sure we want to go there, but it's definitely a lot more feasible
> >>> than trying to stuff eglstreams semantics into dma-buf implicit
> >>> fencing support in a desperate attempt to not change compositors.
> >> Well not changing compositors is certainly not something I would try
> >> with this use case.
> >>
> >> Not changing compositors is more like ok we have Ubuntu 20.04 and need
> >> to support that with the newest hardware generation.
> > Serious question, have you talked to Canonical?
> >
> > I mean there's a hell of a lot of effort being expended here, but it
> > seems to all be predicated on the assumption that Ubuntu's LTS
> > HWE/backport policy is totally immutable, and that we might need to
> > make the kernel do backflips to work around that. But ... is it? Has
> > anyone actually asked them how they feel about this?
>
> This was merely an example. What I wanted to say is that we need to
> support systems already deployed.
>
> In other words our customers won't accept that they need to replace the
> compositor just because they switch to a new hardware generation.
>
> > I mean, my answer to the first email is 'no, absolutely not' from the
> > technical perspective (the initial proposal totally breaks current and
> > future userspace), from a design perspective (it breaks a lot of
> > usecases which aren't single-vendor GPU+display+codec, or aren't just
> > a simple desktop), and from a sustainability perspective (cutting
> > Android adrift again isn't acceptable collateral damage to make it
> > easier to backport things to last year's Ubuntu release).
> >
> > But then again, I don't even know what I'm NAKing here ... ? The
> > original email just lists a proposal to break a ton of things, with
> > proposed replacements which aren't technically viable, and it's not
> > clear why? Can we please have some more details and some reasoning
> > behind them?
> >
> > I don't mind that userspace (compositor, protocols, clients like Mesa
> > as well as codec APIs) need to do a lot of work to support the new
> > model. I do really care though that the hard-binary-switch model works
> > fine enough for AMD but totally breaks heterogeneous systems and makes
> > it impossible for userspace to do the right thing.
>
> Well, how the handling for new Android, distributions etc. is going to
> look is a completely different story.
>
> And I completely agree with both Daniel Vetter and you that we need to
> keep this in mind when designing the compatibility with older software.
>
> For Android I'm really not sure what to do. In general Android is
> already trying to do the right thing by using explicit sync, the problem
> is that this is built around the idea that this explicit sync is
> syncfile kernel based.
>
> Either we need to change Android and come up with something that works
> with user fences as well or we somehow invent a compatibility layer for
> syncfile as well.
>

What's the issue with syncfiles that syncobjs don't suffer from?

Marek


* Re: [Mesa-dev] Linux Graphics Next: Userspace submission update
  2021-06-02  9:58                 ` Marek Olšák
@ 2021-06-02 10:06                   ` Christian König
  0 siblings, 0 replies; 50+ messages in thread
From: Christian König @ 2021-06-02 10:06 UTC (permalink / raw)
  To: Marek Olšák
  Cc: Michel Dänzer, dri-devel, Jason Ekstrand, ML Mesa-dev


Am 02.06.21 um 11:58 schrieb Marek Olšák:
> On Wed, Jun 2, 2021 at 5:44 AM Christian König
> <ckoenig.leichtzumerken@gmail.com> wrote:
>
>     Am 02.06.21 um 10:57 schrieb Daniel Stone:
>     > Hi Christian,
>     >
>     > On Tue, 1 Jun 2021 at 13:51, Christian König
>     > <ckoenig.leichtzumerken@gmail.com> wrote:
>     >> Am 01.06.21 um 14:30 schrieb Daniel Vetter:
>     >>> If you want to enable this use-case with driver magic and
>     without the
>     >>> compositor being aware of what's going on, the solution is
>     EGLStreams.
>     >>> Not sure we want to go there, but it's definitely a lot more
>     feasible
>     >>> than trying to stuff eglstreams semantics into dma-buf implicit
>     >>> fencing support in a desperate attempt to not change compositors.
>     >> Well not changing compositors is certainly not something I
>     would try
>     >> with this use case.
>     >>
>     >> Not changing compositors is more like ok we have Ubuntu 20.04
>     and need
>     >> to support that with the newest hardware generation.
>     > Serious question, have you talked to Canonical?
>     >
>     > I mean there's a hell of a lot of effort being expended here, but it
>     > seems to all be predicated on the assumption that Ubuntu's LTS
>     > HWE/backport policy is totally immutable, and that we might need to
>     > make the kernel do backflips to work around that. But ... is it? Has
>     > anyone actually asked them how they feel about this?
>
>     This was merely an example. What I wanted to say is that we need to
>     support systems already deployed.
>
>     In other words our customers won't accept that they need to
>     replace the
>     compositor just because they switch to a new hardware generation.
>
>     > I mean, my answer to the first email is 'no, absolutely not'
>     from the
>     > technical perspective (the initial proposal totally breaks
>     current and
>     > future userspace), from a design perspective (it breaks a lot of
>     > usecases which aren't single-vendor GPU+display+codec, or aren't
>     just
>     > a simple desktop), and from a sustainability perspective (cutting
>     > Android adrift again isn't acceptable collateral damage to make it
>     > easier to backport things to last year's Ubuntu release).
>     >
>     > But then again, I don't even know what I'm NAKing here ... ? The
>     > original email just lists a proposal to break a ton of things, with
>     > proposed replacements which aren't technically viable, and it's not
>     > clear why? Can we please have some more details and some reasoning
>     > behind them?
>     >
>     > I don't mind that userspace (compositor, protocols, clients like
>     Mesa
>     > as well as codec APIs) need to do a lot of work to support the new
>     > model. I do really care though that the hard-binary-switch model
>     works
>     > fine enough for AMD but totally breaks heterogeneous systems and
>     makes
>     > it impossible for userspace to do the right thing.
>
>     Well, how the handling for new Android, distributions etc. is
>     going to
>     look is a completely different story.
>
>     And I completely agree with both Daniel Vetter and you that we
>     need to
>     keep this in mind when designing the compatibility with older
>     software.
>
>     For Android I'm really not sure what to do. In general Android is
>     already trying to do the right thing by using explicit sync, the
>     problem
>     is that this is built around the idea that this explicit sync is
>     syncfile kernel based.
>
>     Either we need to change Android and come up with something that
>     works
>     with user fences as well or we somehow invent a compatibility
>     layer for
>     syncfile as well.
>
>
> What's the issue with syncfiles that syncobjs don't suffer from?

Syncobjs were designed with future fences in mind. In other words, we 
already have the ability to wait for a future submission to appear, with 
all the nasty locking implications that entails.

Syncfiles, on the other hand, are just a container for up to N kernel 
fences, and since we don't have kernel fences any more, that is rather 
tricky to keep working.

I'm going to look into the uAPI around syncfiles once more and see if we 
can somehow use it for user fences as well.

Christian.

>
> Marek



^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [Mesa-dev] Linux Graphics Next: Userspace submission update
  2021-06-02  9:38                 ` Marek Olšák
@ 2021-06-02 18:48                   ` Daniel Vetter
  2021-06-02 18:52                     ` Christian König
  2021-06-03  3:16                     ` Marek Olšák
  0 siblings, 2 replies; 50+ messages in thread
From: Daniel Vetter @ 2021-06-02 18:48 UTC (permalink / raw)
  To: Marek Olšák
  Cc: Jason Ekstrand, Christian König, Michel Dänzer,
	dri-devel, ML Mesa-dev

On Wed, Jun 02, 2021 at 05:38:51AM -0400, Marek Olšák wrote:
> On Wed, Jun 2, 2021 at 5:34 AM Marek Olšák <maraeo@gmail.com> wrote:
> 
> > Yes, we can't break anything because we don't want to complicate things
> > for us. It's pretty much all NAK'd already. We are trying to gather more
> > knowledge and then make better decisions.
> >
> > The idea we are considering is that we'll expose memory-based sync objects
> > to userspace for read only, and the kernel or hw will strictly control the
> > memory writes to those sync objects. The hole in that idea is that
> > userspace can decide not to signal a job, so even if userspace can't
> > overwrite memory-based sync object states arbitrarily, it can still decide
> > not to signal them, and then a future fence is born.
> >
> 
> This would actually be treated as a GPU hang caused by that context, so it
> should be fine.

This is practically what I proposed already, except you're not doing it with
dma_fence. And on the memory fence side this also doesn't actually give you
what you want for that compute model.

This seems like a bit of a worst-of-both-worlds approach to me? Tons of work
in the kernel to hide these not-dma_fence-but-almost objects, and still pain
to actually drive the hardware like it should be for compute or direct
display.

Also maybe I've missed it, but I didn't see any replies to my suggestion on
how to fake the entire dma_fence stuff on top of new hw. Would be
interesting to know what doesn't work there instead of amd folks going off
internal again and then coming back with another rfc out of
nowhere :-)
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch


* Re: [Mesa-dev] Linux Graphics Next: Userspace submission update
  2021-06-02 18:48                   ` Daniel Vetter
@ 2021-06-02 18:52                     ` Christian König
  2021-06-02 19:19                       ` Daniel Vetter
  2021-06-03  3:16                     ` Marek Olšák
  1 sibling, 1 reply; 50+ messages in thread
From: Christian König @ 2021-06-02 18:52 UTC (permalink / raw)
  To: Daniel Vetter, Marek Olšák
  Cc: Jason Ekstrand, ML Mesa-dev, Michel Dänzer, dri-devel



Am 02.06.21 um 20:48 schrieb Daniel Vetter:
> On Wed, Jun 02, 2021 at 05:38:51AM -0400, Marek Olšák wrote:
>> On Wed, Jun 2, 2021 at 5:34 AM Marek Olšák <maraeo@gmail.com> wrote:
>>
>>> Yes, we can't break anything because we don't want to complicate things
>>> for us. It's pretty much all NAK'd already. We are trying to gather more
>>> knowledge and then make better decisions.
>>>
>>> The idea we are considering is that we'll expose memory-based sync objects
>>> to userspace for read only, and the kernel or hw will strictly control the
>>> memory writes to those sync objects. The hole in that idea is that
>>> userspace can decide not to signal a job, so even if userspace can't
>>> overwrite memory-based sync object states arbitrarily, it can still decide
>>> not to signal them, and then a future fence is born.
>>>
>> This would actually be treated as a GPU hang caused by that context, so it
>> should be fine.
> This is practically what I proposed already, except your not doing it with
> dma_fence. And on the memory fence side this also doesn't actually give
> what you want for that compute model.
>
> This seems like a bit a worst of both worlds approach to me? Tons of work
> in the kernel to hide these not-dma_fence-but-almost, and still pain to
> actually drive the hardware like it should be for compute or direct
> display.
>
> Also maybe I've missed it, but I didn't see any replies to my suggestion
> how to fake the entire dma_fence stuff on top of new hw. Would be
> interesting to know what doesn't work there instead of amd folks going of
> into internal again and then coming back with another rfc from out of
> nowhere :-)

Well, to be honest, I would just push back on our hardware/firmware guys 
and insist that we keep kernel queues forever before going down that route.

The fact that syncfile and all that Android stuff isn't working out of the 
box with the new shiny user queue submission model (which in turn is mostly 
because of Windows) has already raised some eyebrows here.

Christian.

> -Daniel



* Re: [Mesa-dev] Linux Graphics Next: Userspace submission update
  2021-06-02 18:52                     ` Christian König
@ 2021-06-02 19:19                       ` Daniel Vetter
  2021-06-04  7:00                         ` Christian König
  0 siblings, 1 reply; 50+ messages in thread
From: Daniel Vetter @ 2021-06-02 19:19 UTC (permalink / raw)
  To: Christian König
  Cc: Marek Olšák, Michel Dänzer, dri-devel,
	Jason Ekstrand, ML Mesa-dev

On Wed, Jun 02, 2021 at 08:52:38PM +0200, Christian König wrote:
> 
> 
> Am 02.06.21 um 20:48 schrieb Daniel Vetter:
> > On Wed, Jun 02, 2021 at 05:38:51AM -0400, Marek Olšák wrote:
> > > On Wed, Jun 2, 2021 at 5:34 AM Marek Olšák <maraeo@gmail.com> wrote:
> > > 
> > > > Yes, we can't break anything because we don't want to complicate things
> > > > for us. It's pretty much all NAK'd already. We are trying to gather more
> > > > knowledge and then make better decisions.
> > > > 
> > > > The idea we are considering is that we'll expose memory-based sync objects
> > > > to userspace for read only, and the kernel or hw will strictly control the
> > > > memory writes to those sync objects. The hole in that idea is that
> > > > userspace can decide not to signal a job, so even if userspace can't
> > > > overwrite memory-based sync object states arbitrarily, it can still decide
> > > > not to signal them, and then a future fence is born.
> > > > 
> > > This would actually be treated as a GPU hang caused by that context, so it
> > > should be fine.
> > This is practically what I proposed already, except your not doing it with
> > dma_fence. And on the memory fence side this also doesn't actually give
> > what you want for that compute model.
> > 
> > This seems like a bit a worst of both worlds approach to me? Tons of work
> > in the kernel to hide these not-dma_fence-but-almost, and still pain to
> > actually drive the hardware like it should be for compute or direct
> > display.
> > 
> > Also maybe I've missed it, but I didn't see any replies to my suggestion
> > how to fake the entire dma_fence stuff on top of new hw. Would be
> > interesting to know what doesn't work there instead of amd folks going of
> > into internal again and then coming back with another rfc from out of
> > nowhere :-)
> 
> Well to be honest I would just push back on our hardware/firmware guys that
> we need to keep kernel queues forever before going down that route.

I looked again, and you said the model won't work because preemption is way
too slow, even when the context is idle.

I guess at that point I got maybe too fed up and just figured "not my
problem", but if preemption is too slow as the unload fence, you can do it
with pte removal and tlb shootdown too (that is hopefully not too slow,
otherwise your hw is just garbage and won't even be fast for direct-submit
compute workloads).

The only thing that you need to do when you use pte clearing + tlb
shootdown instead of preemption as the unload fence for buffers that get
moved is that if you get any gpu page fault, you don't serve it, but
instead treat it as a tdr and shoot the context permanently.

So summarizing the model I proposed:

- you allow userspace to directly write into the ringbuffer, and also
  write the fences directly

- actual submit is done by the kernel, using drm/scheduler. The kernel
  blindly trusts userspace to set up everything else, and even just wraps
  dma_fences around the userspace memory fences.

- the only check is tdr. If a fence doesn't complete and a tdr fires, a) the
  kernel shoots the entire context and b) userspace recovers by setting up a
  new ringbuffer

- memory management is done using ttm only, you still need to supply the
  buffer list (ofc that list includes the always-present ones, so CS will
  only get the list of special buffers like today). If your hw can't turn
  off gpu page faults and you ever get one, we pull out the same old
  solution: the kernel shoots the entire context.

  The important thing is that from the gpu pov memory management works
  exactly like compute workload with direct submit, except that you just
  terminate the context on _any_ page fault, instead of only those that go
  somewhere where there's really no mapping and repair the others.

  Also I guess from reading the old thread this means you'd disable page
  fault retry because that is apparently also way too slow for anything.

- memory management uses an unload fence. That unload fence waits for all
  userspace memory fences (represented as dma_fence) to complete, with
  maybe some fudge to busy-spin until we've reached the actual end of the
  ringbuffer (maybe you have an IB tail there after the memory fence write,
  we have that on intel hw), and it waits for the memory to get
  "unloaded". This is either preemption, or pte clearing + tlb shootdown,
  or whatever else your hw provides which is a) used for dynamic memory
  management and b) fast enough for actual memory management.

- any time a context dies we force-complete all its pending fences,
  in order, ofc

So from the hw pov this looks 99% like direct userspace submit, with the
exact same mappings, command sequences and everything else. The only
difference is that the ringbuffer head/tail updates happen from
drm/scheduler, instead of directly from userspace.

None of this stuff needs funny tricks where the kernel controls the
writes to memory fences, or where you need kernel ringbuffers, or anything
like this. Userspace is allowed to do anything stupid, the rules are
guaranteed with:

- we rely on the hw isolation features to work, but _exactly_ like compute
  direct submit would too

- dying on any page fault captures memory management issues

- dying (without kernel recovery; this is up to userspace if it cares) on
  any tdr makes sure fences still complete

> That syncfile and all that Android stuff isn't working out of the box with
> the new shiny user queue submission model (which in turn is mostly because
> of Windows) already raised some eyebrows here.

I think if you really want to make sure the current linux stack doesn't
break, the _only_ option you have is to provide a ctx mode that allows
dma_fence and drm/scheduler to be used like today.

For everything else it sounds like you're a few years too late, because even
just huge kernel changes won't happen in time, much less rewriting
userspace protocols.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch


* Re: Linux Graphics Next: Userspace submission update
  2021-06-02  8:09       ` Michel Dänzer
@ 2021-06-02 19:20         ` Daniel Vetter
  0 siblings, 0 replies; 50+ messages in thread
From: Daniel Vetter @ 2021-06-02 19:20 UTC (permalink / raw)
  To: Michel Dänzer
  Cc: Jason Ekstrand, Marek Olšák, Christian König,
	dri-devel, ML Mesa-dev

On Wed, Jun 02, 2021 at 10:09:01AM +0200, Michel Dänzer wrote:
> On 2021-06-01 12:49 p.m., Michel Dänzer wrote:
> > On 2021-06-01 12:21 p.m., Christian König wrote:
> > 
> >> Another question is if that is sufficient as security for the display server or if we need further handling down the road? I mean essentially we are moving the reliability problem into the display server.
> > 
> > Good question. This should generally protect the display server from freezing due to client fences never signalling, but there might still be corner cases.
> 
> E.g. a client might be able to sneak in a fence between when the
> compositor checks fences and when it submits its drawing to the kernel.

This is why implicit sync should be handled with explicit IPC. You pick
the fence up once, and then you need to tell your GL stack to _not_ do
implicit sync. That would need a new extension. vk afaiui already does this
automatically.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch


* Re: [Mesa-dev] Linux Graphics Next: Userspace submission update
  2021-06-02 18:48                   ` Daniel Vetter
  2021-06-02 18:52                     ` Christian König
@ 2021-06-03  3:16                     ` Marek Olšák
  2021-06-03  7:47                       ` Daniel Vetter
  1 sibling, 1 reply; 50+ messages in thread
From: Marek Olšák @ 2021-06-03  3:16 UTC (permalink / raw)
  To: Daniel Vetter
  Cc: Christian König, Michel Dänzer, dri-devel,
	Jason Ekstrand, ML Mesa-dev


On Wed, Jun 2, 2021 at 2:48 PM Daniel Vetter <daniel@ffwll.ch> wrote:

> On Wed, Jun 02, 2021 at 05:38:51AM -0400, Marek Olšák wrote:
> > On Wed, Jun 2, 2021 at 5:34 AM Marek Olšák <maraeo@gmail.com> wrote:
> >
> > > Yes, we can't break anything because we don't want to complicate things
> > > for us. It's pretty much all NAK'd already. We are trying to gather
> more
> > > knowledge and then make better decisions.
> > >
> > > The idea we are considering is that we'll expose memory-based sync
> objects
> > > to userspace for read only, and the kernel or hw will strictly control
> the
> > > memory writes to those sync objects. The hole in that idea is that
> > > userspace can decide not to signal a job, so even if userspace can't
> > > overwrite memory-based sync object states arbitrarily, it can still
> decide
> > > not to signal them, and then a future fence is born.
> > >
> >
> > This would actually be treated as a GPU hang caused by that context, so
> it
> > should be fine.
>
> This is practically what I proposed already, except your not doing it with
> dma_fence. And on the memory fence side this also doesn't actually give
> what you want for that compute model.
>
> This seems like a bit a worst of both worlds approach to me? Tons of work
> in the kernel to hide these not-dma_fence-but-almost, and still pain to
> actually drive the hardware like it should be for compute or direct
> display.
>
> Also maybe I've missed it, but I didn't see any replies to my suggestion
> how to fake the entire dma_fence stuff on top of new hw. Would be
> interesting to know what doesn't work there instead of amd folks going of
> into internal again and then coming back with another rfc from out of
> nowhere :-)
>

Going internal again is probably a good idea to spare you the long
discussions and not waste your time, but we haven't talked about the
dma_fence stuff internally other than acknowledging that it can be solved.

The compute use case already uses the hw as-is with no inter-process
sharing, which mostly keeps the kernel out of the picture. It uses glFinish
to sync with GL.

The gfx use case needs new hardware logic to support implicit and explicit
sync. When we propose a solution, it's usually torn apart the next day by
ourselves.

Since we are talking about next hw or next next hw, preemption should be
better.

user queue = user-mapped ring buffer

For implicit sync, we will only let userspace lock access to a buffer via a
user queue, which waits for the per-buffer sequence counter in memory to be
>= the number assigned by the kernel, and later unlock the access with
another command, which increments the per-buffer sequence counter in memory
with an atomic_inc regardless of the number assigned by the kernel. The
kernel counter and the counter in memory can be out of sync, and I'll
explain why that's OK. If a process increments the kernel counter but not
the memory counter, that's its problem, and it's the same as a GPU hang
caused by that process. If a process increments the memory counter but not
the kernel counter, the ">=" condition alongside atomic_inc guarantees that
signaling n will also signal n+1, so it will never deadlock, but it will
effectively disable synchronization. This method of disabling
synchronization is similar to a process corrupting the buffer, which should
be fine. Can you find any flaw in it? I can't find any.

The explicit submit can be done by userspace (if there is no
synchronization), but we plan to use the kernel to do it for implicit sync.
Essentially, the kernel will receive a buffer list and the addresses of the
wait commands in the user queue. It will assign new sequence numbers to all
buffers, write those numbers into the wait commands, and ring the hw
doorbell to start execution of that queue.

Marek



* Re: [Mesa-dev] Linux Graphics Next: Userspace submission update
  2021-06-03  3:16                     ` Marek Olšák
@ 2021-06-03  7:47                       ` Daniel Vetter
  2021-06-03  8:20                         ` Marek Olšák
  0 siblings, 1 reply; 50+ messages in thread
From: Daniel Vetter @ 2021-06-03  7:47 UTC (permalink / raw)
  To: Marek Olšák
  Cc: Jason Ekstrand, Christian König, Michel Dänzer,
	dri-devel, ML Mesa-dev

On Wed, Jun 02, 2021 at 11:16:39PM -0400, Marek Olšák wrote:
> On Wed, Jun 2, 2021 at 2:48 PM Daniel Vetter <daniel@ffwll.ch> wrote:
> 
> > On Wed, Jun 02, 2021 at 05:38:51AM -0400, Marek Olšák wrote:
> > > On Wed, Jun 2, 2021 at 5:34 AM Marek Olšák <maraeo@gmail.com> wrote:
> > >
> > > > Yes, we can't break anything because we don't want to complicate things
> > > > for us. It's pretty much all NAK'd already. We are trying to gather
> > more
> > > > knowledge and then make better decisions.
> > > >
> > > > The idea we are considering is that we'll expose memory-based sync
> > objects
> > > > to userspace for read only, and the kernel or hw will strictly control
> > the
> > > > memory writes to those sync objects. The hole in that idea is that
> > > > userspace can decide not to signal a job, so even if userspace can't
> > > > overwrite memory-based sync object states arbitrarily, it can still
> > decide
> > > > not to signal them, and then a future fence is born.
> > > >
> > >
> > > This would actually be treated as a GPU hang caused by that context, so
> > it
> > > should be fine.
> >
> > This is practically what I proposed already, except your not doing it with
> > dma_fence. And on the memory fence side this also doesn't actually give
> > what you want for that compute model.
> >
> > This seems like a bit a worst of both worlds approach to me? Tons of work
> > in the kernel to hide these not-dma_fence-but-almost, and still pain to
> > actually drive the hardware like it should be for compute or direct
> > display.
> >
> > Also maybe I've missed it, but I didn't see any replies to my suggestion
> > how to fake the entire dma_fence stuff on top of new hw. Would be
> > interesting to know what doesn't work there instead of amd folks going of
> > into internal again and then coming back with another rfc from out of
> > nowhere :-)
> >
> 
> Going internal again is probably a good idea to spare you the long
> discussions and not waste your time, but we haven't talked about the
> dma_fence stuff internally other than acknowledging that it can be solved.
> 
> The compute use case already uses the hw as-is with no inter-process
> sharing, which mostly keeps the kernel out of the picture. It uses glFinish
> to sync with GL.
> 
> The gfx use case needs new hardware logic to support implicit and explicit
> sync. When we propose a solution, it's usually torn apart the next day by
> ourselves.
> 
> Since we are talking about next hw or next next hw, preemption should be
> better.
> 
> user queue = user-mapped ring buffer
> 
> For implicit sync, we will only let userspace lock access to a buffer via a
> user queue, which waits for the per-buffer sequence counter in memory to be
> >= the number assigned by the kernel, and later unlock the access with
> another command, which increments the per-buffer sequence counter in memory
> with atomic_inc regardless of the number assigned by the kernel. The kernel
> counter and the counter in memory can be out-of-sync, and I'll explain why
> it's OK. If a process increments the kernel counter but not the memory
> counter, that's its problem and it's the same as a GPU hang caused by that
> process. If a process increments the memory counter but not the kernel
> counter, the ">=" condition alongside atomic_inc guarantee that signaling n
> will signal n+1, so it will never deadlock but also it will effectively
> disable synchronization. This method of disabling synchronization is
> similar to a process corrupting the buffer, which should be fine. Can you
> find any flaw in it? I can't find any.

Hm, maybe I misunderstood what exactly you wanted to do earlier. That kind
of "we let userspace free-wheel whatever it wants, the kernel ensures
correctness of the resulting chain of dma_fences by resetting the entire
context" is what I proposed too.

Like you say, userspace is allowed to render garbage already.

> The explicit submit can be done by userspace (if there is no
> synchronization), but we plan to use the kernel to do it for implicit sync.
> Essentially, the kernel will receive a buffer list and addresses of wait
> commands in the user queue. It will assign new sequence numbers to all
> buffers and write those numbers into the wait commands, and ring the hw
> doorbell to start execution of that queue.

Yeah, for implicit sync I think the kernel using drm/scheduler to sort out
the dma_fence dependencies is probably best. Since you can filter which
dma_fences you hand to the scheduler for dependency tracking, you can
filter out your own ones and let the hw handle those directly (depending on
how much your hw can do and all that). On i915 we might do that to be able
to use the MI_SEMAPHORE_WAIT/SIGNAL functionality in the hw and fw
scheduler.

For buffer tracking with implicit sync I think the cleanest approach is
probably to still keep them wrapped as dma_fence and stuffed into dma_resv,
but conceptually it's the same. If we let every driver reinvent their own
buffer tracking just because the hw works a bit differently, it'll be a
mess.

Wrt wait commands: I'm honestly not sure why you'd do that. Userspace gets
to keep the pieces if it gets it wrong. You do still need to handle
external dma_fences though, hence the drm/scheduler frontend to sort these
out.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch


* Re: [Mesa-dev] Linux Graphics Next: Userspace submission update
  2021-06-03  7:47                       ` Daniel Vetter
@ 2021-06-03  8:20                         ` Marek Olšák
  2021-06-03 10:03                           ` Daniel Vetter
  0 siblings, 1 reply; 50+ messages in thread
From: Marek Olšák @ 2021-06-03  8:20 UTC (permalink / raw)
  To: Daniel Vetter
  Cc: Christian König, Michel Dänzer, dri-devel,
	Jason Ekstrand, ML Mesa-dev


On Thu, Jun 3, 2021 at 3:47 AM Daniel Vetter <daniel@ffwll.ch> wrote:

> On Wed, Jun 02, 2021 at 11:16:39PM -0400, Marek Olšák wrote:
> > On Wed, Jun 2, 2021 at 2:48 PM Daniel Vetter <daniel@ffwll.ch> wrote:
> >
> > > On Wed, Jun 02, 2021 at 05:38:51AM -0400, Marek Olšák wrote:
> > > > On Wed, Jun 2, 2021 at 5:34 AM Marek Olšák <maraeo@gmail.com> wrote:
> > > >
> > > > > Yes, we can't break anything because we don't want to complicate
> things
> > > > > for us. It's pretty much all NAK'd already. We are trying to gather
> > > more
> > > > > knowledge and then make better decisions.
> > > > >
> > > > > The idea we are considering is that we'll expose memory-based sync
> > > objects
> > > > > to userspace for read only, and the kernel or hw will strictly
> control
> > > the
> > > > > memory writes to those sync objects. The hole in that idea is that
> > > > > userspace can decide not to signal a job, so even if userspace
> can't
> > > > > overwrite memory-based sync object states arbitrarily, it can still
> > > decide
> > > > > not to signal them, and then a future fence is born.
> > > > >
> > > >
> > > > This would actually be treated as a GPU hang caused by that context,
> so
> > > it
> > > > should be fine.
> > >
> > > This is practically what I proposed already, except your not doing it
> with
> > > dma_fence. And on the memory fence side this also doesn't actually give
> > > what you want for that compute model.
> > >
> > > This seems like a bit a worst of both worlds approach to me? Tons of
> work
> > > in the kernel to hide these not-dma_fence-but-almost, and still pain to
> > > actually drive the hardware like it should be for compute or direct
> > > display.
> > >
> > > Also maybe I've missed it, but I didn't see any replies to my
> suggestion
> > > how to fake the entire dma_fence stuff on top of new hw. Would be
> > > interesting to know what doesn't work there instead of amd folks going
> of
> > > into internal again and then coming back with another rfc from out of
> > > nowhere :-)
> > >
> >
> > Going internal again is probably a good idea to spare you the long
> > discussions and not waste your time, but we haven't talked about the
> > dma_fence stuff internally other than acknowledging that it can be
> solved.
> >
> > The compute use case already uses the hw as-is with no inter-process
> > sharing, which mostly keeps the kernel out of the picture. It uses
> glFinish
> > to sync with GL.
> >
> > The gfx use case needs new hardware logic to support implicit and
> explicit
> > sync. When we propose a solution, it's usually torn apart the next day by
> > ourselves.
> >
> > Since we are talking about next hw or next next hw, preemption should be
> > better.
> >
> > user queue = user-mapped ring buffer
> >
> > For implicit sync, we will only let userspace lock access to a buffer
> via a
> > user queue, which waits for the per-buffer sequence counter in memory to
> be
> > >= the number assigned by the kernel, and later unlock the access with
> > another command, which increments the per-buffer sequence counter in
> memory
> > with atomic_inc regardless of the number assigned by the kernel. The
> kernel
> > counter and the counter in memory can be out-of-sync, and I'll explain
> why
> > it's OK. If a process increments the kernel counter but not the memory
> > counter, that's its problem and it's the same as a GPU hang caused by
> that
> > process. If a process increments the memory counter but not the kernel
> > counter, the ">=" condition alongside atomic_inc guarantee that
> signaling n
> > will signal n+1, so it will never deadlock but also it will effectively
> > disable synchronization. This method of disabling synchronization is
> > similar to a process corrupting the buffer, which should be fine. Can you
> > find any flaw in it? I can't find any.
>
> Hm maybe I misunderstood what exactly you wanted to do earlier. That kind
> of "we let userspace free-wheel whatever it wants, kernel ensures
> correctness of the resulting chain of dma_fence with reset the entire
> context" is what I proposed too.
>
> Like you say, userspace is allowed to render garbage already.
>
> > The explicit submit can be done by userspace (if there is no
> > synchronization), but we plan to use the kernel to do it for implicit
> sync.
> > Essentially, the kernel will receive a buffer list and addresses of wait
> > commands in the user queue. It will assign new sequence numbers to all
> > buffers and write those numbers into the wait commands, and ring the hw
> > doorbell to start execution of that queue.
>
> Yeah for implicit sync I think kernel and using drm/scheduler to sort out
> the dma_fence dependencies is probably best. Since you can filter out
> which dma_fence you hand to the scheduler for dependency tracking you can
> filter out your own ones and let the hw handle those directly (depending
> how much your hw can do an all that). On i915 we might do that to be able
> to use MI_SEMAPHORE_WAIT/SIGNAL functionality in the hw and fw scheduler.
>
> For buffer tracking with implicit sync I think cleanest is probably to
> still keep them wrapped as dma_fence and stuffed into dma_resv, but
> conceptually it's the same. If we let every driver reinvent their own
> buffer tracking just because the hw works a bit different it'll be a mess.
>
> Wrt wait commands: I'm honestly not sure why you'd do that. Userspace gets
> to keep the pieces if it gets it wrong. You do still need to handle
> external dma_fence though, hence drm/scheduler frontend to sort these out.
>

The reason is to disallow a lower-privileged process from deadlocking or
hanging a higher-privileged process where the kernel can't tell who did it.
If the implicit-sync sequence counter is read-only to userspace and only
incrementable by the unlock-signal command after the lock-wait command
appeared in the same queue (the two together forming a critical section),
userspace can't manipulate it arbitrarily, and we get almost exactly the
same behavior as implicit sync has today. That means any implicitly-sync'd
buffer from any process can be fully trusted by a compositor to signal in a
finite time, and possibly even trusted by the kernel. The only thing that's
different is that a malicious process can disable implicit sync for a
buffer in all processes and the kernel, but it can't hang other processes
or the kernel (it can only hang itself, and the kernel will be notified).
So I'm a happy panda now. :)

Marek



* Re: [Mesa-dev] Linux Graphics Next: Userspace submission update
  2021-06-03  8:20                         ` Marek Olšák
@ 2021-06-03 10:03                           ` Daniel Vetter
  2021-06-03 10:55                             ` Marek Olšák
  0 siblings, 1 reply; 50+ messages in thread
From: Daniel Vetter @ 2021-06-03 10:03 UTC (permalink / raw)
  To: Marek Olšák
  Cc: Jason Ekstrand, Christian König, Michel Dänzer,
	dri-devel, ML Mesa-dev

On Thu, Jun 03, 2021 at 04:20:18AM -0400, Marek Olšák wrote:
> On Thu, Jun 3, 2021 at 3:47 AM Daniel Vetter <daniel@ffwll.ch> wrote:
> 
> > On Wed, Jun 02, 2021 at 11:16:39PM -0400, Marek Olšák wrote:
> > > On Wed, Jun 2, 2021 at 2:48 PM Daniel Vetter <daniel@ffwll.ch> wrote:
> > >
> > > > On Wed, Jun 02, 2021 at 05:38:51AM -0400, Marek Olšák wrote:
> > > > > On Wed, Jun 2, 2021 at 5:34 AM Marek Olšák <maraeo@gmail.com> wrote:
> > > > >
> > > > > > Yes, we can't break anything because we don't want to complicate
> > things
> > > > > > for us. It's pretty much all NAK'd already. We are trying to gather
> > > > more
> > > > > > knowledge and then make better decisions.
> > > > > >
> > > > > > The idea we are considering is that we'll expose memory-based sync
> > > > objects
> > > > > > to userspace for read only, and the kernel or hw will strictly
> > control
> > > > the
> > > > > > memory writes to those sync objects. The hole in that idea is that
> > > > > > userspace can decide not to signal a job, so even if userspace
> > can't
> > > > > > overwrite memory-based sync object states arbitrarily, it can still
> > > > decide
> > > > > > not to signal them, and then a future fence is born.
> > > > > >
> > > > >
> > > > > This would actually be treated as a GPU hang caused by that context,
> > so
> > > > it
> > > > > should be fine.
> > > >
> > > > This is practically what I proposed already, except you're not doing it
> > with
> > > > dma_fence. And on the memory fence side this also doesn't actually give
> > > > what you want for that compute model.
> > > >
> > > > This seems like a bit a worst of both worlds approach to me? Tons of
> > work
> > > > in the kernel to hide these not-dma_fence-but-almost, and still pain to
> > > > actually drive the hardware like it should be for compute or direct
> > > > display.
> > > >
> > > > Also maybe I've missed it, but I didn't see any replies to my
> > suggestion
> > > > how to fake the entire dma_fence stuff on top of new hw. Would be
> > > > interesting to know what doesn't work there instead of amd folks going
> > off
> > > > into internal again and then coming back with another rfc from out of
> > > > nowhere :-)
> > > >
> > >
> > > Going internal again is probably a good idea to spare you the long
> > > discussions and not waste your time, but we haven't talked about the
> > > dma_fence stuff internally other than acknowledging that it can be
> > solved.
> > >
> > > The compute use case already uses the hw as-is with no inter-process
> > > sharing, which mostly keeps the kernel out of the picture. It uses
> > glFinish
> > > to sync with GL.
> > >
> > > The gfx use case needs new hardware logic to support implicit and
> > explicit
> > > sync. When we propose a solution, it's usually torn apart the next day by
> > > ourselves.
> > >
> > > Since we are talking about next hw or next next hw, preemption should be
> > > better.
> > >
> > > user queue = user-mapped ring buffer
> > >
> > > For implicit sync, we will only let userspace lock access to a buffer
> > via a
> > > user queue, which waits for the per-buffer sequence counter in memory to
> > be
> > > >= the number assigned by the kernel, and later unlock the access with
> > > another command, which increments the per-buffer sequence counter in
> > memory
> > > with atomic_inc regardless of the number assigned by the kernel. The
> > kernel
> > > counter and the counter in memory can be out-of-sync, and I'll explain
> > why
> > > it's OK. If a process increments the kernel counter but not the memory
> > > counter, that's its problem and it's the same as a GPU hang caused by
> > that
> > > process. If a process increments the memory counter but not the kernel
> > > counter, the ">=" condition alongside atomic_inc guarantee that
> > signaling n
> > > will signal n+1, so it will never deadlock but also it will effectively
> > > disable synchronization. This method of disabling synchronization is
> > > similar to a process corrupting the buffer, which should be fine. Can you
> > > find any flaw in it? I can't find any.
> >
> > Hm maybe I misunderstood what exactly you wanted to do earlier. That kind
> > of "we let userspace free-wheel whatever it wants, kernel ensures
> > correctness of the resulting chain of dma_fence with reset the entire
> > context" is what I proposed too.
> >
> > Like you say, userspace is allowed to render garbage already.
> >
> > > The explicit submit can be done by userspace (if there is no
> > > synchronization), but we plan to use the kernel to do it for implicit
> > sync.
> > > Essentially, the kernel will receive a buffer list and addresses of wait
> > > commands in the user queue. It will assign new sequence numbers to all
> > > buffers and write those numbers into the wait commands, and ring the hw
> > > doorbell to start execution of that queue.
> >
> > Yeah for implicit sync I think kernel and using drm/scheduler to sort out
> > the dma_fence dependencies is probably best. Since you can filter out
> > which dma_fence you hand to the scheduler for dependency tracking you can
> > filter out your own ones and let the hw handle those directly (depending
> > how much your hw can do and all that). On i915 we might do that to be able
> > to use MI_SEMAPHORE_WAIT/SIGNAL functionality in the hw and fw scheduler.
> >
> > For buffer tracking with implicit sync I think cleanest is probably to
> > still keep them wrapped as dma_fence and stuffed into dma_resv, but
> > conceptually it's the same. If we let every driver reinvent their own
> > buffer tracking just because the hw works a bit different it'll be a mess.
> >
> > Wrt wait commands: I'm honestly not sure why you'd do that. Userspace gets
> > to keep the pieces if it gets it wrong. You do still need to handle
> > external dma_fence though, hence drm/scheduler frontend to sort these out.
> >
> 
> The reason is to disallow a lower-privileged process to deadlock/hang a
> higher-privileged process where the kernel can't tell who did it. If the
> implicit-sync sequence counter is read only to userspace and only
> incrementable by the unlock-signal command after the lock-wait command
> appeared in the same queue (both together forming a critical section),
> userspace can't manipulate it arbitrarily and we get almost the exact same
> behavior as implicit sync has today. That means any implicitly-sync'd
> buffer from any process can be fully trusted by a compositor to signal in a
> finite time, and possibly even trusted by the kernel. The only thing that's
> different is that a malicious process can disable implicit sync for a
> buffer in all processes/kernel, but it can't hang other processes/kernel
> (it can only hang itself and the kernel will be notified). So I'm a happy
> panda now. :)

Yeah, I think that's not going to work too well, and it piles up too many
hacks. Within a drm_file fd you can do whatever you feel like, since it's
just one client.

But once implicit sync kicks in I think you need to go with dma_fence and
drm/scheduler to handle the dependencies, and TDR kicking in. With the
dma_fence you do know who's the offender - you might not know why, but
that doesn't matter, you just shred the entire context and let that
userspace figure out the details.

I think trying to make memory fences work as implicit sync directly,
without wrapping them in a dma_fence and assorted guarantees, will just
not work.

And once you do wrap them in dma_fence, then all the other problems go
away: cross-driver sync, syncfiles, ... So I really don't see the benefit
of this half-way approach.

Yes there's going to be a tad bit of overhead, but that's already there in
the current model. And it can't hurt to have a bit of motivation for
compositors to switch over to userspace memory fences properly.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

* Re: [Mesa-dev] Linux Graphics Next: Userspace submission update
  2021-06-03 10:03                           ` Daniel Vetter
@ 2021-06-03 10:55                             ` Marek Olšák
  2021-06-03 11:22                               ` Daniel Vetter
  0 siblings, 1 reply; 50+ messages in thread
From: Marek Olšák @ 2021-06-03 10:55 UTC (permalink / raw)
  To: Daniel Vetter
  Cc: Christian König, Michel Dänzer, dri-devel,
	Jason Ekstrand, ML Mesa-dev

On Thu., Jun. 3, 2021, 06:03 Daniel Vetter, <daniel@ffwll.ch> wrote:

> On Thu, Jun 03, 2021 at 04:20:18AM -0400, Marek Olšák wrote:
> > On Thu, Jun 3, 2021 at 3:47 AM Daniel Vetter <daniel@ffwll.ch> wrote:
> >
> > > On Wed, Jun 02, 2021 at 11:16:39PM -0400, Marek Olšák wrote:
> > > > On Wed, Jun 2, 2021 at 2:48 PM Daniel Vetter <daniel@ffwll.ch>
> wrote:
> > > >
> > > > > On Wed, Jun 02, 2021 at 05:38:51AM -0400, Marek Olšák wrote:
> > > > > > On Wed, Jun 2, 2021 at 5:34 AM Marek Olšák <maraeo@gmail.com>
> wrote:
> > > > > >
> > > > > > > Yes, we can't break anything because we don't want to
> complicate
> > > things
> > > > > > > for us. It's pretty much all NAK'd already. We are trying to
> gather
> > > > > more
> > > > > > > knowledge and then make better decisions.
> > > > > > >
> > > > > > > The idea we are considering is that we'll expose memory-based
> sync
> > > > > objects
> > > > > > > to userspace for read only, and the kernel or hw will strictly
> > > control
> > > > > the
> > > > > > > memory writes to those sync objects. The hole in that idea is
> that
> > > > > > > userspace can decide not to signal a job, so even if userspace
> > > can't
> > > > > > > overwrite memory-based sync object states arbitrarily, it can
> still
> > > > > decide
> > > > > > > not to signal them, and then a future fence is born.
> > > > > > >
> > > > > >
> > > > > > This would actually be treated as a GPU hang caused by that
> context,
> > > so
> > > > > it
> > > > > > should be fine.
> > > > >
> > > > > This is practically what I proposed already, except you're not doing
> it
> > > with
> > > > > dma_fence. And on the memory fence side this also doesn't actually
> give
> > > > > what you want for that compute model.
> > > > >
> > > > > This seems like a bit a worst of both worlds approach to me? Tons
> of
> > > work
> > > > > in the kernel to hide these not-dma_fence-but-almost, and still
> pain to
> > > > > actually drive the hardware like it should be for compute or direct
> > > > > display.
> > > > >
> > > > > Also maybe I've missed it, but I didn't see any replies to my
> > > suggestion
> > > > > how to fake the entire dma_fence stuff on top of new hw. Would be
> > > > > interesting to know what doesn't work there instead of amd folks
> going
> > > off
> > > > > into internal again and then coming back with another rfc from out
> of
> > > > > nowhere :-)
> > > > >
> > > >
> > > > Going internal again is probably a good idea to spare you the long
> > > > discussions and not waste your time, but we haven't talked about the
> > > > dma_fence stuff internally other than acknowledging that it can be
> > > solved.
> > > >
> > > > The compute use case already uses the hw as-is with no inter-process
> > > > sharing, which mostly keeps the kernel out of the picture. It uses
> > > glFinish
> > > > to sync with GL.
> > > >
> > > > The gfx use case needs new hardware logic to support implicit and
> > > explicit
> > > > sync. When we propose a solution, it's usually torn apart the next
> day by
> > > > ourselves.
> > > >
> > > > Since we are talking about next hw or next next hw, preemption
> should be
> > > > better.
> > > >
> > > > user queue = user-mapped ring buffer
> > > >
> > > > For implicit sync, we will only let userspace lock access to a buffer
> > > via a
> > > > user queue, which waits for the per-buffer sequence counter in
> memory to
> > > be
> > > > >= the number assigned by the kernel, and later unlock the access
> with
> > > > another command, which increments the per-buffer sequence counter in
> > > memory
> > > > with atomic_inc regardless of the number assigned by the kernel. The
> > > kernel
> > > > counter and the counter in memory can be out-of-sync, and I'll
> explain
> > > why
> > > > it's OK. If a process increments the kernel counter but not the
> memory
> > > > counter, that's its problem and it's the same as a GPU hang caused by
> > > that
> > > > process. If a process increments the memory counter but not the
> kernel
> > > > counter, the ">=" condition alongside atomic_inc guarantee that
> > > signaling n
> > > > will signal n+1, so it will never deadlock but also it will
> effectively
> > > > disable synchronization. This method of disabling synchronization is
> > > > similar to a process corrupting the buffer, which should be fine.
> Can you
> > > > find any flaw in it? I can't find any.
> > >
> > > Hm maybe I misunderstood what exactly you wanted to do earlier. That
> kind
> > > of "we let userspace free-wheel whatever it wants, kernel ensures
> > > correctness of the resulting chain of dma_fence with reset the entire
> > > context" is what I proposed too.
> > >
> > > Like you say, userspace is allowed to render garbage already.
> > >
> > > > The explicit submit can be done by userspace (if there is no
> > > > synchronization), but we plan to use the kernel to do it for implicit
> > > sync.
> > > > Essentially, the kernel will receive a buffer list and addresses of
> wait
> > > > commands in the user queue. It will assign new sequence numbers to
> all
> > > > buffers and write those numbers into the wait commands, and ring the
> hw
> > > > doorbell to start execution of that queue.
> > >
> > > Yeah for implicit sync I think kernel and using drm/scheduler to sort
> out
> > > the dma_fence dependencies is probably best. Since you can filter out
> > > which dma_fence you hand to the scheduler for dependency tracking you
> can
> > > filter out your own ones and let the hw handle those directly
> (depending
> > > how much your hw can do and all that). On i915 we might do that to be
> able
> > > to use MI_SEMAPHORE_WAIT/SIGNAL functionality in the hw and fw
> scheduler.
> > >
> > > For buffer tracking with implicit sync I think cleanest is probably to
> > > still keep them wrapped as dma_fence and stuffed into dma_resv, but
> > > conceptually it's the same. If we let every driver reinvent their own
> > > buffer tracking just because the hw works a bit different it'll be a
> mess.
> > >
> > > Wrt wait commands: I'm honestly not sure why you'd do that. Userspace
> gets
> > > to keep the pieces if it gets it wrong. You do still need to handle
> > > external dma_fence though, hence drm/scheduler frontend to sort these
> out.
> > >
> >
> > The reason is to disallow a lower-privileged process to deadlock/hang a
> > higher-privileged process where the kernel can't tell who did it. If the
> > implicit-sync sequence counter is read only to userspace and only
> > incrementable by the unlock-signal command after the lock-wait command
> > appeared in the same queue (both together forming a critical section),
> > userspace can't manipulate it arbitrarily and we get almost the exact
> same
> > behavior as implicit sync has today. That means any implicitly-sync'd
> > buffer from any process can be fully trusted by a compositor to signal
> in a
> > finite time, and possibly even trusted by the kernel. The only thing
> that's
> > different is that a malicious process can disable implicit sync for a
> > buffer in all processes/kernel, but it can't hang other processes/kernel
> > (it can only hang itself and the kernel will be notified). So I'm a happy
> > panda now. :)
>
> Yeah, I think that's not going to work too well, and it piles up too many
> hacks. Within a drm_file fd you can do whatever you feel like, since it's
> just one client.
>
> But once implicit sync kicks in I think you need to go with dma_fence and
> drm/scheduler to handle the dependencies, and TDR kicking in. With the
> dma_fence you do know who's the offender - you might not know why, but
> that doesn't matter, you just shred the entire context and let that
> userspace figure out the details.
>
> I think trying to make memory fences work as implicit sync directly,
> without wrapping them in a dma_fence and assorted guarantees, will just
> not work.
>
> And once you do wrap them in dma_fence, then all the other problems go
> away: cross-driver sync, syncfiles, ... So I really don't see the benefit
> of this half-way approach.
>
> Yes there's going to be a tad bit of overhead, but that's already there in
> the current model. And it can't hurt to have a bit of motivation for
> compositors to switch over to userspace memory fences properly.
>

Well, Christian thinks that we need a high-level synchronization primitive
in hw. I don't know myself, and you may be right. A software scheduler with
user queues might be one option. My part is only to find out how much of
the scheduler logic can be moved to the hardware.

We plan to have memory timeline semaphores, or simply monotonic counters,
and a fence will be represented by the counter address and a constant
sequence number for the <= comparison. One counter can represent up to 2^64
different fences. Giving any process write access to a fence is the same as
giving it the power to manipulate the signalled state of a sequence of up
to 2^64 fences. That could mess up a lot of things. However, if the
hardware had a high level synchronization primitive with access rights and
a limited set of clearly defined operations such that we can formally prove
whether it's safe for everybody, we could have a solution where we don't
have to involve the software scheduler and just let the hardware do
everything.
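A minimal model of such a timeline semaphore and the "<=" fence test looks like this (illustrative Python only, not a hw or driver interface):

```python
class TimelineSemaphore:
    """One monotonic 64-bit counter; a fence is just (semaphore,
    constant sequence number), signalled once seq <= counter value."""

    def __init__(self):
        self.value = 0  # only ever moves forward

    def signal(self, seq):
        # A well-behaved writer only advances the counter.
        self.value = max(self.value, seq)

    def fence(self, seq):
        return MemoryFence(self, seq)


class MemoryFence:
    def __init__(self, timeline, seq):
        self.timeline, self.seq = timeline, seq

    def is_signalled(self):
        # The "<=" comparison from above: fence N is signalled once
        # the counter has reached N.
        return self.seq <= self.timeline.value
```

Signalling sequence number 5 implicitly signals every fence with a lower number on the same counter, which is exactly why write access to the counter is equivalent to control over up to 2^64 fences at once.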

Marek



> -Daniel
> --
> Daniel Vetter
> Software Engineer, Intel Corporation
> http://blog.ffwll.ch
>

* Re: [Mesa-dev] Linux Graphics Next: Userspace submission update
  2021-06-03 10:55                             ` Marek Olšák
@ 2021-06-03 11:22                               ` Daniel Vetter
  2021-06-03 17:52                                 ` Marek Olšák
  0 siblings, 1 reply; 50+ messages in thread
From: Daniel Vetter @ 2021-06-03 11:22 UTC (permalink / raw)
  To: Marek Olšák
  Cc: Christian König, Michel Dänzer, dri-devel,
	Jason Ekstrand, ML Mesa-dev

On Thu, Jun 3, 2021 at 12:55 PM Marek Olšák <maraeo@gmail.com> wrote:
>
> On Thu., Jun. 3, 2021, 06:03 Daniel Vetter, <daniel@ffwll.ch> wrote:
>>
>> On Thu, Jun 03, 2021 at 04:20:18AM -0400, Marek Olšák wrote:
>> > On Thu, Jun 3, 2021 at 3:47 AM Daniel Vetter <daniel@ffwll.ch> wrote:
>> >
>> > > On Wed, Jun 02, 2021 at 11:16:39PM -0400, Marek Olšák wrote:
>> > > > On Wed, Jun 2, 2021 at 2:48 PM Daniel Vetter <daniel@ffwll.ch> wrote:
>> > > >
>> > > > > On Wed, Jun 02, 2021 at 05:38:51AM -0400, Marek Olšák wrote:
>> > > > > > On Wed, Jun 2, 2021 at 5:34 AM Marek Olšák <maraeo@gmail.com> wrote:
>> > > > > >
>> > > > > > > Yes, we can't break anything because we don't want to complicate
>> > > things
>> > > > > > > for us. It's pretty much all NAK'd already. We are trying to gather
>> > > > > more
>> > > > > > > knowledge and then make better decisions.
>> > > > > > >
>> > > > > > > The idea we are considering is that we'll expose memory-based sync
>> > > > > objects
>> > > > > > > to userspace for read only, and the kernel or hw will strictly
>> > > control
>> > > > > the
>> > > > > > > memory writes to those sync objects. The hole in that idea is that
>> > > > > > > userspace can decide not to signal a job, so even if userspace
>> > > can't
>> > > > > > > overwrite memory-based sync object states arbitrarily, it can still
>> > > > > decide
>> > > > > > > not to signal them, and then a future fence is born.
>> > > > > > >
>> > > > > >
>> > > > > > This would actually be treated as a GPU hang caused by that context,
>> > > so
>> > > > > it
>> > > > > > should be fine.
>> > > > >
>> > > > > This is practically what I proposed already, except you're not doing it
>> > > with
>> > > > > dma_fence. And on the memory fence side this also doesn't actually give
>> > > > > what you want for that compute model.
>> > > > >
>> > > > > This seems like a bit a worst of both worlds approach to me? Tons of
>> > > work
>> > > > > in the kernel to hide these not-dma_fence-but-almost, and still pain to
>> > > > > actually drive the hardware like it should be for compute or direct
>> > > > > display.
>> > > > >
>> > > > > Also maybe I've missed it, but I didn't see any replies to my
>> > > suggestion
>> > > > > how to fake the entire dma_fence stuff on top of new hw. Would be
>> > > > > interesting to know what doesn't work there instead of amd folks going
>> > > off
>> > > > > into internal again and then coming back with another rfc from out of
>> > > > > nowhere :-)
>> > > > >
>> > > >
>> > > > Going internal again is probably a good idea to spare you the long
>> > > > discussions and not waste your time, but we haven't talked about the
>> > > > dma_fence stuff internally other than acknowledging that it can be
>> > > solved.
>> > > >
>> > > > The compute use case already uses the hw as-is with no inter-process
>> > > > sharing, which mostly keeps the kernel out of the picture. It uses
>> > > glFinish
>> > > > to sync with GL.
>> > > >
>> > > > The gfx use case needs new hardware logic to support implicit and
>> > > explicit
>> > > > sync. When we propose a solution, it's usually torn apart the next day by
>> > > > ourselves.
>> > > >
>> > > > Since we are talking about next hw or next next hw, preemption should be
>> > > > better.
>> > > >
>> > > > user queue = user-mapped ring buffer
>> > > >
>> > > > For implicit sync, we will only let userspace lock access to a buffer
>> > > via a
>> > > > user queue, which waits for the per-buffer sequence counter in memory to
>> > > be
>> > > > >= the number assigned by the kernel, and later unlock the access with
>> > > > another command, which increments the per-buffer sequence counter in
>> > > memory
>> > > > with atomic_inc regardless of the number assigned by the kernel. The
>> > > kernel
>> > > > counter and the counter in memory can be out-of-sync, and I'll explain
>> > > why
>> > > > it's OK. If a process increments the kernel counter but not the memory
>> > > > counter, that's its problem and it's the same as a GPU hang caused by
>> > > that
>> > > > process. If a process increments the memory counter but not the kernel
>> > > > counter, the ">=" condition alongside atomic_inc guarantee that
>> > > signaling n
>> > > > will signal n+1, so it will never deadlock but also it will effectively
>> > > > disable synchronization. This method of disabling synchronization is
>> > > > similar to a process corrupting the buffer, which should be fine. Can you
>> > > > find any flaw in it? I can't find any.
>> > >
>> > > Hm maybe I misunderstood what exactly you wanted to do earlier. That kind
>> > > of "we let userspace free-wheel whatever it wants, kernel ensures
>> > > correctness of the resulting chain of dma_fence with reset the entire
>> > > context" is what I proposed too.
>> > >
>> > > Like you say, userspace is allowed to render garbage already.
>> > >
>> > > > The explicit submit can be done by userspace (if there is no
>> > > > synchronization), but we plan to use the kernel to do it for implicit
>> > > sync.
>> > > > Essentially, the kernel will receive a buffer list and addresses of wait
>> > > > commands in the user queue. It will assign new sequence numbers to all
>> > > > buffers and write those numbers into the wait commands, and ring the hw
>> > > > doorbell to start execution of that queue.
>> > >
>> > > Yeah for implicit sync I think kernel and using drm/scheduler to sort out
>> > > the dma_fence dependencies is probably best. Since you can filter out
>> > > which dma_fence you hand to the scheduler for dependency tracking you can
>> > > filter out your own ones and let the hw handle those directly (depending
>> > > how much your hw can do and all that). On i915 we might do that to be able
>> > > to use MI_SEMAPHORE_WAIT/SIGNAL functionality in the hw and fw scheduler.
>> > >
>> > > For buffer tracking with implicit sync I think cleanest is probably to
>> > > still keep them wrapped as dma_fence and stuffed into dma_resv, but
>> > > conceptually it's the same. If we let every driver reinvent their own
>> > > buffer tracking just because the hw works a bit different it'll be a mess.
>> > >
>> > > Wrt wait commands: I'm honestly not sure why you'd do that. Userspace gets
>> > > to keep the pieces if it gets it wrong. You do still need to handle
>> > > external dma_fence though, hence drm/scheduler frontend to sort these out.
>> > >
>> >
>> > The reason is to disallow a lower-privileged process to deadlock/hang a
>> > higher-privileged process where the kernel can't tell who did it. If the
>> > implicit-sync sequence counter is read only to userspace and only
>> > incrementable by the unlock-signal command after the lock-wait command
>> > appeared in the same queue (both together forming a critical section),
>> > userspace can't manipulate it arbitrarily and we get almost the exact same
>> > behavior as implicit sync has today. That means any implicitly-sync'd
>> > buffer from any process can be fully trusted by a compositor to signal in a
>> > finite time, and possibly even trusted by the kernel. The only thing that's
>> > different is that a malicious process can disable implicit sync for a
>> > buffer in all processes/kernel, but it can't hang other processes/kernel
>> > (it can only hang itself and the kernel will be notified). So I'm a happy
>> > panda now. :)
>>
>> Yeah, I think that's not going to work too well, and it piles up too many
>> hacks. Within a drm_file fd you can do whatever you feel like, since it's
>> just one client.
>>
>> But once implicit sync kicks in I think you need to go with dma_fence and
>> drm/scheduler to handle the dependencies, and TDR kicking in. With the
>> dma_fence you do know who's the offender - you might not know why, but
>> that doesn't matter, you just shred the entire context and let that
>> userspace figure out the details.
>>
>> I think trying to make memory fences work as implicit sync directly,
>> without wrapping them in a dma_fence and assorted guarantees, will just
>> not work.
>>
>> And once you do wrap them in dma_fence, then all the other problems go
>> away: cross-driver sync, syncfiles, ... So I really don't see the benefit
>> of this half-way approach.
>>
>> Yes there's going to be a tad bit of overhead, but that's already there in
>> the current model. And it can't hurt to have a bit of motivation for
>> compositors to switch over to userspace memory fences properly.
>
>
> Well, Christian thinks that we need a high level synchronization primitive in hw. I don't know myself and you may be right. A software scheduler with user queues might be one option. My part is only to find out how much of the scheduler logic can be moved to the hardware.
>
> We plan to have memory timeline semaphores, or simply monotonic counters, and a fence will be represented by the counter address and a constant sequence number for the <= comparison. One counter can represent up to 2^64 different fences. Giving any process write access to a fence is the same as giving it the power to manipulate the signalled state of a sequence of up to 2^64 fences. That could mess up a lot of things. However, if the hardware had a high level synchronization primitive with access rights and a limited set of clearly defined operations such that we can formally prove whether it's safe for everybody, we could have a solution where we don't have to involve the software scheduler and just let the hardware do everything.

I don't think hw access rights control on memory fences makes sense.
There are two cases:

- brave new world of native userspace memory fences. Currently that's
compute, maybe direct display vk, hopefully/eventually compositors and
desktops too. If you get an untrusted fence, you need to have fallback
logic no matter what, and by design. vk is explicit in stating that if
things hang, you get to keep all the pieces. So the compositor needs
to _always_ treat userspace memory fences as hostile, wrap them in a
timeout, and have a fallback frame/scene in its rendering path.
Probably same for the kernel on display side, maybe down to the
display hw picking the "right" frame depending upon the fence value
right before scanout as we discussed earlier. There's no point in hw
access rights because by design, even if no one tampers with your
fence, it might never signal. So you have to cope with a hostile fence
from untrusted sources anyway (and within an app it's trusted and you
just die, as in stuck in an endless loop until someone sends a SIGKILL,
when you deadlock or get it wrong some other way).

- old compat mode where we need to use dma_fence, otherwise we end up
with another round of "amdgpu redefines implicit sync in incompatible
ways", and Christian and I don't even know yet how to fix the current
round without breaking use-cases badly. So it has to be dma_fence,
and it has to be the same rules as on old hw, or it's just not going
to work. This means you need to force in-order retiring of fences in
the kernel, and you need to enforce timeout. None of this needs hw
access rights control, since once more it's just software constructs
in the kernel. As a first approximation, drm/scheduler + the fence
chain we already have in syncobj is probably good enough for this.
E.g. if userspace rolls the fence backwards then the kernel just
ignores that, because its internal dma_fence has signalled, and
dma_fences never unsignal (it's a bit in struct dma_fence, once it's
set we stop checking hw). And if it doesn't signal in time, then tdr
jumps in and fixes the mess. Hw access control doesn't fix anything
here, because you have to deal with timeout and ordering already
anyway, or the dma_fence contract is broken.
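The one-way signalling rule in the compat case can be sketched like this (a Python model of the behaviour, not the actual struct dma_fence interface):

```python
class Timeline:
    """Userspace-writable memory counter backing the fence."""
    def __init__(self):
        self.value = 0


class WrappedDmaFence:
    """dma_fence-style wrapper: signalling is sticky, so rolling the
    memory counter backwards afterwards changes nothing."""

    def __init__(self, timeline, seq):
        self.timeline, self.seq = timeline, seq
        self._signalled = False  # the one-way bit

    def is_signalled(self):
        if not self._signalled and self.timeline.value >= self.seq:
            self._signalled = True  # set once, never cleared; from
                                    # here on the hw isn't checked
        return self._signalled
```

Timeout enforcement (TDR shredding the context if the fence never signals in time) sits on top of this and needs no hw access control either.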

So in both cases hw access control gains you nothing (at least I'm not
seeing anything), it just makes the design more tricky. "Userspace can
manipulate the fences" is _intentionally_ how these things work, we
need a design that works with that hw design, not against it and
somehow tries to get us back to the old world, but only halfway (i.e.
not useful at all, since old userspace needs us to go all the way back
to dma_fence, and new userspace wants to fully exploit userspace
memory fences without artificial limitations for no reason).
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

* Re: [Mesa-dev] Linux Graphics Next: Userspace submission update
  2021-06-03 11:22                               ` Daniel Vetter
@ 2021-06-03 17:52                                 ` Marek Olšák
  2021-06-03 19:18                                   ` Daniel Vetter
  0 siblings, 1 reply; 50+ messages in thread
From: Marek Olšák @ 2021-06-03 17:52 UTC (permalink / raw)
  To: Daniel Vetter
  Cc: Christian König, Michel Dänzer, dri-devel,
	Jason Ekstrand, ML Mesa-dev

Daniel, I think what you are suggesting is that we need to enable user
queues with the drm scheduler and dma_fence first, and once that works, we
can investigate how much of that kernel logic can be moved to the hw. Would
that work? In theory it shouldn't matter whether the kernel does it or the
hw does it. It's the same code, just in a different place.
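For reference, the kernel-side implicit-sync submit described earlier in the thread (assign fresh sequence numbers, patch the pre-written wait commands in the user queue, ring the doorbell) amounts to roughly the following. This is an illustrative Python model; every name here (`WAIT_GE`, `ring_doorbell`, etc.) is invented for the sketch:

```python
class Buffer:
    def __init__(self, counter_addr):
        self.counter_addr = counter_addr  # address of its memory counter
        self.kernel_counter = 0           # next number the kernel assigns


class UserQueue:
    def __init__(self, size):
        self.ring = [None] * size  # user-mapped ring buffer
        self.started = False

    def ring_doorbell(self):
        self.started = True  # hw begins executing the queue


def kernel_submit(queue, buffers, wait_cmd_slots):
    """Kernel receives a buffer list plus the slots of the wait
    commands in the user queue, numbers them, and kicks the hw."""
    for buf, slot in zip(buffers, wait_cmd_slots):
        n = buf.kernel_counter
        buf.kernel_counter += 1
        # Patch the pre-written wait command: the job must wait until
        # the buffer's memory counter is >= n before touching it.
        queue.ring[slot] = ("WAIT_GE", buf.counter_addr, n)
    queue.ring_doorbell()
```

Whether this loop lives in the kernel or in hw/fw scheduling logic is then an implementation detail, which is the point of the question above.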

Thanks,
Marek

On Thu, Jun 3, 2021 at 7:22 AM Daniel Vetter <daniel@ffwll.ch> wrote:

> On Thu, Jun 3, 2021 at 12:55 PM Marek Olšák <maraeo@gmail.com> wrote:
> >
> > On Thu., Jun. 3, 2021, 06:03 Daniel Vetter, <daniel@ffwll.ch> wrote:
> >>
> >> On Thu, Jun 03, 2021 at 04:20:18AM -0400, Marek Olšák wrote:
> >> > On Thu, Jun 3, 2021 at 3:47 AM Daniel Vetter <daniel@ffwll.ch> wrote:
> >> >
> >> > > On Wed, Jun 02, 2021 at 11:16:39PM -0400, Marek Olšák wrote:
> >> > > > On Wed, Jun 2, 2021 at 2:48 PM Daniel Vetter <daniel@ffwll.ch>
> wrote:
> >> > > >
> >> > > > > On Wed, Jun 02, 2021 at 05:38:51AM -0400, Marek Olšák wrote:
> >> > > > > > On Wed, Jun 2, 2021 at 5:34 AM Marek Olšák <maraeo@gmail.com>
> wrote:
> >> > > > > >
> >> > > > > > > Yes, we can't break anything because we don't want to
> complicate
> >> > > things
> >> > > > > > > for us. It's pretty much all NAK'd already. We are trying
> to gather
> >> > > > > more
> >> > > > > > > knowledge and then make better decisions.
> >> > > > > > >
> >> > > > > > > The idea we are considering is that we'll expose
> memory-based sync
> >> > > > > objects
> >> > > > > > > to userspace for read only, and the kernel or hw will
> strictly
> >> > > control
> >> > > > > the
> >> > > > > > > memory writes to those sync objects. The hole in that idea
> is that
> >> > > > > > > userspace can decide not to signal a job, so even if
> userspace
> >> > > can't
> >> > > > > > > overwrite memory-based sync object states arbitrarily, it
> can still
> >> > > > > decide
> >> > > > > > > not to signal them, and then a future fence is born.
> >> > > > > > >
> >> > > > > >
> >> > > > > > This would actually be treated as a GPU hang caused by that
> context,
> >> > > so
> >> > > > > it
> >> > > > > > should be fine.
> >> > > > >
> >> > > > > This is practically what I proposed already, except you're not
> doing it
> >> > > with
> >> > > > > dma_fence. And on the memory fence side this also doesn't
> actually give
> >> > > > > what you want for that compute model.
> >> > > > >
> >> > > > > This seems like a bit a worst of both worlds approach to me?
> Tons of
> >> > > work
> >> > > > > in the kernel to hide these not-dma_fence-but-almost, and still
> pain to
> >> > > > > actually drive the hardware like it should be for compute or
> direct
> >> > > > > display.
> >> > > > >
> >> > > > > Also maybe I've missed it, but I didn't see any replies to my
> >> > > suggestion
> >> > > > > how to fake the entire dma_fence stuff on top of new hw. Would
> be
> >> > > > > interesting to know what doesn't work there instead of amd
> folks going
> >> > > off
> >> > > > > into internal again and then coming back with another rfc from
> out of
> >> > > > > nowhere :-)
> >> > > > >
> >> > > >
> >> > > > Going internal again is probably a good idea to spare you the long
> >> > > > discussions and not waste your time, but we haven't talked about
> the
> >> > > > dma_fence stuff internally other than acknowledging that it can be
> >> > > solved.
> >> > > >
> >> > > > The compute use case already uses the hw as-is with no
> inter-process
> >> > > > sharing, which mostly keeps the kernel out of the picture. It uses
> >> > > glFinish
> >> > > > to sync with GL.
> >> > > >
> >> > > > The gfx use case needs new hardware logic to support implicit and
> >> > > explicit
> >> > > > sync. When we propose a solution, it's usually torn apart the
> next day by
> >> > > > ourselves.
> >> > > >
> >> > > > Since we are talking about next hw or next next hw, preemption
> should be
> >> > > > better.
> >> > > >
> >> > > > user queue = user-mapped ring buffer
> >> > > >
> >> > > > For implicit sync, we will only let userspace lock access to a
> buffer
> >> > > via a
> >> > > > user queue, which waits for the per-buffer sequence counter in
> memory to
> >> > > be
> >> > > > >= the number assigned by the kernel, and later unlock the access
> with
> >> > > > another command, which increments the per-buffer sequence counter
> in
> >> > > memory
> >> > > > with atomic_inc regardless of the number assigned by the kernel.
> The
> >> > > kernel
> >> > > > counter and the counter in memory can be out-of-sync, and I'll
> explain
> >> > > why
> >> > > > it's OK. If a process increments the kernel counter but not the
> memory
> >> > > > counter, that's its problem and it's the same as a GPU hang
> caused by
> >> > > that
> >> > > > process. If a process increments the memory counter but not the
> kernel
> >> > > > counter, the ">=" condition alongside atomic_inc guarantee that
> >> > > signaling n
> >> > > > will signal n+1, so it will never deadlock but also it will
> effectively
> >> > > > disable synchronization. This method of disabling synchronization
> is
> >> > > > similar to a process corrupting the buffer, which should be fine.
> Can you
> >> > > > find any flaw in it? I can't find any.
> >> > >
> >> > > Hm maybe I misunderstood what exactly you wanted to do earlier.
> That kind
> >> > > of "we let userspace free-wheel whatever it wants, kernel ensures
> >> > > correctness of the resulting chain of dma_fence with resetting the
> entire
> >> > > context" is what I proposed too.
> >> > >
> >> > > Like you say, userspace is allowed to render garbage already.
> >> > >
> >> > > > The explicit submit can be done by userspace (if there is no
> >> > > > synchronization), but we plan to use the kernel to do it for
> implicit
> >> > > sync.
> >> > > > Essentially, the kernel will receive a buffer list and addresses
> of wait
> >> > > > commands in the user queue. It will assign new sequence numbers
> to all
> >> > > > buffers and write those numbers into the wait commands, and ring
> the hw
> >> > > > doorbell to start execution of that queue.
> >> > >
> >> > > Yeah for implicit sync I think kernel and using drm/scheduler to
> sort out
> >> > > the dma_fence dependencies is probably best. Since you can filter
> out
> >> > > which dma_fence you hand to the scheduler for dependency tracking
> you can
> >> > > filter out your own ones and let the hw handle those directly
> (depending
> >> > > how much your hw can do and all that). On i915 we might do that to
> be able
> >> > > to use MI_SEMAPHORE_WAIT/SIGNAL functionality in the hw and fw
> scheduler.
> >> > >
> >> > > For buffer tracking with implicit sync I think cleanest is probably
> to
> >> > > still keep them wrapped as dma_fence and stuffed into dma_resv, but
> >> > > conceptually it's the same. If we let every driver reinvent their
> own
> >> > > buffer tracking just because the hw works a bit different it'll be
> a mess.
> >> > >
> >> > > Wrt wait commands: I'm honestly not sure why you'd do that.
> Userspace gets
> >> > > to keep the pieces if it gets it wrong. You do still need to handle
> >> > > external dma_fence though, hence drm/scheduler frontend to sort
> these out.
> >> > >
> >> >
> >> > The reason is to prevent a lower-privileged process from deadlocking/hanging a
> >> > higher-privileged process where the kernel can't tell who did it. If
> the
> >> > implicit-sync sequence counter is read only to userspace and only
> >> > incrementable by the unlock-signal command after the lock-wait command
> >> > appeared in the same queue (both together forming a critical section),
> >> > userspace can't manipulate it arbitrarily and we get almost the exact
> same
> >> > behavior as implicit sync has today. That means any implicitly-sync'd
> >> > buffer from any process can be fully trusted by a compositor to
> signal in a
> >> > finite time, and possibly even trusted by the kernel. The only thing
> that's
> >> > different is that a malicious process can disable implicit sync for a
> >> > buffer in all processes/kernel, but it can't hang other
> processes/kernel
> >> > (it can only hang itself and the kernel will be notified). So I'm a
> happy
> >> > panda now. :)
> >>
> >> Yeah I think that's not going to work too well, and is too many piled up
> >> hacks. Within a drm_file fd you can do whatever you feel like, since
> it's
> >> just one client.
> >>
> >> But once implicit sync kicks in I think you need to go with dma_fence
> and
> >> drm/scheduler to handle the dependencies, and tdr kicking in. With the
> >> dma_fence you do know who's the offender - you might not know why, but
> >> that doesn't matter, you just shred the entire context and let that
> >> userspace figure out the details.
> >>
> >> I think trying to make memory fences work as implicit sync directly,
> >> without wrapping them in a dma_fence and assorted guarantees, will just
> >> not work.
> >>
> >> And once you do wrap them in dma_fence, then all the other problems go
> >> away: cross-driver sync, syncfiles, ... So I really don't see the
> benefit
> >> of this half-way approach.
> >>
> >> Yes there's going to be a tad bit of overhead, but that's already there
> in
> >> the current model. And it can't hurt to have a bit of motivation for
> >> compositors to switch over to userspace memory fences properly.
> >
> >
> > Well, Christian thinks that we need a high level synchronization
> primitive in hw. I don't know myself and you may be right. A software
> scheduler with user queues might be one option. My part is only to find out
> how much of the scheduler logic can be moved to the hardware.
> >
> > We plan to have memory timeline semaphores, or simply monotonic
> counters, and a fence will be represented by the counter address and a
> constant sequence number for the <= comparison. One counter can represent
> up to 2^64 different fences. Giving any process write access to a fence is
> the same as giving it the power to manipulate the signalled state of a
> sequence of up to 2^64 fences. That could mess up a lot of things. However,
> if the hardware had a high level synchronization primitive with access
> rights and a limited set of clearly defined operations such that we can
> formally prove whether it's safe for everybody, we could have a solution
> where we don't have to involve the software scheduler and just let the
> hardware do everything.
>
> I don't think hw access rights control on memory fences makes sense.
> There's two cases:
>
> - brave new world of native userspace memory fences. Currently that's
> compute, maybe direct display vk, hopefully/eventually compositors and
> desktops too. If you get an untrusted fence, you need to have fallback
> logic no matter what, and by design. vk is explicit in stating that if
> things hang, you get to keep all the pieces. So the compositor needs
> to _always_ treat userspace memory fences as hostile, wrap them in a
> timeout, and have a fallback frame/scene in its rendering path.
> Probably same for the kernel on display side, maybe down to the
> display hw picking the "right" frame depending upon the fence value
> right before scanout as we discussed earlier. There's no point in hw
> access rights because by design, even if no one tampers with your
> fence, it might never signal. So you have to cope with a hostile fence
> from untrusted sources anyway (and within an app it's trusted and you
> >> just die as in stuck in an endless loop until someone sends a SIGKILL
> when you deadlock or get it wrong some other way).
>
> - old compat mode where we need to use dma_fence, otherwise we end up
> with another round of "amdgpu redefines implicit sync in incompatible
> ways", and Christian&me don't even know yet how to fix the current
> round without breaking use-cases badly yet. So it has to be dma_fence,
> and it has to be the same rules as on old hw, or it's just not going
> to work. This means you need to force in-order retiring of fences in
> the kernel, and you need to enforce timeout. None of this needs hw
> access rights control, since once more it's just software constructs
> >> in the kernel. As a first approximation, drm/scheduler + the fence
> chain we already have in syncobj is probably good enough for this.
> E.g. if userspace rolls the fence backwards then the kernel just
> ignores that, because its internal dma_fence has signalled, and
> dma_fences never unsignal (it's a bit in struct dma_fence, once it's
> set we stop checking hw). And if it doesn't signal in time, then tdr
> jumps in and fixes the mess. Hw access control doesn't fix anything
> here, because you have to deal with timeout and ordering already
> anyway, or the dma_fence contract is broken.
>
> So in both cases hw access control gains you nothing (at least I'm not
> seeing anything), it just makes the design more tricky. "Userspace can
> manipulate the fences" is _intentionally_ how these things work, we
> need a design that works with that hw design, not against it and
> somehow tries to get us back to the old world, but only halfway (i.e.
> not useful at all, since old userspace needs us to go all the way back
> to dma_fence, and new userspace wants to fully exploit userspace
> memory fences without artificial limitations for no reason).
> -Daniel
> --
> Daniel Vetter
> Software Engineer, Intel Corporation
> http://blog.ffwll.ch
>
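Daniel's point that dma_fences never unsignal (a sticky bit in struct dma_fence; once set, the kernel stops checking hw) can be illustrated with a small sketch. This is a hypothetical simplification, not the real struct dma_fence machinery: once the wrapping fence has latched, rolling the userspace-writable counter backwards changes nothing.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Simplified stand-in for a dma_fence wrapping a memory fence.
 * All names here are illustrative, not kernel API. */
struct wrapped_fence {
	const _Atomic uint64_t *hw_counter; /* userspace-writable memory */
	uint64_t seqno;
	atomic_bool signalled; /* sticky: set once, never cleared */
};

static bool wrapped_fence_is_signalled(struct wrapped_fence *f)
{
	if (atomic_load(&f->signalled))
		return true; /* latched: stop checking hw entirely */
	if (atomic_load(f->hw_counter) >= f->seqno) {
		atomic_store(&f->signalled, true);
		return true;
	}
	return false;
}
```

With this latching, a process that rolls its counter backwards only affects itself; the kernel-side fence state is unaffected.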


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [Mesa-dev] Linux Graphics Next: Userspace submission update
  2021-06-03 17:52                                 ` Marek Olšák
@ 2021-06-03 19:18                                   ` Daniel Vetter
  2021-06-04  5:26                                     ` Marek Olšák
  0 siblings, 1 reply; 50+ messages in thread
From: Daniel Vetter @ 2021-06-03 19:18 UTC (permalink / raw)
  To: Marek Olšák
  Cc: Christian König, Michel Dänzer, dri-devel,
	Jason Ekstrand, ML Mesa-dev

On Thu, Jun 3, 2021 at 7:53 PM Marek Olšák <maraeo@gmail.com> wrote:
>
> Daniel, I think what you are suggesting is that we need to enable user queues with the drm scheduler and dma_fence first, and once that works, we can investigate how much of that kernel logic can be moved to the hw. Would that work? In theory it shouldn't matter whether the kernel does it or the hw does it. It's the same code, just in a different place.

Yeah I guess that's another way to look at it. Maybe in practice
you'll just move it from the kernel to userspace, which then programs
the hw waits directly into its IB. That's at least how I'd do it on
i915, assuming I'd have such hw. So these fences that userspace
programs directly (to sync within itself) won't even show up as
dependencies in the kernel.

And then yes on the other side you can lift work from the
drm/scheduler wrt dependencies you get in the kernel (whether explicit
sync with sync_file, or implicit sync fished out of dma_resv) and
program the hw directly that way. That would mean that userspace won't
fill the ringbuffer directly, but the kernel would do that, so that
you have space to stuff in the additional waits. Again assuming i915
hw model, maybe works differently on amd. Iirc we have some of that
already in the i915 scheduler, but I'd need to recheck how much it
really uses the hw semaphores.
-Daniel
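The per-buffer sequence-counter protocol quoted above (wait for counter >= kernel-assigned number, unlock with an unconditional atomic_inc) can be sketched on the CPU for illustration. This is a hypothetical sketch only: the names (seq_try_lock, seq_unlock) are invented, and on real hw the wait would be a command in the user queue rather than a CPU poll.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Wait side: access is granted once the per-buffer counter reaches
 * the number the kernel assigned to this submission. */
static bool seq_try_lock(const _Atomic uint64_t *counter, uint64_t assigned)
{
	return atomic_load_explicit(counter, memory_order_acquire) >= assigned;
}

/* Signal side: atomic_inc regardless of the assigned number. Together
 * with the ">=" wait, a skipped signal for n is covered by the signal
 * for n+1, so the scheme can lose synchronization but never deadlock. */
static void seq_unlock(_Atomic uint64_t *counter)
{
	atomic_fetch_add_explicit(counter, 1, memory_order_release);
}
```

Note how the two properties from the thread fall out directly: the wait can only be released by increments, and extra increments merely disable synchronization rather than block anyone.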

> Thanks,
> Marek
>
> On Thu, Jun 3, 2021 at 7:22 AM Daniel Vetter <daniel@ffwll.ch> wrote:
>>
>> On Thu, Jun 3, 2021 at 12:55 PM Marek Olšák <maraeo@gmail.com> wrote:
>> >
>> > On Thu., Jun. 3, 2021, 06:03 Daniel Vetter, <daniel@ffwll.ch> wrote:
>> >>
>> >> On Thu, Jun 03, 2021 at 04:20:18AM -0400, Marek Olšák wrote:
>> >> > On Thu, Jun 3, 2021 at 3:47 AM Daniel Vetter <daniel@ffwll.ch> wrote:
>> >> >
>> >> > > On Wed, Jun 02, 2021 at 11:16:39PM -0400, Marek Olšák wrote:
>> >> > > > On Wed, Jun 2, 2021 at 2:48 PM Daniel Vetter <daniel@ffwll.ch> wrote:
>> >> > > >
>> >> > > > > On Wed, Jun 02, 2021 at 05:38:51AM -0400, Marek Olšák wrote:
>> >> > > > > > On Wed, Jun 2, 2021 at 5:34 AM Marek Olšák <maraeo@gmail.com> wrote:
>> >> > > > > >
>> >> > > > > > > Yes, we can't break anything because we don't want to complicate
>> >> > > things
>> >> > > > > > > for us. It's pretty much all NAK'd already. We are trying to gather
>> >> > > > > more
>> >> > > > > > > knowledge and then make better decisions.
>> >> > > > > > >
>> >> > > > > > > The idea we are considering is that we'll expose memory-based sync
>> >> > > > > objects
>> >> > > > > > > to userspace for read only, and the kernel or hw will strictly
>> >> > > control
>> >> > > > > the
>> >> > > > > > > memory writes to those sync objects. The hole in that idea is that
>> >> > > > > > > userspace can decide not to signal a job, so even if userspace
>> >> > > can't
>> >> > > > > > > overwrite memory-based sync object states arbitrarily, it can still
>> >> > > > > decide
>> >> > > > > > > not to signal them, and then a future fence is born.
>> >> > > > > > >
>> >> > > > > >
>> >> > > > > > This would actually be treated as a GPU hang caused by that context,
>> >> > > so
>> >> > > > > it
>> >> > > > > > should be fine.
>> >> > > > >
>> >> > > > > This is practically what I proposed already, except you're not doing it
>> >> > > with
>> >> > > > > dma_fence. And on the memory fence side this also doesn't actually give
>> >> > > > > what you want for that compute model.
>> >> > > > >
>> >> > > > > This seems like a bit a worst of both worlds approach to me? Tons of
>> >> > > work
>> >> > > > > in the kernel to hide these not-dma_fence-but-almost, and still pain to
>> >> > > > > actually drive the hardware like it should be for compute or direct
>> >> > > > > display.
>> >> > > > >
>> >> > > > > Also maybe I've missed it, but I didn't see any replies to my
>> >> > > suggestion
>> >> > > > > how to fake the entire dma_fence stuff on top of new hw. Would be
>> >> > > > > interesting to know what doesn't work there instead of amd folks going
>> >> > > off
>> >> > > > > into internal again and then coming back with another rfc from out of
>> >> > > > > nowhere :-)
>> >> > > > >
>> >> > > >
>> >> > > > Going internal again is probably a good idea to spare you the long
>> >> > > > discussions and not waste your time, but we haven't talked about the
>> >> > > > dma_fence stuff internally other than acknowledging that it can be
>> >> > > solved.
>> >> > > >
>> >> > > > The compute use case already uses the hw as-is with no inter-process
>> >> > > > sharing, which mostly keeps the kernel out of the picture. It uses
>> >> > > glFinish
>> >> > > > to sync with GL.
>> >> > > >
>> >> > > > The gfx use case needs new hardware logic to support implicit and
>> >> > > explicit
>> >> > > > sync. When we propose a solution, it's usually torn apart the next day by
>> >> > > > ourselves.
>> >> > > >
>> >> > > > Since we are talking about next hw or next next hw, preemption should be
>> >> > > > better.
>> >> > > >
>> >> > > > user queue = user-mapped ring buffer
>> >> > > >
>> >> > > > For implicit sync, we will only let userspace lock access to a buffer
>> >> > > via a
>> >> > > > user queue, which waits for the per-buffer sequence counter in memory to
>> >> > > be
>> >> > > > >= the number assigned by the kernel, and later unlock the access with
>> >> > > > another command, which increments the per-buffer sequence counter in
>> >> > > memory
>> >> > > > with atomic_inc regardless of the number assigned by the kernel. The
>> >> > > kernel
>> >> > > > counter and the counter in memory can be out-of-sync, and I'll explain
>> >> > > why
>> >> > > > it's OK. If a process increments the kernel counter but not the memory
>> >> > > > counter, that's its problem and it's the same as a GPU hang caused by
>> >> > > that
>> >> > > > process. If a process increments the memory counter but not the kernel
>> >> > > > counter, the ">=" condition alongside atomic_inc guarantee that
>> >> > > signaling n
>> >> > > > will signal n+1, so it will never deadlock but also it will effectively
>> >> > > > disable synchronization. This method of disabling synchronization is
>> >> > > > similar to a process corrupting the buffer, which should be fine. Can you
>> >> > > > find any flaw in it? I can't find any.
>> >> > >
>> >> > > Hm maybe I misunderstood what exactly you wanted to do earlier. That kind
>> >> > > of "we let userspace free-wheel whatever it wants, kernel ensures
>> >> > > correctness of the resulting chain of dma_fence with resetting the entire
>> >> > > context" is what I proposed too.
>> >> > >
>> >> > > Like you say, userspace is allowed to render garbage already.
>> >> > >
>> >> > > > The explicit submit can be done by userspace (if there is no
>> >> > > > synchronization), but we plan to use the kernel to do it for implicit
>> >> > > sync.
>> >> > > > Essentially, the kernel will receive a buffer list and addresses of wait
>> >> > > > commands in the user queue. It will assign new sequence numbers to all
>> >> > > > buffers and write those numbers into the wait commands, and ring the hw
>> >> > > > doorbell to start execution of that queue.
>> >> > >
>> >> > > Yeah for implicit sync I think kernel and using drm/scheduler to sort out
>> >> > > the dma_fence dependencies is probably best. Since you can filter out
>> >> > > which dma_fence you hand to the scheduler for dependency tracking you can
>> >> > > filter out your own ones and let the hw handle those directly (depending
>> >> > > how much your hw can do and all that). On i915 we might do that to be able
>> >> > > to use MI_SEMAPHORE_WAIT/SIGNAL functionality in the hw and fw scheduler.
>> >> > >
>> >> > > For buffer tracking with implicit sync I think cleanest is probably to
>> >> > > still keep them wrapped as dma_fence and stuffed into dma_resv, but
>> >> > > conceptually it's the same. If we let every driver reinvent their own
>> >> > > buffer tracking just because the hw works a bit different it'll be a mess.
>> >> > >
>> >> > > Wrt wait commands: I'm honestly not sure why you'd do that. Userspace gets
>> >> > > to keep the pieces if it gets it wrong. You do still need to handle
>> >> > > external dma_fence though, hence drm/scheduler frontend to sort these out.
>> >> > >
>> >> >
>> >> > The reason is to prevent a lower-privileged process from deadlocking/hanging a
>> >> > higher-privileged process where the kernel can't tell who did it. If the
>> >> > implicit-sync sequence counter is read only to userspace and only
>> >> > incrementable by the unlock-signal command after the lock-wait command
>> >> > appeared in the same queue (both together forming a critical section),
>> >> > userspace can't manipulate it arbitrarily and we get almost the exact same
>> >> > behavior as implicit sync has today. That means any implicitly-sync'd
>> >> > buffer from any process can be fully trusted by a compositor to signal in a
>> >> > finite time, and possibly even trusted by the kernel. The only thing that's
>> >> > different is that a malicious process can disable implicit sync for a
>> >> > buffer in all processes/kernel, but it can't hang other processes/kernel
>> >> > (it can only hang itself and the kernel will be notified). So I'm a happy
>> >> > panda now. :)
>> >>
>> >> Yeah I think that's not going to work too well, and is too many piled up
>> >> hacks. Within a drm_file fd you can do whatever you feel like, since it's
>> >> just one client.
>> >>
>> >> But once implicit sync kicks in I think you need to go with dma_fence and
>> >> drm/scheduler to handle the dependencies, and tdr kicking in. With the
>> >> dma_fence you do know who's the offender - you might not know why, but
>> >> that doesn't matter, you just shred the entire context and let that
>> >> userspace figure out the details.
>> >>
>> >> I think trying to make memory fences work as implicit sync directly,
>> >> without wrapping them in a dma_fence and assorted guarantees, will just
>> >> not work.
>> >>
>> >> And once you do wrap them in dma_fence, then all the other problems go
>> >> away: cross-driver sync, syncfiles, ... So I really don't see the benefit
>> >> of this half-way approach.
>> >>
>> >> Yes there's going to be a tad bit of overhead, but that's already there in
>> >> the current model. And it can't hurt to have a bit of motivation for
>> >> compositors to switch over to userspace memory fences properly.
>> >
>> >
>> > Well, Christian thinks that we need a high level synchronization primitive in hw. I don't know myself and you may be right. A software scheduler with user queues might be one option. My part is only to find out how much of the scheduler logic can be moved to the hardware.
>> >
>> > We plan to have memory timeline semaphores, or simply monotonic counters, and a fence will be represented by the counter address and a constant sequence number for the <= comparison. One counter can represent up to 2^64 different fences. Giving any process write access to a fence is the same as giving it the power to manipulate the signalled state of a sequence of up to 2^64 fences. That could mess up a lot of things. However, if the hardware had a high level synchronization primitive with access rights and a limited set of clearly defined operations such that we can formally prove whether it's safe for everybody, we could have a solution where we don't have to involve the software scheduler and just let the hardware do everything.
>>
>> I don't think hw access rights control on memory fences makes sense.
>> There's two cases:
>>
>> - brave new world of native userspace memory fences. Currently that's
>> compute, maybe direct display vk, hopefully/eventually compositors and
>> desktops too. If you get an untrusted fence, you need to have fallback
>> logic no matter what, and by design. vk is explicit in stating that if
>> things hang, you get to keep all the pieces. So the compositor needs
>> to _always_ treat userspace memory fences as hostile, wrap them in a
>> timeout, and have a fallback frame/scene in its rendering path.
>> Probably same for the kernel on display side, maybe down to the
>> display hw picking the "right" frame depending upon the fence value
>> right before scanout as we discussed earlier. There's no point in hw
>> access rights because by design, even if no one tampers with your
>> fence, it might never signal. So you have to cope with a hostile fence
>> from untrusted sources anyway (and within an app it's trusted and you
>> just die as in stuck in an endless loop until someone sends a SIGKILL
>> when you deadlock or get it wrong some other way).
>>
>> - old compat mode where we need to use dma_fence, otherwise we end up
>> with another round of "amdgpu redefines implicit sync in incompatible
>> ways", and Christian&me don't even know yet how to fix the current
>> round without breaking use-cases badly yet. So it has to be dma_fence,
>> and it has to be the same rules as on old hw, or it's just not going
>> to work. This means you need to force in-order retiring of fences in
>> the kernel, and you need to enforce timeout. None of this needs hw
>> access rights control, since once more it's just software constructs
>> in the kernel. As a first approximation, drm/scheduler + the fence
>> chain we already have in syncobj is probably good enough for this.
>> E.g. if userspace rolls the fence backwards then the kernel just
>> ignores that, because its internal dma_fence has signalled, and
>> dma_fences never unsignal (it's a bit in struct dma_fence, once it's
>> set we stop checking hw). And if it doesn't signal in time, then tdr
>> jumps in and fixes the mess. Hw access control doesn't fix anything
>> here, because you have to deal with timeout and ordering already
>> anyway, or the dma_fence contract is broken.
>>
>> So in both cases hw access control gains you nothing (at least I'm not
>> seeing anything), it just makes the design more tricky. "Userspace can
>> manipulate the fences" is _intentionally_ how these things work, we
>> need a design that works with that hw design, not against it and
>> somehow tries to get us back to the old world, but only halfway (i.e.
>> not useful at all, since old userspace needs us to go all the way back
>> to dma_fence, and new userspace wants to fully exploit userspace
>> memory fences without artificial limitations for no reason).
>> -Daniel
>> --
>> Daniel Vetter
>> Software Engineer, Intel Corporation
>> http://blog.ffwll.ch



-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [Mesa-dev] Linux Graphics Next: Userspace submission update
  2021-06-03 19:18                                   ` Daniel Vetter
@ 2021-06-04  5:26                                     ` Marek Olšák
  0 siblings, 0 replies; 50+ messages in thread
From: Marek Olšák @ 2021-06-04  5:26 UTC (permalink / raw)
  To: Daniel Vetter
  Cc: Christian König, Michel Dänzer, dri-devel,
	Jason Ekstrand, ML Mesa-dev

[-- Attachment #1: Type: text/plain, Size: 16211 bytes --]

On Thu., Jun. 3, 2021, 15:18 Daniel Vetter, <daniel@ffwll.ch> wrote:

> On Thu, Jun 3, 2021 at 7:53 PM Marek Olšák <maraeo@gmail.com> wrote:
> >
> > Daniel, I think what you are suggesting is that we need to enable user
> queues with the drm scheduler and dma_fence first, and once that works, we
> can investigate how much of that kernel logic can be moved to the hw. Would
> that work? In theory it shouldn't matter whether the kernel does it or the
> hw does it. It's the same code, just in a different place.
>
> Yeah I guess that's another way to look at it. Maybe in practice
> you'll just move it from the kernel to userspace, which then programs
> the hw waits directly into its IB. That's at least how I'd do it on
> i915, assuming I'd have such hw. So these fences that userspace
> programs directly (to sync within itself) won't even show up as
> dependencies in the kernel.
>
> And then yes on the other side you can lift work from the
> drm/scheduler wrt dependencies you get in the kernel (whether explicit
> sync with sync_file, or implicit sync fished out of dma_resv) and
> program the hw directly that way. That would mean that userspace won't
> fill the ringbuffer directly, but the kernel would do that, so that
> you have space to stuff in the additional waits. Again assuming i915
> hw model, maybe works differently on amd. Iirc we have some of that
> already in the i915 scheduler, but I'd need to recheck how much it
> really uses the hw semaphores.
>

I was thinking we would pass per process syncobj handles and buffer handles
into commands in the user queue, or something equivalent. We do have a
large degree of programmability in the hw that we can do something like
that. The question is whether this high level user->hw interface would have
any advantage over trivial polling on memory, etc. My impression is no
because the kernel would be robust enough that it wouldn't matter what
userspace does, but I don't know. Anyway, all we need is user queues and
what you proposed seems totally sufficient.

Marek
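The "trivial polling on memory" alternative mentioned above, against the memory timeline semaphores described earlier in the thread (a fence is a counter address plus a constant sequence number, signalled once seqno <= counter), could look roughly like this. All names are illustrative assumptions; a real consumer such as a compositor would bound the wait and fall back, since an untrusted fence may never signal.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* A fence on a memory timeline semaphore: one monotonic 64-bit
 * counter can represent up to 2^64 fences, one per sequence number. */
struct mem_fence {
	const _Atomic uint64_t *counter; /* monotonic timeline in memory */
	uint64_t seqno;                  /* point on the timeline */
};

static bool mem_fence_signaled(const struct mem_fence *f)
{
	return f->seqno <=
	       atomic_load_explicit(f->counter, memory_order_acquire);
}

/* Bounded poll: an untrusted fence must be wrapped in a timeout, so
 * the caller can switch to a fallback frame/scene when this fails. */
static bool mem_fence_wait(const struct mem_fence *f, unsigned max_spins)
{
	for (unsigned i = 0; i < max_spins; i++)
		if (mem_fence_signaled(f))
			return true;
	return false;
}
```

The point of the question above is whether a richer user->hw interface buys anything over this kind of polling, given that the kernel must already be robust against whatever userspace writes to the counter.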

-Daniel
>
> > Thanks,
> > Marek
> >
> > On Thu, Jun 3, 2021 at 7:22 AM Daniel Vetter <daniel@ffwll.ch> wrote:
> >>
> >> On Thu, Jun 3, 2021 at 12:55 PM Marek Olšák <maraeo@gmail.com> wrote:
> >> >
> >> > On Thu., Jun. 3, 2021, 06:03 Daniel Vetter, <daniel@ffwll.ch> wrote:
> >> >>
> >> >> On Thu, Jun 03, 2021 at 04:20:18AM -0400, Marek Olšák wrote:
> >> >> > On Thu, Jun 3, 2021 at 3:47 AM Daniel Vetter <daniel@ffwll.ch>
> wrote:
> >> >> >
> >> >> > > On Wed, Jun 02, 2021 at 11:16:39PM -0400, Marek Olšák wrote:
> >> >> > > > On Wed, Jun 2, 2021 at 2:48 PM Daniel Vetter <daniel@ffwll.ch>
> wrote:
> >> >> > > >
> >> >> > > > > On Wed, Jun 02, 2021 at 05:38:51AM -0400, Marek Olšák wrote:
> >> >> > > > > > On Wed, Jun 2, 2021 at 5:34 AM Marek Olšák <
> maraeo@gmail.com> wrote:
> >> >> > > > > >
> >> >> > > > > > > Yes, we can't break anything because we don't want to
> complicate
> >> >> > > things
> >> >> > > > > > > for us. It's pretty much all NAK'd already. We are
> trying to gather
> >> >> > > > > more
> >> >> > > > > > > knowledge and then make better decisions.
> >> >> > > > > > >
> >> >> > > > > > > The idea we are considering is that we'll expose
> memory-based sync
> >> >> > > > > objects
> >> >> > > > > > > to userspace for read only, and the kernel or hw will
> strictly
> >> >> > > control
> >> >> > > > > the
> >> >> > > > > > > memory writes to those sync objects. The hole in that
> idea is that
> >> >> > > > > > > userspace can decide not to signal a job, so even if
> userspace
> >> >> > > can't
> >> >> > > > > > > overwrite memory-based sync object states arbitrarily,
> it can still
> >> >> > > > > decide
> >> >> > > > > > > not to signal them, and then a future fence is born.
> >> >> > > > > > >
> >> >> > > > > >
> >> >> > > > > > This would actually be treated as a GPU hang caused by
> that context,
> >> >> > > so
> >> >> > > > > it
> >> >> > > > > > should be fine.
> >> >> > > > >
> >> >> > > > > This is practically what I proposed already, except you're not
> doing it
> >> >> > > with
> >> >> > > > > dma_fence. And on the memory fence side this also doesn't
> actually give
> >> >> > > > > what you want for that compute model.
> >> >> > > > >
> >> >> > > > > This seems like a bit a worst of both worlds approach to me?
> Tons of
> >> >> > > work
> >> >> > > > > in the kernel to hide these not-dma_fence-but-almost, and
> still pain to
> >> >> > > > > actually drive the hardware like it should be for compute or
> direct
> >> >> > > > > display.
> >> >> > > > >
> >> >> > > > > Also maybe I've missed it, but I didn't see any replies to my
> >> >> > > suggestion
> >> >> > > > > how to fake the entire dma_fence stuff on top of new hw.
> Would be
> >> >> > > > > interesting to know what doesn't work there instead of amd
> folks going
> >> >> > > of
> >> >> > > > > into internal again and then coming back with another rfc
> from out of
> >> >> > > > > nowhere :-)
> >> >> > > > >
> >> >> > > >
> >> >> > > > Going internal again is probably a good idea to spare you the
> long
> >> >> > > > discussions and not waste your time, but we haven't talked
> about the
> >> >> > > > dma_fence stuff internally other than acknowledging that it
> can be
> >> >> > > solved.
> >> >> > > >
> >> >> > > > The compute use case already uses the hw as-is with no
> inter-process
> >> >> > > > sharing, which mostly keeps the kernel out of the picture. It
> uses
> >> >> > > glFinish
> >> >> > > > to sync with GL.
> >> >> > > >
> >> >> > > > The gfx use case needs new hardware logic to support implicit
> and
> >> >> > > explicit
> >> >> > > > sync. When we propose a solution, it's usually torn apart the
> next day by
> >> >> > > > ourselves.
> >> >> > > >
> >> >> > > > Since we are talking about next hw or next next hw, preemption
> should be
> >> >> > > > better.
> >> >> > > >
> >> >> > > > user queue = user-mapped ring buffer
> >> >> > > >
> >> >> > > > For implicit sync, we will only let userspace lock access to a
> buffer
> >> >> > > via a
> >> >> > > > user queue, which waits for the per-buffer sequence counter in
> memory to
> >> >> > > be
> >> >> > > > >= the number assigned by the kernel, and later unlock the
> access with
> >> >> > > > another command, which increments the per-buffer sequence
> counter in
> >> >> > > memory
> >> >> > > > with atomic_inc regardless of the number assigned by the
> kernel. The
> >> >> > > kernel
> >> >> > > > counter and the counter in memory can be out-of-sync, and I'll
> explain
> >> >> > > why
> >> >> > > > it's OK. If a process increments the kernel counter but not
> the memory
> >> >> > > > counter, that's its problem and it's the same as a GPU hang
> caused by
> >> >> > > that
> >> >> > > > process. If a process increments the memory counter but not
> the kernel
> >> >> > > > counter, the ">=" condition alongside atomic_inc guarantee that
> >> >> > > signaling n
> >> >> > > > will signal n+1, so it will never deadlock but also it will
> effectively
> >> >> > > > disable synchronization. This method of disabling
> synchronization is
> >> >> > > > similar to a process corrupting the buffer, which should be
> fine. Can you
> >> >> > > > find any flaw in it? I can't find any.
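The lock/unlock counter scheme above can be modeled in a few lines. This is a hypothetical sketch of the semantics only (the names and structure are illustrative, not driver code); it shows why a rogue increment of the memory counter can only disable synchronization, never deadlock a waiter:

```python
from dataclasses import dataclass

@dataclass
class BufSync:
    """Model of one per-buffer sync object (hypothetical names)."""
    mem_counter: int = 0      # written only by the unlock command (atomic_inc)
    kernel_counter: int = 0   # next wait value handed out by the kernel

def assign_wait_value(s: BufSync) -> int:
    """Kernel side: give the next access the value it must wait for."""
    v = s.kernel_counter
    s.kernel_counter += 1
    return v

def lock_ready(s: BufSync, wait_value: int) -> bool:
    """Wait condition the user queue polls before touching the buffer.
    '>=' instead of '==': a counter that ran ahead still satisfies every
    older wait, so a rogue increment can only disable synchronization,
    never deadlock other waiters."""
    return s.mem_counter >= wait_value

def unlock_signal(s: BufSync) -> None:
    """Unlock command: unconditional increment, regardless of the
    kernel-side counter."""
    s.mem_counter += 1
```

A process that bumps the memory counter without going through the kernel simply makes later waits pass early, which corresponds to the "same as corrupting the buffer" argument above.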
> >> >> > >
> >> >> > > Hm maybe I misunderstood what exactly you wanted to do earlier. That kind of "we let userspace free-wheel whatever it wants, kernel ensures correctness of the resulting chain of dma_fence with reset the entire context" is what I proposed too.
> >> >> > >
> >> >> > > Like you say, userspace is allowed to render garbage already.
> >> >> > >
> >> >> > > > The explicit submit can be done by userspace (if there is no synchronization), but we plan to use the kernel to do it for implicit sync. Essentially, the kernel will receive a buffer list and addresses of wait commands in the user queue. It will assign new sequence numbers to all buffers and write those numbers into the wait commands, and ring the hw doorbell to start execution of that queue.
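That kernel submit path can be sketched as follows; the queue and buffer objects here are hypothetical stand-ins, not an actual uAPI:

```python
from dataclasses import dataclass, field

@dataclass
class Buffer:
    """Hypothetical per-buffer state tracked by the kernel."""
    kernel_counter: int = 0   # last sequence number assigned by the kernel

@dataclass
class UserQueue:
    """Hypothetical user-mapped ring buffer, seen from the kernel side."""
    memory: dict = field(default_factory=dict)  # wait-command slots by address
    doorbell_rung: bool = False

    def ring_doorbell(self) -> None:
        self.doorbell_rung = True

def kernel_submit(queue: UserQueue, buffers: list, wait_cmd_addrs: list) -> None:
    """Assign fresh sequence numbers to every buffer in the list, patch
    them into the queue's wait commands, then ring the hw doorbell."""
    for buf, addr in zip(buffers, wait_cmd_addrs):
        buf.kernel_counter += 1
        queue.memory[addr] = buf.kernel_counter  # value the lock-wait compares against
    queue.ring_doorbell()
```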
> >> >> > >
> >> >> > > Yeah for implicit sync I think kernel and using drm/scheduler to
> sort out
> >> >> > > the dma_fence dependencies is probably best. Since you can
> filter out
> >> >> > > which dma_fence you hand to the scheduler for dependency
> tracking you can
> >> >> > > filter out your own ones and let the hw handle those directly
> (depending
> >> >> > > how much your hw can do an all that). On i915 we might do that
> to be able
> >> >> > > to use MI_SEMAPHORE_WAIT/SIGNAL functionality in the hw and fw
> scheduler.
> >> >> > >
> >> >> > > For buffer tracking with implicit sync I think cleanest is
> probably to
> >> >> > > still keep them wrapped as dma_fence and stuffed into dma_resv,
> but
> >> >> > > conceptually it's the same. If we let every driver reinvent
> their own
> >> >> > > buffer tracking just because the hw works a bit different it'll
> be a mess.
> >> >> > >
> >> >> > > Wrt wait commands: I'm honestly not sure why you'd do that.
> Userspace gets
> >> >> > > to keep the pieces if it gets it wrong. You do still need to
> handle
> >> >> > > external dma_fence though, hence drm/scheduler frontend to sort
> these out.
> >> >> > >
> >> >> >
> >> >> > The reason is to disallow lower-privileged process to
> deadlock/hang a
> >> >> > higher-privileged process where the kernel can't tell who did it.
> If the
> >> >> > implicit-sync sequence counter is read only to userspace and only
> >> >> > incrementable by the unlock-signal command after the lock-wait
> command
> >> >> > appeared in the same queue (both together forming a critical
> section),
> >> >> > userspace can't manipulate it arbitrarily and we get almost the
> exact same
> >> >> > behavior as implicit sync has today. That means any
> implicitly-sync'd
> >> >> > buffer from any process can be fully trusted by a compositor to
> signal in a
> >> >> > finite time, and possibly even trusted by the kernel. The only
> thing that's
> >> >> > different is that a malicious process can disable implicit sync
> for a
> >> >> > buffer in all processes/kernel, but it can't hang other
> processes/kernel
> >> >> > (it can only hang itself and the kernel will be notified). So I'm
> a happy
> >> >> > panda now. :)
> >> >>
> >> >> Yeah I think that's not going to work too well, and is too many piled up hacks. Within a drm_file fd you can do whatever you feel like, since it's just one client.
> >> >>
> >> >> But once implicit sync kicks in I think you need to go with dma_fence and drm/scheduler to handle the dependencies, and tdr kicking it. With the dma_fence you do know who's the offender - you might not know why, but that doesn't matter, you just shred the entire context and let that userspace figure out the details.
> >> >>
> >> >> I think trying to make memory fences work as implicit sync directly, without wrapping them in a dma_fence and assorted guarantees, will just not work.
> >> >>
> >> >> And once you do wrap them in dma_fence, then all the other problems go away: cross-driver sync, syncfiles, ... So I really don't see the benefit of this half-way approach.
> >> >>
> >> >> Yes there's going to be a tad bit of overhead, but that's already there in the current model. And it can't hurt to have a bit of motivation for compositors to switch over to userspace memory fences properly.
> >> >
> >> > Well, Christian thinks that we need a high level synchronization primitive in hw. I don't know myself and you may be right. A software scheduler with user queues might be one option. My part is only to find out how much of the scheduler logic can be moved to the hardware.
> >> >
> >> > We plan to have memory timeline semaphores, or simply monotonic counters, and a fence will be represented by the counter address and a constant sequence number for the <= comparison. One counter can represent up to 2^64 different fences. Giving any process write access to a fence is the same as giving it the power to manipulate the signalled state of a sequence of up to 2^64 fences. That could mess up a lot of things. However, if the hardware had a high level synchronization primitive with access rights and a limited set of clearly defined operations such that we can formally prove whether it's safe for everybody, we could have a solution where we don't have to involve the software scheduler and just let the hardware do everything.
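The timeline-semaphore representation described here - one monotonic 64-bit counter, a fence being just (counter address, constant sequence number) - can be sketched as follows; all names are illustrative, and the forward-only write is the modeling assumption that makes one counter safely stand for up to 2^64 fences:

```python
from dataclasses import dataclass

@dataclass
class Timeline:
    """One memory timeline semaphore: a monotonically increasing counter."""
    value: int = 0

    def signal(self, seq: int) -> None:
        # Writes only ever move the counter forward; attempts to roll it
        # back are ignored, so an already-signaled fence stays signaled.
        self.value = max(self.value, seq)

@dataclass
class MemFence:
    """A fence is just a counter reference plus a constant sequence number."""
    timeline: Timeline
    seq: int

    def signaled(self) -> bool:
        # The "<=" comparison from above: fence N is signaled once the
        # counter has reached N.
        return self.seq <= self.timeline.value
```

With unrestricted write access to `Timeline.value`, a process could flip the signaled state of every `MemFence` sharing that counter at once, which is exactly the "power over 2^64 fences" concern in the paragraph above.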
> >>
> >> I don't think hw access rights control on memory fences makes sense. There's two cases:
> >>
> >> - brave new world of native userspace memory fences. Currently that's compute, maybe direct display vk, hopefully/eventually compositors and desktops too. If you get an untrusted fence, you need to have fallback logic no matter what, and by design. vk is explicit in stating that if things hang, you get to keep all the pieces. So the compositor needs to _always_ treat userspace memory fences as hostile, wrap them in a timeout, and have a fallback frame/scene in its rendering path. Probably same for the kernel on display side, maybe down to the display hw picking the "right" frame depending upon the fence value right before scanout as we discussed earlier. There's no point in hw access rights because by design, even if no one tampers with your fence, it might never signal. So you have to cope with a hostile fence from untrusted sources anyway (and within an app it's trusted and you just die as in stuck in an endless loop until someone sends a SIGKILL when you deadlock or get it wrong some other way).
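The "always treat userspace memory fences as hostile, wrap them in a timeout, keep a fallback frame" rule boils down to something like this on the compositor side (a hypothetical helper, not real compositor code):

```python
import time

def wait_with_fallback(fence_signaled, timeout_s, new_frame, fallback_frame,
                       poll_interval_s=0.001):
    """Wait on an untrusted memory fence with a hard deadline.

    fence_signaled is any callable returning True once the fence value has
    been reached. If it never does (hostile or stuck client), we present
    the fallback frame instead of blocking compositing forever.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if fence_signaled():
            return new_frame         # client finished in time
        time.sleep(poll_interval_s)
    return fallback_frame            # fence never signaled: keep the pieces
```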
> >>
> >> - old compat mode where we need to use dma_fence, otherwise we end up with another round of "amdgpu redefines implicit sync in incompatible ways", and Christian&me don't even know yet how to fix the current round without breaking use-cases badly yet. So it has to be dma_fence, and it has to be the same rules as on old hw, or it's just not going to work. This means you need to force in-order retiring of fences in the kernel, and you need to enforce timeout. None of this needs hw access rights control, since once more it's just software constructs in the kernel. As a first approximation, drm/scheduler + the fence chain we already have in syncobj is probably good enough for this. E.g. if userspace rolls the fence backwards then the kernel just ignores that, because its internal dma_fence has signalled, and dma_fences never unsignal (it's a bit in struct dma_fence, once it's set we stop checking hw). And if it doesn't signal in time, then tdr jumps in and fixes the mess. Hw access control doesn't fix anything here, because you have to deal with timeout and ordering already anyway, or the dma_fence contract is broken.
> >>
> >> So in both cases hw access control gains you nothing (at least I'm not seeing anything), it just makes the design more tricky. "Userspace can manipulate the fences" is _intentionally_ how these things work, we need a design that works with that hw design, not against it and somehow tries to get us back to the old world, but only halfway (i.e. not useful at all, since old userspace needs us to go all the way back to dma_fence, and new userspace wants to fully exploit userspace memory fences without artificial limitations for no reason).
> >> -Daniel
> >> --
> >> Daniel Vetter
> >> Software Engineer, Intel Corporation
> >> http://blog.ffwll.ch
>
>
>
> --
> Daniel Vetter
> Software Engineer, Intel Corporation
> http://blog.ffwll.ch
>

[-- Attachment #2: Type: text/html, Size: 21562 bytes --]

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [Mesa-dev] Linux Graphics Next: Userspace submission update
  2021-06-02 19:19                       ` Daniel Vetter
@ 2021-06-04  7:00                         ` Christian König
  2021-06-04  8:57                           ` Daniel Vetter
  0 siblings, 1 reply; 50+ messages in thread
From: Christian König @ 2021-06-04  7:00 UTC (permalink / raw)
  To: Daniel Vetter
  Cc: Marek Olšák, Michel Dänzer, dri-devel,
	Jason Ekstrand, ML Mesa-dev

Am 02.06.21 um 21:19 schrieb Daniel Vetter:
> On Wed, Jun 02, 2021 at 08:52:38PM +0200, Christian König wrote:
>>
>> Am 02.06.21 um 20:48 schrieb Daniel Vetter:
>>> On Wed, Jun 02, 2021 at 05:38:51AM -0400, Marek Olšák wrote:
>>>> On Wed, Jun 2, 2021 at 5:34 AM Marek Olšák <maraeo@gmail.com> wrote:
>>>>
>>>>> Yes, we can't break anything because we don't want to complicate things
>>>>> for us. It's pretty much all NAK'd already. We are trying to gather more
>>>>> knowledge and then make better decisions.
>>>>>
>>>>> The idea we are considering is that we'll expose memory-based sync objects
>>>>> to userspace for read only, and the kernel or hw will strictly control the
>>>>> memory writes to those sync objects. The hole in that idea is that
>>>>> userspace can decide not to signal a job, so even if userspace can't
>>>>> overwrite memory-based sync object states arbitrarily, it can still decide
>>>>> not to signal them, and then a future fence is born.
>>>>>
>>>> This would actually be treated as a GPU hang caused by that context, so it
>>>> should be fine.
>>> This is practically what I proposed already, except you're not doing it with
>>> dma_fence. And on the memory fence side this also doesn't actually give
>>> what you want for that compute model.
>>>
>>> This seems like a bit a worst of both worlds approach to me? Tons of work
>>> in the kernel to hide these not-dma_fence-but-almost, and still pain to
>>> actually drive the hardware like it should be for compute or direct
>>> display.
>>>
>>> Also maybe I've missed it, but I didn't see any replies to my suggestion
>>> how to fake the entire dma_fence stuff on top of new hw. Would be
>>> interesting to know what doesn't work there instead of amd folks going off
>>> into internal again and then coming back with another rfc from out of
>>> nowhere :-)
>> Well to be honest I would just push back on our hardware/firmware guys that
>> we need to keep kernel queues forever before going down that route.
> I looked again, and you said the model wont work because preemption is way
> too slow, even when the context is idle.
>
> I guess at that point I got maybe too fed up and just figured "not my
> problem", but if preempt is too slow as the unload fence, you can do it
> with pte removal and tlb shootdown too (that is hopefully not too slow,
> otherwise your hw is just garbage and wont even be fast for direct submit
> compute workloads).

Have you seen that one here: 
https://www.spinics.net/lists/amd-gfx/msg63101.html :)

I've rejected it because polling for 6 seconds on a TLB flush, which can 
block interrupts as well, is just madness.



>
> The only thing that you need to do when you use pte clearing + tlb
> shootdown instead of preemption as the unload fence for buffers that get
> moved is that if you get any gpu page fault, you don't serve that, but
> instead treat it as a tdr and shoot the context permanently.
>
> So summarizing the model I proposed:
>
> - you allow userspace to directly write into the ringbuffer, and also
>    write the fences directly
>
> - actual submit is done by the kernel, using drm/scheduler. The kernel
>    blindly trusts userspace to set up everything else, and even just wraps
>    dma_fences around the userspace memory fences.
>
> - the only check is tdr. If a fence doesn't complete a tdr fires, a) the
>    kernel shoots the entire context and b) userspace recovers by setting up a
>    new ringbuffer
>
> - memory management is done using ttm only, you still need to supply the
>    buffer list (ofc that list includes the always present ones, so CS will
>    only get the list of special buffers like today). If your hw can't turn
>    gpu page faults and you ever get one we pull up the same old solution:
>    Kernel shoots the entire context.
>
>    The important thing is that from the gpu pov memory management works
>    exactly like compute workload with direct submit, except that you just
>    terminate the context on _any_ page fault, instead of only those that go
>    somewhere where there's really no mapping and repair the others.
>
>    Also I guess from reading the old thread this means you'd disable page
>    fault retry because that is apparently also way too slow for anything.
>
> - memory management uses an unload fence. That unload fences waits for all
>    userspace memory fences (represented as dma_fence) to complete, with
>    maybe some fudge to busy-spin until we've reached the actual end of the
>    ringbuffer (maybe you have an IB tail there after the memory fence write,
>    we have that on intel hw), and it waits for the memory to get
>    "unloaded". This is either preemption, or pte clearing + tlb shootdown,
>    or whatever else your hw provides which is a) used for dynamic memory
>    management b) fast enough for actual memory management.
>
> - any time a context dies we force-complete all its pending fences,
>    in-order ofc
>
> So from hw pov this looks 99% like direct userspace submit, with the exact
> same mappings, command sequences and everything else. The only difference
> is that the ringbuffer head/tail updates happen from drm/scheduler, instead
> of directly from userspace.
>
> None of this stuff needs funny tricks where the kernel controls the
> writes to memory fences, or where you need kernel ringbuffers, or anything
> like thist. Userspace is allowed to do anything stupid, the rules are
> guaranteed with:
>
> - we rely on the hw isolation features to work, but _exactly_ like compute
>    direct submit would too
>
> - dying on any page fault captures memory management issues
>
> - dying (without kernel recover, this is up to userspace if it cares) on
>    any tdr makes sure fences complete still
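The "kernel blindly wraps dma_fences around userspace memory fences, and tdr force-completes" part of this model can be sketched as follows; this is a hypothetical model mirroring the sticky signaled bit of struct dma_fence, not kernel code:

```python
class WrappedFence:
    """dma_fence-like wrapper around an untrusted userspace memory fence.

    Once signaled it never unsignals (a sticky bit, as in struct
    dma_fence), and a tdr that shoots the context force-completes it even
    if the memory value never arrives.
    """
    def __init__(self, read_counter, seq):
        self._read = read_counter    # callable: current timeline value in memory
        self._seq = seq              # value this fence waits for
        self._signaled = False       # sticky: set once, never cleared

    def signaled(self) -> bool:
        if not self._signaled and self._read() >= self._seq:
            self._signaled = True    # stop checking hw from here on
        return self._signaled

    def force_complete(self) -> None:
        """Called when tdr fires and the context is shot."""
        self._signaled = True
```

Because the kernel-side bit is sticky, userspace rolling the memory counter backwards is simply ignored, and a fence that never signals is resolved by the tdr path.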
>
>> That syncfile and all that Android stuff isn't working out of the box with
>> the new shiny user queue submission model (which in turn is mostly because
>> of Windows) already raised some eyebrows here.
> I think if you really want to make sure the current linux stack doesn't
> break the _only_ option you have is provide a ctx mode that allows
> dma_fence and drm/scheduler to be used like today.

Yeah, but I can still just tell our hw/fw guys that we really, really 
need to keep kernel queues, or the whole Linux/Android infrastructure 
needs to get a compatibility layer like you describe above.

> For everything else it sounds you're a few years too late, because even
> just huge kernel changes wont happen in time. Much less rewriting
> userspace protocols.

Seconded; the question is rather whether we are going to start migrating at 
some point or whether we should keep pushing on our hw/fw guys.

Christian.

> -Daniel



* Re: [Mesa-dev] Linux Graphics Next: Userspace submission update
  2021-06-04  7:00                         ` Christian König
@ 2021-06-04  8:57                           ` Daniel Vetter
  2021-06-04 11:27                             ` Christian König
  0 siblings, 1 reply; 50+ messages in thread
From: Daniel Vetter @ 2021-06-04  8:57 UTC (permalink / raw)
  To: Christian König
  Cc: Marek Olšák, Michel Dänzer, dri-devel,
	Jason Ekstrand, ML Mesa-dev

On Fri, Jun 04, 2021 at 09:00:31AM +0200, Christian König wrote:
> Am 02.06.21 um 21:19 schrieb Daniel Vetter:
> > On Wed, Jun 02, 2021 at 08:52:38PM +0200, Christian König wrote:
> > > 
> > > Am 02.06.21 um 20:48 schrieb Daniel Vetter:
> > > > On Wed, Jun 02, 2021 at 05:38:51AM -0400, Marek Olšák wrote:
> > > > > On Wed, Jun 2, 2021 at 5:34 AM Marek Olšák <maraeo@gmail.com> wrote:
> > > > > 
> > > > > > Yes, we can't break anything because we don't want to complicate things
> > > > > > for us. It's pretty much all NAK'd already. We are trying to gather more
> > > > > > knowledge and then make better decisions.
> > > > > > 
> > > > > > The idea we are considering is that we'll expose memory-based sync objects
> > > > > > to userspace for read only, and the kernel or hw will strictly control the
> > > > > > memory writes to those sync objects. The hole in that idea is that
> > > > > > userspace can decide not to signal a job, so even if userspace can't
> > > > > > overwrite memory-based sync object states arbitrarily, it can still decide
> > > > > > not to signal them, and then a future fence is born.
> > > > > > 
> > > > > This would actually be treated as a GPU hang caused by that context, so it
> > > > > should be fine.
> > > > This is practically what I proposed already, except you're not doing it with
> > > > dma_fence. And on the memory fence side this also doesn't actually give
> > > > what you want for that compute model.
> > > > 
> > > > This seems like a bit a worst of both worlds approach to me? Tons of work
> > > > in the kernel to hide these not-dma_fence-but-almost, and still pain to
> > > > actually drive the hardware like it should be for compute or direct
> > > > display.
> > > > 
> > > > Also maybe I've missed it, but I didn't see any replies to my suggestion
> > > > how to fake the entire dma_fence stuff on top of new hw. Would be
> > > > interesting to know what doesn't work there instead of amd folks going off
> > > > into internal again and then coming back with another rfc from out of
> > > > nowhere :-)
> > > Well to be honest I would just push back on our hardware/firmware guys that
> > > we need to keep kernel queues forever before going down that route.
> > I looked again, and you said the model wont work because preemption is way
> > too slow, even when the context is idle.
> > 
> > I guess at that point I got maybe too fed up and just figured "not my
> > problem", but if preempt is too slow as the unload fence, you can do it
> > with pte removal and tlb shootdown too (that is hopefully not too slow,
> > otherwise your hw is just garbage and wont even be fast for direct submit
> > compute workloads).
> 
> Have you seen that one here:
> https://www.spinics.net/lists/amd-gfx/msg63101.html :)
> 
> I've rejected it because I think polling for 6 seconds on a TLB flush which
> can block interrupts as well is just madness.

Hm but I thought you had like 2 tlb flush modes, the shitty one (with
retrying page faults) and the not so shitty one?

But yeah at that point I think you just have to bite one of the bullets.

The thing is, with the hmm/userspace memory fence model this will be even
worse, because you will _have_ to do this tlb flush deep down in core mm
functions, so this is going to be userptr, but worse.

With the dma_resv/dma_fence bo memory management model you can at least
wrap that tlb flush into a dma_fence and push the waiting/pinging onto a
separate thread or something like that. If the hw really is that slow.

Somewhat aside: Personally I think that sriov needs to move over to the
compute model, i.e. indefinite timeouts, no tdr, because everything takes
too long. At least looking around sriov timeouts tend to be 10x bare
metal, across the board.

But for stuff like cloud gaming that's serious amounts of heavy lifting
since it brings us right back to "the entire linux/android 3d stack is built
on top of dma_fence right now".

> > The only thing that you need to do when you use pte clearing + tlb
> > shootdown instead of preemption as the unload fence for buffers that get
> > moved is that if you get any gpu page fault, you don't serve that, but
> > instead treat it as a tdr and shoot the context permanently.
> > 
> > So summarizing the model I proposed:
> > 
> > - you allow userspace to directly write into the ringbuffer, and also
> >    write the fences directly
> > 
> > - actual submit is done by the kernel, using drm/scheduler. The kernel
> >    blindly trusts userspace to set up everything else, and even just wraps
> >    dma_fences around the userspace memory fences.
> > 
> > - the only check is tdr. If a fence doesn't complete a tdr fires, a) the
> >    kernel shoots the entire context and b) userspace recovers by setting up a
> >    new ringbuffer
> > 
> > - memory management is done using ttm only, you still need to supply the
> >    buffer list (ofc that list includes the always present ones, so CS will
> >    only get the list of special buffers like today). If your hw can't turn
> >    gpu page faults and you ever get one we pull up the same old solution:
> >    Kernel shoots the entire context.
> > 
> >    The important thing is that from the gpu pov memory management works
> >    exactly like compute workload with direct submit, except that you just
> >    terminate the context on _any_ page fault, instead of only those that go
> >    somewhere where there's really no mapping and repair the others.
> > 
> >    Also I guess from reading the old thread this means you'd disable page
> >    fault retry because that is apparently also way too slow for anything.
> > 
> > - memory management uses an unload fence. That unload fences waits for all
> >    userspace memory fences (represented as dma_fence) to complete, with
> >    maybe some fudge to busy-spin until we've reached the actual end of the
> >    ringbuffer (maybe you have an IB tail there after the memory fence write,
> >    we have that on intel hw), and it waits for the memory to get
> >    "unloaded". This is either preemption, or pte clearing + tlb shootdown,
> >    or whatever else your hw provides which is a) used for dynamic memory
> >    management b) fast enough for actual memory management.
> > 
> > - any time a context dies we force-complete all its pending fences,
> >    in-order ofc
> > 
> > So from hw pov this looks 99% like direct userspace submit, with the exact
> > same mappings, command sequences and everything else. The only difference
> > is that the ringbuffer head/tail updates happen from drm/scheduler, instead
> > of directly from userspace.
> > 
> > None of this stuff needs funny tricks where the kernel controls the
> > writes to memory fences, or where you need kernel ringbuffers, or anything
> > like thist. Userspace is allowed to do anything stupid, the rules are
> > guaranteed with:
> > 
> > - we rely on the hw isolation features to work, but _exactly_ like compute
> >    direct submit would too
> > 
> > - dying on any page fault captures memory management issues
> > 
> > - dying (without kernel recover, this is up to userspace if it cares) on
> >    any tdr makes sure fences complete still
> > 
> > > That syncfile and all that Android stuff isn't working out of the box with
> > > the new shiny user queue submission model (which in turn is mostly because
> > > of Windows) already raised some eyebrows here.
> > I think if you really want to make sure the current linux stack doesn't
> > break the _only_ option you have is provide a ctx mode that allows
> > dma_fence and drm/scheduler to be used like today.
> 
> Yeah, but I still can just tell our hw/fw guys that we really really need to
> keep kernel queues or the whole Linux/Android infrastructure needs to get a
> compatibility layer like you describe above.
> 
> > For everything else it sounds you're a few years too late, because even
> > just huge kernel changes wont happen in time. Much less rewriting
> > userspace protocols.
> 
> Seconded, question is rather if we are going to start migrating at some
> point or if we should keep pushing on our hw/fw guys.

So from what I'm hearing other hw might gain the sw compat layer too. Plus
I'm hoping that with the sw compat layer it'd be easier to move userspace
over to the new model (because there will be a long time where we
have to support both, maybe even with a runtime switch from userspace
memory fences to dma_fence kernel stuff).

But in the end it's up to you what makes more sense between sw work and
hw/fw work involved.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch


* Re: [Mesa-dev] Linux Graphics Next: Userspace submission update
  2021-06-04  8:57                           ` Daniel Vetter
@ 2021-06-04 11:27                             ` Christian König
  2021-06-09 13:19                               ` Daniel Vetter
  0 siblings, 1 reply; 50+ messages in thread
From: Christian König @ 2021-06-04 11:27 UTC (permalink / raw)
  To: Daniel Vetter
  Cc: Marek Olšák, Michel Dänzer, dri-devel,
	Jason Ekstrand, ML Mesa-dev

Am 04.06.21 um 10:57 schrieb Daniel Vetter:
> On Fri, Jun 04, 2021 at 09:00:31AM +0200, Christian König wrote:
>> Am 02.06.21 um 21:19 schrieb Daniel Vetter:
>>> On Wed, Jun 02, 2021 at 08:52:38PM +0200, Christian König wrote:
>>>> Am 02.06.21 um 20:48 schrieb Daniel Vetter:
>>>>> On Wed, Jun 02, 2021 at 05:38:51AM -0400, Marek Olšák wrote:
>>>>>> On Wed, Jun 2, 2021 at 5:34 AM Marek Olšák <maraeo@gmail.com> wrote:
>>>>>>
>>>>>>> Yes, we can't break anything because we don't want to complicate things
>>>>>>> for us. It's pretty much all NAK'd already. We are trying to gather more
>>>>>>> knowledge and then make better decisions.
>>>>>>>
>>>>>>> The idea we are considering is that we'll expose memory-based sync objects
>>>>>>> to userspace for read only, and the kernel or hw will strictly control the
>>>>>>> memory writes to those sync objects. The hole in that idea is that
>>>>>>> userspace can decide not to signal a job, so even if userspace can't
>>>>>>> overwrite memory-based sync object states arbitrarily, it can still decide
>>>>>>> not to signal them, and then a future fence is born.
>>>>>>>
>>>>>> This would actually be treated as a GPU hang caused by that context, so it
>>>>>> should be fine.
>>>>> This is practically what I proposed already, except you're not doing it with
>>>>> dma_fence. And on the memory fence side this also doesn't actually give
>>>>> what you want for that compute model.
>>>>>
>>>>> This seems like a bit a worst of both worlds approach to me? Tons of work
>>>>> in the kernel to hide these not-dma_fence-but-almost, and still pain to
>>>>> actually drive the hardware like it should be for compute or direct
>>>>> display.
>>>>>
>>>>> Also maybe I've missed it, but I didn't see any replies to my suggestion
>>>>> how to fake the entire dma_fence stuff on top of new hw. Would be
>>>>> interesting to know what doesn't work there instead of amd folks going off
>>>>> into internal again and then coming back with another rfc from out of
>>>>> nowhere :-)
>>>> Well to be honest I would just push back on our hardware/firmware guys that
>>>> we need to keep kernel queues forever before going down that route.
>>> I looked again, and you said the model wont work because preemption is way
>>> too slow, even when the context is idle.
>>>
>>> I guess at that point I got maybe too fed up and just figured "not my
>>> problem", but if preempt is too slow as the unload fence, you can do it
>>> with pte removal and tlb shootdown too (that is hopefully not too slow,
>>> otherwise your hw is just garbage and wont even be fast for direct submit
>>> compute workloads).
>> Have you seen that one here:
>> https://www.spinics.net/lists/amd-gfx/msg63101.html :)
>>
>> I've rejected it because I think polling for 6 seconds on a TLB flush which
>> can block interrupts as well is just madness.
> Hm but I thought you had like 2 tlb flush modes, the shitty one (with
> retrying page faults) and the not so shitty one?

Yeah, we call this the lightweight and the heavyweight tlb flush.

The lightweight one can be used when you are sure that you don't have any of 
the PTEs currently in flight in the 3D/DMA engine and you just need to 
invalidate the TLB.

The heavyweight must be used when you need to invalidate the TLB *AND* 
make sure that no concurrent operation moves new stuff into the TLB.

The problem is for this use case we have to use the heavyweight one.

> But yeah at that point I think you just have to bite one of the bullets.

Yeah, completely agree. We can choose which way we want to die, but it's 
certainly not going to be nice whatever we do.

>
> The thing is with hmm/userspace memory fence model this will be even
> worse, because you will _have_ to do this tlb flush deep down in core mm
> functions, so this is going to be userptr, but worse.
>
> With the dma_resv/dma_fence bo memory management model you can at least
> wrap that tlb flush into a dma_fence and push the waiting/pinging onto a
> separate thread or something like that. If the hw really is that slow.
>
> Somewhat aside: Personally I think that sriov needs to move over to the
> compute model, i.e. indefinite timeouts, no tdr, because everything takes
> too long. At least looking around sriov timeouts tend to be 10x bare
> metal, across the board.
>
> But for stuff like cloud gaming that's serious amounts of heavy lifting
> since it brings us right back to "the entire linux/android 3d stack is built
> on top of dma_fence right now".
>
>>> The only thing that you need to do when you use pte clearing + tlb
>>> shootdown instead of preemption as the unload fence for buffers that get
>>> moved is that if you get any gpu page fault, you don't serve that, but
>>> instead treat it as a tdr and shoot the context permanently.
>>>
>>> So summarizing the model I proposed:
>>>
>>> - you allow userspace to directly write into the ringbuffer, and also
>>>     write the fences directly
>>>
>>> - actual submit is done by the kernel, using drm/scheduler. The kernel
>>>     blindly trusts userspace to set up everything else, and even just wraps
>>>     dma_fences around the userspace memory fences.
>>>
>>> - the only check is tdr. If a fence doesn't complete, a tdr fires, a) the
>>>     kernel shoots the entire context and b) userspace recovers by setting up a
>>>     new ringbuffer
>>>
>>> - memory management is done using ttm only, you still need to supply the
>>>     buffer list (ofc that list includes the always present ones, so CS will
>>>     only get the list of special buffers like today). If your hw can't turn off
>>>     gpu page faults and you ever get one we pull up the same old solution:
>>>     Kernel shoots the entire context.
>>>
>>>     The important thing is that from the gpu pov memory management works
>>>     exactly like compute workload with direct submit, except that you just
>>>     terminate the context on _any_ page fault, instead of only those that go
>>>     somewhere where there's really no mapping and repair the others.
>>>
>>>     Also I guess from reading the old thread this means you'd disable page
>>>     fault retry because that is apparently also way too slow for anything.
>>>
>>> - memory management uses an unload fence. That unload fence waits for all
>>>     userspace memory fences (represented as dma_fence) to complete, with
>>>     maybe some fudge to busy-spin until we've reached the actual end of the
>>>     ringbuffer (maybe you have an IB tail there after the memory fence write,
>>>     we have that on intel hw), and it waits for the memory to get
>>>     "unloaded". This is either preemption, or pte clearing + tlb shootdown,
>>>     or whatever else your hw provides which is a) used for dynamic memory
>>>     management b) fast enough for actual memory management.
>>>
>>> - any time a context dies we force-complete all its pending fences,
>>>     in-order ofc
>>>
>>> So from hw pov this looks 99% like direct userspace submit, with the exact
>>> same mappings, command sequences and everything else. The only difference
>>> is that the ringbuffer head/tail updates happen from drm/scheduler, instead
>>> of directly from userspace.
>>>
>>> None of this stuff needs funny tricks where the kernel controls the
>>> writes to memory fences, or where you need kernel ringbuffers, or anything
>>> like this. Userspace is allowed to do anything stupid, the rules are
>>> guaranteed with:
>>>
>>> - we rely on the hw isolation features to work, but _exactly_ like compute
>>>     direct submit would too
>>>
>>> - dying on any page fault captures memory management issues
>>>
>>> - dying (without kernel recover, this is up to userspace if it cares) on
>>>     any tdr makes sure fences complete still
>>>
>>>> That syncfile and all that Android stuff isn't working out of the box with
>>>> the new shiny user queue submission model (which in turn is mostly because
>>>> of Windows) already raised some eyebrows here.
>>> I think if you really want to make sure the current linux stack doesn't
>>> break, the _only_ option you have is to provide a ctx mode that allows
>>> dma_fence and drm/scheduler to be used like today.
>> Yeah, but I still can just tell our hw/fw guys that we really really need to
>> keep kernel queues or the whole Linux/Android infrastructure needs to get a
>> compatibility layer like you describe above.
>>
>>> For everything else it sounds like you're a few years too late, because even
>>> just huge kernel changes won't happen in time. Much less rewriting
>>> userspace protocols.
>> Seconded, question is rather if we are going to start migrating at some
>> point or if we should keep pushing on our hw/fw guys.
> So from what I'm hearing other hw might gain the sw compat layer too. Plus
> I'm hoping that with the sw compat layer it'd be easier to smooth over
> userspace to the new model (because there will be a long time where we
> have to support both, maybe even with a runtime switch from userspace
> memory fences to dma_fence kernel stuff).
>
> But in the end it's up to you what makes more sense between sw work and
> hw/fw work involved.

I'm currently entertaining the idea of implementing the HW scheduling on 
the CPU.

The only obstacle that I can really see is that we might need to unmap a 
page in the CPU page table when a queue becomes idle.

But apart from that it would give us the required functionality, just 
without the hardware scheduler.
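
A rough sketch of how that could look, as a toy Python model (every name here
is invented for illustration; none of this is actual kernel or driver code):
while a queue is active its doorbell page stays mapped and userspace can ring
it directly; when the queue goes idle, the CPU-side scheduler unmaps the page
so the next ring traps into the kernel, which remaps it and puts the queue
back on the run list.

```python
# Toy model of CPU-side queue scheduling with doorbell-page unmapping.
# Everything here (names, structure) is a hypothetical sketch, not amdgpu code.

class CpuScheduler:
    def __init__(self):
        self.mapped = set()      # queues whose doorbell page is CPU-mapped
        self.runnable = set()    # queues the scheduler considers active
        self.faults = 0          # doorbell writes that trapped into the kernel

    def make_idle(self, queue):
        # The one page-table operation the idea needs: unmap the doorbell
        # when the queue becomes idle, so the next write faults.
        self.mapped.discard(queue)
        self.runnable.discard(queue)

    def ring_doorbell(self, queue):
        # Userspace writes the doorbell. If the page is unmapped this is a
        # CPU page fault, and the kernel-side handler reactivates the queue.
        if queue not in self.mapped:
            self.faults += 1
            self.mapped.add(queue)   # fault handler remaps the page
        self.runnable.add(queue)     # ...and schedules the queue again

sched = CpuScheduler()
sched.ring_doorbell("gfx0")   # first ring faults in and activates the queue
sched.ring_doorbell("gfx0")   # while mapped, rings don't involve the kernel
sched.make_idle("gfx0")       # idle: doorbell unmapped again
sched.ring_doorbell("gfx0")   # next ring faults and re-activates the queue
```

After this sequence the queue is runnable again, and only the two rings that
hit an unmapped doorbell involved the kernel at all.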

Christian.

> -Daniel


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [Mesa-dev] Linux Graphics Next: Userspace submission update
  2021-06-04 11:27                             ` Christian König
@ 2021-06-09 13:19                               ` Daniel Vetter
  2021-06-09 13:58                                 ` Christian König
  0 siblings, 1 reply; 50+ messages in thread
From: Daniel Vetter @ 2021-06-09 13:19 UTC (permalink / raw)
  To: Christian König
  Cc: Marek Olšák, Michel Dänzer, dri-devel,
	Jason Ekstrand, ML Mesa-dev

On Fri, Jun 04, 2021 at 01:27:15PM +0200, Christian König wrote:
> Am 04.06.21 um 10:57 schrieb Daniel Vetter:
> > On Fri, Jun 04, 2021 at 09:00:31AM +0200, Christian König wrote:
> > > Am 02.06.21 um 21:19 schrieb Daniel Vetter:
> > > > On Wed, Jun 02, 2021 at 08:52:38PM +0200, Christian König wrote:
> > > > > Am 02.06.21 um 20:48 schrieb Daniel Vetter:
> > > > > > On Wed, Jun 02, 2021 at 05:38:51AM -0400, Marek Olšák wrote:
> > > > > > > On Wed, Jun 2, 2021 at 5:34 AM Marek Olšák <maraeo@gmail.com> wrote:
> > > > > > > 
> > > > > > > > Yes, we can't break anything because we don't want to complicate things
> > > > > > > > for us. It's pretty much all NAK'd already. We are trying to gather more
> > > > > > > > knowledge and then make better decisions.
> > > > > > > > 
> > > > > > > > The idea we are considering is that we'll expose memory-based sync objects
> > > > > > > > to userspace for read only, and the kernel or hw will strictly control the
> > > > > > > > memory writes to those sync objects. The hole in that idea is that
> > > > > > > > userspace can decide not to signal a job, so even if userspace can't
> > > > > > > > overwrite memory-based sync object states arbitrarily, it can still decide
> > > > > > > > not to signal them, and then a future fence is born.
> > > > > > > > 
> > > > > > > This would actually be treated as a GPU hang caused by that context, so it
> > > > > > > should be fine.
> > > > > > This is practically what I proposed already, except your not doing it with
> > > > > > dma_fence. And on the memory fence side this also doesn't actually give
> > > > > > what you want for that compute model.
> > > > > > 
> > > > > > This seems like a bit a worst of both worlds approach to me? Tons of work
> > > > > > in the kernel to hide these not-dma_fence-but-almost, and still pain to
> > > > > > actually drive the hardware like it should be for compute or direct
> > > > > > display.
> > > > > > 
> > > > > > Also maybe I've missed it, but I didn't see any replies to my suggestion
> > > > > > how to fake the entire dma_fence stuff on top of new hw. Would be
> > > > > > interesting to know what doesn't work there instead of amd folks going off
> > > > > > into internal again and then coming back with another rfc from out of
> > > > > > nowhere :-)
> > > > > Well to be honest I would just push back on our hardware/firmware guys that
> > > > > we need to keep kernel queues forever before going down that route.
> > > > I looked again, and you said the model won't work because preemption is way
> > > > too slow, even when the context is idle.
> > > > 
> > > > I guess at that point I got maybe too fed up and just figured "not my
> > > > problem", but if preempt is too slow as the unload fence, you can do it
> > > > with pte removal and tlb shootdown too (that is hopefully not too slow,
> > > > otherwise your hw is just garbage and won't even be fast for direct submit
> > > > compute workloads).
> > > Have you seen that one here:
> > > https://www.spinics.net/lists/amd-gfx/msg63101.html :)
> > > 
> > > I've rejected it because I think polling for 6 seconds on a TLB flush which
> > > can block interrupts as well is just madness.
> > Hm but I thought you had like 2 tlb flush modes, the shitty one (with
> > retrying page faults) and the not so shitty one?
> 
> Yeah, we call this the lightweight and the heavyweight tlb flush.
> 
> The lightweight can be used when you are sure that you don't have any of the
> PTEs currently in flight in the 3D/DMA engine and you just need to
> invalidate the TLB.
> 
> The heavyweight must be used when you need to invalidate the TLB *AND* make
> sure that no concurrent operation moves new stuff into the TLB.
> 
> The problem is that for this use case we have to use the heavyweight one.

Just for my own curiosity: So the lightweight flush is only for in-between
CS when you know access is idle? Or does that also not work if userspace
has a CS on a dma engine going at the same time because the TLBs aren't
isolated enough between engines?
-Daniel

> > But yeah at that point I think you just have to bite one of the bullets.
> 
> Yeah, completely agree. We can choose which way we want to die, but it's
> certainly not going to be nice whatever we do.
> 
> > 
> > The thing is with hmm/userspace memory fence model this will be even
> > worse, because you will _have_ to do this tlb flush deep down in core mm
> > functions, so this is going to be userptr, but worse.
> > 
> > With the dma_resv/dma_fence bo memory management model you can at least
> > wrap that tlb flush into a dma_fence and push the waiting/pinging onto a
> > separate thread or something like that. If the hw really is that slow.
> > 
> > Somewhat aside: Personally I think that sriov needs to move over to the
> > compute model, i.e. indefinite timeouts, no tdr, because everything takes
> > too long. At least looking around sriov timeouts tend to be 10x bare
> > metal, across the board.
> > 
> > But for stuff like cloud gaming that's serious amounts of heavy lifting
> > since it brings us right back to "the entire linux/android 3d stack is built
> > on top of dma_fence right now".
> > 
> > > > The only thing that you need to do when you use pte clearing + tlb
> > > > shootdown instead of preemption as the unload fence for buffers that get
> > > > moved is that if you get any gpu page fault, you don't serve that, but
> > > > instead treat it as a tdr and shoot the context permanently.
> > > > 
> > > > So summarizing the model I proposed:
> > > > 
> > > > - you allow userspace to directly write into the ringbuffer, and also
> > > >     write the fences directly
> > > > 
> > > > - actual submit is done by the kernel, using drm/scheduler. The kernel
> > > >     blindly trusts userspace to set up everything else, and even just wraps
> > > >     dma_fences around the userspace memory fences.
> > > > 
> > > > - the only check is tdr. If a fence doesn't complete, a tdr fires, a) the
> > > >     kernel shoots the entire context and b) userspace recovers by setting up a
> > > >     new ringbuffer
> > > > 
> > > > - memory management is done using ttm only, you still need to supply the
> > > >     buffer list (ofc that list includes the always present ones, so CS will
> > > >     only get the list of special buffers like today). If your hw can't turn off
> > > >     gpu page faults and you ever get one we pull up the same old solution:
> > > >     Kernel shoots the entire context.
> > > > 
> > > >     The important thing is that from the gpu pov memory management works
> > > >     exactly like compute workload with direct submit, except that you just
> > > >     terminate the context on _any_ page fault, instead of only those that go
> > > >     somewhere where there's really no mapping and repair the others.
> > > > 
> > > >     Also I guess from reading the old thread this means you'd disable page
> > > >     fault retry because that is apparently also way too slow for anything.
> > > > 
> > > > - memory management uses an unload fence. That unload fence waits for all
> > > >     userspace memory fences (represented as dma_fence) to complete, with
> > > >     maybe some fudge to busy-spin until we've reached the actual end of the
> > > >     ringbuffer (maybe you have an IB tail there after the memory fence write,
> > > >     we have that on intel hw), and it waits for the memory to get
> > > >     "unloaded". This is either preemption, or pte clearing + tlb shootdown,
> > > >     or whatever else your hw provides which is a) used for dynamic memory
> > > >     management b) fast enough for actual memory management.
> > > > 
> > > > - any time a context dies we force-complete all its pending fences,
> > > >     in-order ofc
> > > > 
> > > > So from hw pov this looks 99% like direct userspace submit, with the exact
> > > > same mappings, command sequences and everything else. The only difference
> > > > is that the ringbuffer head/tail updates happen from drm/scheduler, instead
> > > > of directly from userspace.
> > > > 
> > > > None of this stuff needs funny tricks where the kernel controls the
> > > > writes to memory fences, or where you need kernel ringbuffers, or anything
> > > > like this. Userspace is allowed to do anything stupid, the rules are
> > > > guaranteed with:
> > > > 
> > > > - we rely on the hw isolation features to work, but _exactly_ like compute
> > > >     direct submit would too
> > > > 
> > > > - dying on any page fault captures memory management issues
> > > > 
> > > > - dying (without kernel recover, this is up to userspace if it cares) on
> > > >     any tdr makes sure fences complete still
> > > > 
> > > > > That syncfile and all that Android stuff isn't working out of the box with
> > > > > the new shiny user queue submission model (which in turn is mostly because
> > > > > of Windows) already raised some eyebrows here.
> > > > I think if you really want to make sure the current linux stack doesn't
> > > > break, the _only_ option you have is to provide a ctx mode that allows
> > > > dma_fence and drm/scheduler to be used like today.
> > > Yeah, but I still can just tell our hw/fw guys that we really really need to
> > > keep kernel queues or the whole Linux/Android infrastructure needs to get a
> > > compatibility layer like you describe above.
> > > 
> > > > For everything else it sounds like you're a few years too late, because even
> > > > just huge kernel changes won't happen in time. Much less rewriting
> > > > userspace protocols.
> > > Seconded, question is rather if we are going to start migrating at some
> > > point or if we should keep pushing on our hw/fw guys.
> > So from what I'm hearing other hw might gain the sw compat layer too. Plus
> > I'm hoping that with the sw compat layer it'd be easier to smooth over
> > userspace to the new model (because there will be a long time where we
> > have to support both, maybe even with a runtime switch from userspace
> > memory fences to dma_fence kernel stuff).
> > 
> > But in the end it's up to you what makes more sense between sw work and
> > hw/fw work involved.
> 
> I'm currently entertaining my head a bit with the idea of implementing the
> HW scheduling on the CPU.
> 
> The only obstacle that I can really see is that we might need to unmap a
> page in the CPU page table when a queue becomes idle.
> 
> But apart from that it would give us the required functionality, just
> without the hardware scheduler.
> 
> Christian.
> 
> > -Daniel
> 

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch


* Re: [Mesa-dev] Linux Graphics Next: Userspace submission update
  2021-06-09 13:19                               ` Daniel Vetter
@ 2021-06-09 13:58                                 ` Christian König
  2021-06-09 18:31                                   ` Daniel Vetter
  0 siblings, 1 reply; 50+ messages in thread
From: Christian König @ 2021-06-09 13:58 UTC (permalink / raw)
  To: Daniel Vetter
  Cc: Marek Olšák, Michel Dänzer, dri-devel,
	Jason Ekstrand, ML Mesa-dev

Am 09.06.21 um 15:19 schrieb Daniel Vetter:
> [SNIP]
>> Yeah, we call this the lightweight and the heavyweight tlb flush.
>>
>> The lightweight can be used when you are sure that you don't have any of the
>> PTEs currently in flight in the 3D/DMA engine and you just need to
>> invalidate the TLB.
>>
>> The heavyweight must be used when you need to invalidate the TLB *AND* make
>> sure that no concurrent operation moves new stuff into the TLB.
>>
>> The problem is for this use case we have to use the heavyweight one.
> Just for my own curiosity: So the lightweight flush is only for in-between
> CS when you know access is idle? Or does that also not work if userspace
> has a CS on a dma engine going at the same time because the TLBs aren't
> isolated enough between engines?

More or less correct, yes.

The problem is a lightweight flush only invalidates the TLB, but doesn't 
take care of entries which have been handed out to the different engines.

In other words what can happen is the following:

1. Shader asks TLB to resolve address X.
2. TLB looks into its cache and can't find address X so it asks the 
walker to resolve.
3. Walker comes back with result for address X and TLB puts that into 
its cache and gives it to Shader.
4. Shader starts doing some operation using result for address X.
5. You send lightweight TLB invalidate and TLB throws away cached values 
for address X.
6. Shader happily still uses whatever the TLB gave to it in step 3 to 
access address X.

See it like the shader has its own 1-entry L0 TLB cache which is not 
affected by the lightweight flush.

The heavyweight flush on the other hand sends out a broadcast signal to 
everybody and only comes back when we are sure that an address is not in 
use any more.
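
The race can be condensed into a toy Python model (a hypothetical sketch with
invented names, not hardware documentation): the lightweight flush empties the
TLB's own cache but cannot revoke a translation an engine is already holding,
while the heavyweight flush also waits for those in-flight translations to
drain.

```python
# Toy model of the 6-step race above -- purely illustrative.

class Tlb:
    def __init__(self):
        self.cache = {}          # va -> pa entries held in the TLB itself
        self.in_flight = set()   # translations already handed to engines

    def resolve(self, va):
        # Steps 1-3: a miss goes to the walker, the result is cached, and a
        # copy is handed to the requesting engine (the shader's "L0").
        pa = self.cache.setdefault(va, f"pa:{va:#x}")
        self.in_flight.add((va, pa))
        return pa

    def lightweight_flush(self):
        # Step 5: only the TLB's own cache is dropped; handed-out
        # translations keep working (step 6).
        self.cache.clear()

    def heavyweight_flush(self):
        # Broadcast to everybody; returns only once no engine still uses a
        # handed-out translation.
        self.cache.clear()
        self.in_flight.clear()

tlb = Tlb()
shader_held = tlb.resolve(0x1000)            # step 4: shader works with this PA

tlb.lightweight_flush()
stale_after_light = len(tlb.in_flight) > 0   # True: page not safe to reuse yet

tlb.heavyweight_flush()
stale_after_heavy = len(tlb.in_flight) > 0   # False: safe to move the page
```

So for memory management (moving a page out from under the engines) only the
heavyweight variant gives the needed guarantee, which is exactly the expensive
case described above.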

Christian.

> -Daniel
>



* Re: [Mesa-dev] Linux Graphics Next: Userspace submission update
  2021-06-09 13:58                                 ` Christian König
@ 2021-06-09 18:31                                   ` Daniel Vetter
  2021-06-10 15:59                                     ` Marek Olšák
  0 siblings, 1 reply; 50+ messages in thread
From: Daniel Vetter @ 2021-06-09 18:31 UTC (permalink / raw)
  To: Christian König
  Cc: Marek Olšák, Michel Dänzer, dri-devel,
	Jason Ekstrand, ML Mesa-dev

On Wed, Jun 09, 2021 at 03:58:26PM +0200, Christian König wrote:
> Am 09.06.21 um 15:19 schrieb Daniel Vetter:
> > [SNIP]
> > > Yeah, we call this the lightweight and the heavyweight tlb flush.
> > > 
> > > The lightweight can be used when you are sure that you don't have any of the
> > > PTEs currently in flight in the 3D/DMA engine and you just need to
> > > invalidate the TLB.
> > > 
> > > The heavyweight must be used when you need to invalidate the TLB *AND* make
> > > sure that no concurrent operation moves new stuff into the TLB.
> > > 
> > > The problem is for this use case we have to use the heavyweight one.
> > Just for my own curiosity: So the lightweight flush is only for in-between
> > CS when you know access is idle? Or does that also not work if userspace
> > has a CS on a dma engine going at the same time because the TLBs aren't
> > isolated enough between engines?
> 
> More or less correct, yes.
> 
> The problem is a lightweight flush only invalidates the TLB, but doesn't
> take care of entries which have been handed out to the different engines.
> 
> In other words what can happen is the following:
> 
> 1. Shader asks TLB to resolve address X.
> 2. TLB looks into its cache and can't find address X so it asks the walker
> to resolve.
> 3. Walker comes back with result for address X and TLB puts that into its
> cache and gives it to Shader.
> 4. Shader starts doing some operation using result for address X.
> 5. You send lightweight TLB invalidate and TLB throws away cached values for
> address X.
> 6. Shader happily still uses whatever the TLB gave to it in step 3 to
> access address X
> 
> See it like the shader has its own 1-entry L0 TLB cache which is not
> affected by the lightweight flush.
> 
> The heavyweight flush on the other hand sends out a broadcast signal to
> everybody and only comes back when we are sure that an address is not in use
> any more.

Ah makes sense. On intel the shaders only operate in VA, everything goes
around as explicit async messages to IO blocks. So we don't have this, the
only difference in tlb flushes is between a tlb flush in the IB and an mmio
one which is independent from anything currently being executed on an
engine.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch


* Re: [Mesa-dev] Linux Graphics Next: Userspace submission update
  2021-06-09 18:31                                   ` Daniel Vetter
@ 2021-06-10 15:59                                     ` Marek Olšák
  2021-06-10 16:33                                       ` Christian König
  0 siblings, 1 reply; 50+ messages in thread
From: Marek Olšák @ 2021-06-10 15:59 UTC (permalink / raw)
  To: Daniel Vetter
  Cc: Christian König, Michel Dänzer, dri-devel,
	Jason Ekstrand, ML Mesa-dev


Hi Daniel,

We just talked about this whole topic internally and we came to the
conclusion that the hardware needs to understand sync object handles and
have high-level wait and signal operations in the command stream. Sync
objects will be backed by memory, but they won't be readable or writable by
processes directly. The hardware will log all accesses to sync objects and
will send the log to the kernel periodically. The kernel will identify
malicious behavior.

Example of a hardware command stream:
...
ImplicitSyncWait(syncObjHandle, sequenceNumber); // the sequence number is
assigned by the kernel
Draw();
ImplicitSyncSignalWhenDone(syncObjHandle);
...

I'm afraid we have no other choice because of the TLB invalidation overhead.
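
A toy sketch of how the kernel side of that scheme could work (all names here
are hypothetical; only the two command-stream operations above come from this
mail): the kernel hands out sequence numbers, the hardware appends every
wait/signal to a log, and the kernel periodically scans the log for sequence
numbers that were waited on but never signalled.

```python
# Hypothetical sketch of the logged sync-object scheme -- not a real API.

class Gpu:
    def __init__(self, num_syncobjs):
        self.syncobjs = [0] * num_syncobjs   # memory-backed, not user-writable
        self.log = []                        # (ctx, op, handle, seq), sent to kernel

    # Commands as they would appear in a hardware command stream:
    def implicit_sync_wait(self, ctx, handle, seq):
        self.log.append((ctx, "wait", handle, seq))

    def implicit_sync_signal_when_done(self, ctx, handle, seq):
        self.log.append((ctx, "signal", handle, seq))
        self.syncobjs[handle] = max(self.syncobjs[handle], seq)

def kernel_scan(gpu, signal_owner):
    """Periodic kernel pass over the hw log: any (handle, seq) that was
    waited on but never signalled points at the context that was supposed
    to signal it (signal_owner maps (handle, seq) -> ctx)."""
    signalled = {(h, s) for (_, op, h, s) in gpu.log if op == "signal"}
    waited = {(h, s) for (_, op, h, s) in gpu.log if op == "wait"}
    return {signal_owner[k] for k in waited - signalled}

gpu = Gpu(num_syncobjs=4)
gpu.implicit_sync_signal_when_done("ctxA", handle=0, seq=1)  # A behaves
gpu.implicit_sync_wait("ctxB", handle=0, seq=1)              # satisfied wait
gpu.implicit_sync_wait("ctxB", handle=1, seq=5)              # never signalled

guilty = kernel_scan(gpu, signal_owner={(0, 1): "ctxA", (1, 5): "ctxC"})
```

Here the scan identifies ctxC as the context that was supposed to signal
(handle 1, seq 5) and never did, which maps onto treating the unsignalled
fence as a GPU hang caused by that context.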

Marek


On Wed, Jun 9, 2021 at 2:31 PM Daniel Vetter <daniel@ffwll.ch> wrote:

> On Wed, Jun 09, 2021 at 03:58:26PM +0200, Christian König wrote:
> > Am 09.06.21 um 15:19 schrieb Daniel Vetter:
> > > [SNIP]
> > > > Yeah, we call this the lightweight and the heavyweight tlb flush.
> > > >
> > > > The lightweight can be used when you are sure that you don't have any
> of the
> > > > PTEs currently in flight in the 3D/DMA engine and you just need to
> > > > invalidate the TLB.
> > > >
> > > > The heavyweight must be used when you need to invalidate the TLB
> *AND* make
> > > > sure that no concurrent operation moves new stuff into the TLB.
> > > >
> > > > The problem is for this use case we have to use the heavyweight one.
> > > Just for my own curiosity: So the lightweight flush is only for
> in-between
> > > CS when you know access is idle? Or does that also not work if
> userspace
> > > has a CS on a dma engine going at the same time because the TLBs aren't
> > > isolated enough between engines?
> >
> > More or less correct, yes.
> >
> > The problem is a lightweight flush only invalidates the TLB, but doesn't
> > take care of entries which have been handed out to the different engines.
> >
> > In other words what can happen is the following:
> >
> > 1. Shader asks TLB to resolve address X.
> > 2. TLB looks into its cache and can't find address X so it asks the
> walker
> > to resolve.
> > 3. Walker comes back with result for address X and TLB puts that into its
> > cache and gives it to Shader.
> > 4. Shader starts doing some operation using result for address X.
> > 5. You send lightweight TLB invalidate and TLB throws away cached values
> for
> > address X.
> > 6. Shader happily still uses whatever the TLB gave to it in step 3 to
> > access address X
> >
> > See it like the shader has its own 1-entry L0 TLB cache which is not
> > affected by the lightweight flush.
> >
> > The heavyweight flush on the other hand sends out a broadcast signal to
> > everybody and only comes back when we are sure that an address is not in
> use
> > any more.
>
> Ah makes sense. On intel the shaders only operate in VA, everything goes
> around as explicit async messages to IO blocks. So we don't have this, the
> only difference in tlb flushes is between tlb flush in the IB and an mmio
> one which is independent for anything currently being executed on an
> engine.
> -Daniel
> --
> Daniel Vetter
> Software Engineer, Intel Corporation
> http://blog.ffwll.ch
>



* Re: [Mesa-dev] Linux Graphics Next: Userspace submission update
  2021-06-10 15:59                                     ` Marek Olšák
@ 2021-06-10 16:33                                       ` Christian König
  2021-06-14 17:10                                         ` Marek Olšák
  0 siblings, 1 reply; 50+ messages in thread
From: Christian König @ 2021-06-10 16:33 UTC (permalink / raw)
  To: Marek Olšák, Daniel Vetter
  Cc: Jason Ekstrand, ML Mesa-dev, Michel Dänzer, dri-devel


Hi guys,

maybe soften that a bit. Reading from the shared memory of the user 
fence is ok for everybody. What we need to take more care of is the 
writing side.

So my current thinking is that we allow read only access, but writing a 
new sequence value needs to go through the scheduler/kernel.

So when the CPU wants to signal a timeline fence it needs to call an 
IOCTL. When the GPU wants to signal the timeline fence it needs to hand 
that off to the hardware scheduler.

If there is a lockup the kernel can check with the hardware who did the 
last write and what value was written.

That together with an IOCTL to give out sequence numbers for implicit 
sync to applications should be sufficient for the kernel to track who is 
responsible if something bad happens.

In other words when the hardware says that the shader wrote stuff like 
0xdeadbeef 0x0 or 0xffffffff into memory we kill the process who did that.

If the hardware says that seq - 1 was written fine, but seq is missing 
then the kernel blames whoever was supposed to write seq.

Just piping the write through a privileged instance should be fine to 
make sure that we don't run into issues.
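
The blame rules above can be condensed into a toy check (hypothetical names
and shapes; 0x0/0xdeadbeef/0xffffffff are just the garbage examples from this
mail): a bogus value kills its writer, and a missing seq whose predecessor
completed fine blames whoever was assigned to write seq.

```python
# Hypothetical sketch of the kernel's blame logic after a lockup.
# write_log is what the hardware reports: (writer, value_written).
# assigned maps each expected sequence value to the process responsible.

GARBAGE = {0x0, 0xdeadbeef, 0xffffffff}

def blame(write_log, assigned, hung_seq):
    written = set()
    for writer, value in write_log:
        if value in GARBAGE:
            return writer          # shader wrote junk: kill that process
        written.add(value)
    if hung_seq not in written and (hung_seq - 1) in written:
        return assigned[hung_seq]  # seq-1 fine, seq missing: blame its owner
    return None                    # nothing attributable to a process

# Missing seq 2 while seq 1 completed: blame the process assigned to seq 2.
missing = blame([("procA", 1)], {2: "procB"}, hung_seq=2)

# A junk write is pinned on whoever the hardware says performed it.
junk = blame([("procA", 0xdeadbeef)], {2: "procB"}, hung_seq=2)
```

In the first case the check returns "procB" (the assigned signaller of the
missing seq), in the second "procA" (the reported writer of the garbage
value).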

Christian.

Am 10.06.21 um 17:59 schrieb Marek Olšák:
> Hi Daniel,
>
> We just talked about this whole topic internally and we came to the 
> conclusion that the hardware needs to understand sync object handles 
> and have high-level wait and signal operations in the command stream. 
> Sync objects will be backed by memory, but they won't be readable or 
> writable by processes directly. The hardware will log all accesses to 
> sync objects and will send the log to the kernel periodically. The 
> kernel will identify malicious behavior.
>
> Example of a hardware command stream:
> ...
> ImplicitSyncWait(syncObjHandle, sequenceNumber); // the sequence 
> number is assigned by the kernel
> Draw();
> ImplicitSyncSignalWhenDone(syncObjHandle);
> ...
>
> I'm afraid we have no other choice because of the TLB invalidation 
> overhead.
>
> Marek
>
>
> On Wed, Jun 9, 2021 at 2:31 PM Daniel Vetter <daniel@ffwll.ch 
> <mailto:daniel@ffwll.ch>> wrote:
>
>     On Wed, Jun 09, 2021 at 03:58:26PM +0200, Christian König wrote:
>     > Am 09.06.21 um 15:19 schrieb Daniel Vetter:
>     > > [SNIP]
>     > > > Yeah, we call this the lightweight and the heavyweight tlb
>     flush.
>     > > >
>     > > > The lightweight can be used when you are sure that you don't
>     have any of the
>     > > > PTEs currently in flight in the 3D/DMA engine and you just
>     need to
>     > > > invalidate the TLB.
>     > > >
>     > > > The heavyweight must be used when you need to invalidate the
>     TLB *AND* make
>     > > > sure that no concurrent operation moves new stuff into the
>     TLB.
>     > > >
>     > > > The problem is for this use case we have to use the
>     heavyweight one.
>     > > Just for my own curiosity: So the lightweight flush is only
>     for in-between
>     > > CS when you know access is idle? Or does that also not work if
>     userspace
>     > > has a CS on a dma engine going at the same time because the
>     TLBs aren't
>     > > isolated enough between engines?
>     >
>     > More or less correct, yes.
>     >
>     > The problem is a lightweight flush only invalidates the TLB, but
>     doesn't
>     > take care of entries which have been handed out to the different
>     engines.
>     >
>     > In other words what can happen is the following:
>     >
>     > 1. Shader asks TLB to resolve address X.
>     > 2. TLB looks into its cache and can't find address X so it asks
>     the walker
>     > to resolve.
>     > 3. Walker comes back with result for address X and TLB puts that
>     into its
>     > cache and gives it to Shader.
>     > 4. Shader starts doing some operation using result for address X.
>     > 5. You send lightweight TLB invalidate and TLB throws away
>     cached values for
>     > address X.
>     > 6. Shader happily still uses whatever the TLB gave to it in step
>     3 to
>     > access address X
>     >
>     > See it like the shader has its own 1-entry L0 TLB cache which
>     is not
>     > affected by the lightweight flush.
>     >
>     > The heavyweight flush on the other hand sends out a broadcast
>     signal to
>     > everybody and only comes back when we are sure that an address
>     is not in use
>     > any more.
>
>     Ah makes sense. On intel the shaders only operate in VA,
>     everything goes
>     around as explicit async messages to IO blocks. So we don't have
>     this, the
>     only difference in tlb flushes is between tlb flush in the IB and
>     an mmio
>     one which is independent for anything currently being executed on an
>     engine.
>     -Daniel
>     -- 
>     Daniel Vetter
>     Software Engineer, Intel Corporation
>     http://blog.ffwll.ch <http://blog.ffwll.ch>
>




* Re: [Mesa-dev] Linux Graphics Next: Userspace submission update
  2021-06-10 16:33                                       ` Christian König
@ 2021-06-14 17:10                                         ` Marek Olšák
  2021-06-14 17:13                                           ` Christian König
  0 siblings, 1 reply; 50+ messages in thread
From: Marek Olšák @ 2021-06-14 17:10 UTC (permalink / raw)
  To: Christian König
  Cc: Michel Dänzer, dri-devel, Jason Ekstrand, ML Mesa-dev


The call to the hw scheduler has a limitation on the size of all parameters
combined. I think we can only pass a 32-bit sequence number and a ~16-bit
global (per-GPU) syncobj handle in one call and not much else.

The syncobj handle can be an element index in a global (per-GPU) syncobj
table and it's read only for all processes with the exception of the signal
command. Syncobjs can either have per VMID write access flags for the
signal command (slow), or any process can write to any syncobjs and only
rely on the kernel checking the write log (fast).

In any case, we can execute the memory write in the queue engine and only
use the hw scheduler for logging, which would be perfect.

Marek
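
The "fast" variant above can be sketched as a kernel-side audit of a hardware write log. Everything here is hypothetical illustration: the struct layouts, names (`audit_write`, `owner_vmid`) and the ~16-bit handle width are made up for the sketch and match no real driver interface.

```c
/* Sketch of the "fast" option: any process may write any syncobj, and
 * the kernel audits a hardware-provided write log after the fact.
 * All names and layouts are invented for illustration. */
#include <assert.h>
#include <stdint.h>

#define SYNCOBJ_TABLE_SIZE (1u << 16)   /* ~16-bit global handle */

struct syncobj {
    uint32_t seq;        /* last sequence number signaled */
    uint32_t owner_vmid; /* process expected to signal next */
};

struct write_log_entry {
    uint16_t handle;     /* index into the global per-GPU table */
    uint32_t vmid;       /* who performed the write */
    uint32_t value;      /* sequence number written */
};

static struct syncobj table[SYNCOBJ_TABLE_SIZE];

/* Returns the VMID to blame, or 0 if the logged write was legitimate:
 * monotonically increasing seq, written by the expected owner. */
static uint32_t audit_write(const struct write_log_entry *e)
{
    struct syncobj *obj = &table[e->handle];

    if (e->vmid != obj->owner_vmid)
        return e->vmid;              /* wrote someone else's syncobj */
    if (e->value != obj->seq + 1)
        return e->vmid;              /* bogus value, e.g. 0xffffffff */

    obj->seq = e->value;             /* write checks out, commit it */
    return 0;
}
```

The point of the sketch is only that the signal path stays a plain memory write; blame assignment happens after the fact from the log.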

On Thu, Jun 10, 2021 at 12:33 PM Christian König <
ckoenig.leichtzumerken@gmail.com> wrote:

> Hi guys,
>
> maybe soften that a bit. Reading from the shared memory of the user fence
> is ok for everybody. What we need to take more care of is the writing side.
>
> So my current thinking is that we allow read only access, but writing a
> new sequence value needs to go through the scheduler/kernel.
>
> So when the CPU wants to signal a timeline fence it needs to call an
> IOCTL. When the GPU wants to signal the timeline fence it needs to hand
> that off to the hardware scheduler.
>
> If we lockup the kernel can check with the hardware who did the last write
> and what value was written.
>
> That together with an IOCTL to give out sequence numbers for implicit sync
> to applications should be sufficient for the kernel to track who is
> responsible if something bad happens.
>
> In other words when the hardware says that the shader wrote stuff like
> 0xdeadbeef 0x0 or 0xffffffff into memory we kill the process who did that.
>
> If the hardware says that seq - 1 was written fine, but seq is missing
> then the kernel blames whoever was supposed to write seq.
>
> Just piping the write through a privileged instance should be fine to
> make sure that we don't run into issues.
>
> Christian.
>
> Am 10.06.21 um 17:59 schrieb Marek Olšák:
>
> Hi Daniel,
>
> We just talked about this whole topic internally and we came up to the
> conclusion that the hardware needs to understand sync object handles and
> have high-level wait and signal operations in the command stream. Sync
> objects will be backed by memory, but they won't be readable or writable by
> processes directly. The hardware will log all accesses to sync objects and
> will send the log to the kernel periodically. The kernel will identify
> malicious behavior.
>
> Example of a hardware command stream:
> ...
> ImplicitSyncWait(syncObjHandle, sequenceNumber); // the sequence number is
> assigned by the kernel
> Draw();
> ImplicitSyncSignalWhenDone(syncObjHandle);
> ...
>
> I'm afraid we have no other choice because of the TLB invalidation
> overhead.
>
> Marek
>
>
> On Wed, Jun 9, 2021 at 2:31 PM Daniel Vetter <daniel@ffwll.ch> wrote:
>
>> On Wed, Jun 09, 2021 at 03:58:26PM +0200, Christian König wrote:
>> > Am 09.06.21 um 15:19 schrieb Daniel Vetter:
>> > > [SNIP]
>> > > > Yeah, we call this the lightweight and the heavyweight tlb flush.
>> > > >
>> > > > The lightweight can be used when you are sure that you don't have
>> any of the
>> > > > PTEs currently in flight in the 3D/DMA engine and you just need to
>> > > > invalidate the TLB.
>> > > >
>> > > > The heavyweight must be used when you need to invalidate the TLB
>> *AND* make
>> > > > sure that no concurrent operation moves new stuff into the TLB.
>> > > >
>> > > > The problem is for this use case we have to use the heavyweight one.
>> > > Just for my own curiosity: So the lightweight flush is only for
>> in-between
>> > > CS when you know access is idle? Or does that also not work if
>> userspace
>> > > has a CS on a dma engine going at the same time because the tlb aren't
>> > > isolated enough between engines?
>> >
>> > More or less correct, yes.
>> >
>> > The problem is a lightweight flush only invalidates the TLB, but doesn't
>> > take care of entries which have been handed out to the different
>> engines.
>> >
>> > In other words what can happen is the following:
>> >
>> > 1. Shader asks TLB to resolve address X.
>> > 2. TLB looks into its cache and can't find address X so it asks the
>> walker
>> > to resolve.
>> > 3. Walker comes back with result for address X and TLB puts that into
>> its
>> > cache and gives it to Shader.
>> > 4. Shader starts doing some operation using result for address X.
>> > 5. You send lightweight TLB invalidate and TLB throws away cached
>> values for
>> > address X.
>> > 6. Shader happily still uses whatever the TLB gave to it in step 3 to
>> > access address X
>> >
>> > See it like the shader has its own 1 entry L0 TLB cache which is not
>> > affected by the lightweight flush.
>> >
>> > The heavyweight flush on the other hand sends out a broadcast signal to
>> > everybody and only comes back when we are sure that an address is not
>> in use
>> > any more.
>>
>> Ah makes sense. On intel the shaders only operate in VA, everything goes
>> around as explicit async messages to IO blocks. So we don't have this, the
>> only difference in tlb flushes is between tlb flush in the IB and an mmio
>> one which is independent for anything currently being executed on an
>> engine.
>> -Daniel
>> --
>> Daniel Vetter
>> Software Engineer, Intel Corporation
>> http://blog.ffwll.ch
>>
>
>



* Re: [Mesa-dev] Linux Graphics Next: Userspace submission update
  2021-06-14 17:10                                         ` Marek Olšák
@ 2021-06-14 17:13                                           ` Christian König
  2021-06-17 16:48                                             ` Daniel Vetter
  0 siblings, 1 reply; 50+ messages in thread
From: Christian König @ 2021-06-14 17:13 UTC (permalink / raw)
  To: Marek Olšák
  Cc: Michel Dänzer, dri-devel, Jason Ekstrand, ML Mesa-dev


As long as we can figure out who touched a certain sync object last
that would indeed work, yes.

Christian.
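
"Who touched it last" reduces to a backwards scan of the hardware write log. A toy sketch, with an invented log format (nothing here is a real interface):

```c
/* Toy illustration of "who touched this sync object last": walk the
 * hardware write log backwards and report the most recent writer.
 * The log entry layout is made up for this sketch. */
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct log_entry {
    uint16_t handle;   /* syncobj the write targeted */
    uint32_t vmid;     /* process that performed the write */
};

/* Returns the VMID of the last logged writer of @handle, or 0 if the
 * object was never written. */
static uint32_t last_writer(const struct log_entry *log, size_t n,
                            uint16_t handle)
{
    while (n--)
        if (log[n].handle == handle)
            return log[n].vmid;
    return 0;
}
```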

Am 14.06.21 um 19:10 schrieb Marek Olšák:
> The call to the hw scheduler has a limitation on the size of all 
> parameters combined. I think we can only pass a 32-bit sequence number 
> and a ~16-bit global (per-GPU) syncobj handle in one call and not much 
> else.
>
> The syncobj handle can be an element index in a global (per-GPU) 
> syncobj table and it's read only for all processes with the exception 
> of the signal command. Syncobjs can either have per VMID write access 
> flags for the signal command (slow), or any process can write to any 
> syncobjs and only rely on the kernel checking the write log (fast).
>
> In any case, we can execute the memory write in the queue engine and 
> only use the hw scheduler for logging, which would be perfect.
>
> Marek
>
> On Thu, Jun 10, 2021 at 12:33 PM Christian König 
> <ckoenig.leichtzumerken@gmail.com 
> <mailto:ckoenig.leichtzumerken@gmail.com>> wrote:
>
>     Hi guys,
>
>     maybe soften that a bit. Reading from the shared memory of the
>     user fence is ok for everybody. What we need to take more care of
>     is the writing side.
>
>     So my current thinking is that we allow read only access, but
>     writing a new sequence value needs to go through the scheduler/kernel.
>
>     So when the CPU wants to signal a timeline fence it needs to call
>     an IOCTL. When the GPU wants to signal the timeline fence it needs
>     to hand that off to the hardware scheduler.
>
>     If we lockup the kernel can check with the hardware who did the
>     last write and what value was written.
>
>     That together with an IOCTL to give out sequence numbers for
>     implicit sync to applications should be sufficient for the kernel
>     to track who is responsible if something bad happens.
>
>     In other words when the hardware says that the shader wrote stuff
>     like 0xdeadbeef 0x0 or 0xffffffff into memory we kill the process
>     who did that.
>
>     If the hardware says that seq - 1 was written fine, but seq is
>     missing then the kernel blames whoever was supposed to write seq.
>
>     Just piping the write through a privileged instance should be
>     fine to make sure that we don't run into issues.
>
>     Christian.
>
>     Am 10.06.21 um 17:59 schrieb Marek Olšák:
>>     Hi Daniel,
>>
>>     We just talked about this whole topic internally and we came up
>>     to the conclusion that the hardware needs to understand sync
>>     object handles and have high-level wait and signal operations in
>>     the command stream. Sync objects will be backed by memory, but
>>     they won't be readable or writable by processes directly. The
>>     hardware will log all accesses to sync objects and will send the
>>     log to the kernel periodically. The kernel will identify
>>     malicious behavior.
>>
>>     Example of a hardware command stream:
>>     ...
>>     ImplicitSyncWait(syncObjHandle, sequenceNumber); // the sequence
>>     number is assigned by the kernel
>>     Draw();
>>     ImplicitSyncSignalWhenDone(syncObjHandle);
>>     ...
>>
>>     I'm afraid we have no other choice because of the TLB
>>     invalidation overhead.
>>
>>     Marek
>>
>>
>>     On Wed, Jun 9, 2021 at 2:31 PM Daniel Vetter <daniel@ffwll.ch
>>     <mailto:daniel@ffwll.ch>> wrote:
>>
>>         On Wed, Jun 09, 2021 at 03:58:26PM +0200, Christian König wrote:
>>         > Am 09.06.21 um 15:19 schrieb Daniel Vetter:
>>         > > [SNIP]
>>         > > > Yeah, we call this the lightweight and the heavyweight
>>         tlb flush.
>>         > > >
>>         > > > The lightweight can be used when you are sure that you
>>         don't have any of the
>>         > > > PTEs currently in flight in the 3D/DMA engine and you
>>         just need to
>>         > > > invalidate the TLB.
>>         > > >
>>         > > > The heavyweight must be used when you need to
>>         invalidate the TLB *AND* make
>>         > > > sure that no concurrent operation moves new stuff
>>         into the TLB.
>>         > > >
>>         > > > The problem is for this use case we have to use the
>>         heavyweight one.
>>         > > Just for my own curiosity: So the lightweight flush is
>>         only for in-between
>>         > > CS when you know access is idle? Or does that also not
>>         work if userspace
>>         > > has a CS on a dma engine going at the same time because
>>         the tlb aren't
>>         > > isolated enough between engines?
>>         >
>>         > More or less correct, yes.
>>         >
>>         > The problem is a lightweight flush only invalidates the
>>         TLB, but doesn't
>>         > take care of entries which have been handed out to the
>>         different engines.
>>         >
>>         > In other words what can happen is the following:
>>         >
>>         > 1. Shader asks TLB to resolve address X.
>>         > 2. TLB looks into its cache and can't find address X so it
>>         asks the walker
>>         > to resolve.
>>         > 3. Walker comes back with result for address X and TLB puts
>>         that into its
>>         > cache and gives it to Shader.
>>         > 4. Shader starts doing some operation using result for
>>         address X.
>>         > 5. You send lightweight TLB invalidate and TLB throws away
>>         cached values for
>>         > address X.
>>         > 6. Shader happily still uses whatever the TLB gave to it in
>>         step 3 to
>>         > access address X
>>         >
>>         > See it like the shader has its own 1 entry L0 TLB cache
>>         which is not
>>         > affected by the lightweight flush.
>>         >
>>         > The heavyweight flush on the other hand sends out a
>>         broadcast signal to
>>         > everybody and only comes back when we are sure that an
>>         address is not in use
>>         > any more.
>>
>>         Ah makes sense. On intel the shaders only operate in VA,
>>         everything goes
>>         around as explicit async messages to IO blocks. So we don't
>>         have this, the
>>         only difference in tlb flushes is between tlb flush in the IB
>>         and an mmio
>>         one which is independent for anything currently being
>>         executed on an
>>         engine.
>>         -Daniel
>>         -- 
>>         Daniel Vetter
>>         Software Engineer, Intel Corporation
>>         http://blog.ffwll.ch <http://blog.ffwll.ch>
>>
>




* Re: [Mesa-dev] Linux Graphics Next: Userspace submission update
  2021-06-14 17:13                                           ` Christian König
@ 2021-06-17 16:48                                             ` Daniel Vetter
  2021-06-17 18:28                                               ` Marek Olšák
  0 siblings, 1 reply; 50+ messages in thread
From: Daniel Vetter @ 2021-06-17 16:48 UTC (permalink / raw)
  To: Christian König
  Cc: Marek Olšák, Michel Dänzer, dri-devel,
	Jason Ekstrand, ML Mesa-dev

On Mon, Jun 14, 2021 at 07:13:00PM +0200, Christian König wrote:
> As long as we can figure out who touched a certain sync object last that
> would indeed work, yes.

Don't you need to know who will touch it next, i.e. who is holding up your
fence? Or maybe I'm just again totally confused.
-Daniel

> 
> Christian.
> 
> Am 14.06.21 um 19:10 schrieb Marek Olšák:
> > The call to the hw scheduler has a limitation on the size of all
> > parameters combined. I think we can only pass a 32-bit sequence number
> > and a ~16-bit global (per-GPU) syncobj handle in one call and not much
> > else.
> > 
> > The syncobj handle can be an element index in a global (per-GPU) syncobj
> > table and it's read only for all processes with the exception of the
> > signal command. Syncobjs can either have per VMID write access flags for
> > the signal command (slow), or any process can write to any syncobjs and
> > only rely on the kernel checking the write log (fast).
> > 
> > In any case, we can execute the memory write in the queue engine and
> > only use the hw scheduler for logging, which would be perfect.
> > 
> > Marek
> > 
> > On Thu, Jun 10, 2021 at 12:33 PM Christian König
> > <ckoenig.leichtzumerken@gmail.com
> > <mailto:ckoenig.leichtzumerken@gmail.com>> wrote:
> > 
> >     Hi guys,
> > 
> >     maybe soften that a bit. Reading from the shared memory of the
> >     user fence is ok for everybody. What we need to take more care of
> >     is the writing side.
> > 
> >     So my current thinking is that we allow read only access, but
> >     writing a new sequence value needs to go through the scheduler/kernel.
> > 
> >     So when the CPU wants to signal a timeline fence it needs to call
> >     an IOCTL. When the GPU wants to signal the timeline fence it needs
> >     to hand that off to the hardware scheduler.
> > 
> >     If we lockup the kernel can check with the hardware who did the
> >     last write and what value was written.
> > 
> >     That together with an IOCTL to give out sequence numbers for
> >     implicit sync to applications should be sufficient for the kernel
> >     to track who is responsible if something bad happens.
> > 
> >     In other words when the hardware says that the shader wrote stuff
> >     like 0xdeadbeef 0x0 or 0xffffffff into memory we kill the process
> >     who did that.
> > 
> >     If the hardware says that seq - 1 was written fine, but seq is
> >     missing then the kernel blames whoever was supposed to write seq.
> > 
> >     Just piping the write through a privileged instance should be
> >     fine to make sure that we don't run into issues.
> > 
> >     Christian.
> > 
> >     Am 10.06.21 um 17:59 schrieb Marek Olšák:
> > >     Hi Daniel,
> > > 
> > >     We just talked about this whole topic internally and we came up
> > >     to the conclusion that the hardware needs to understand sync
> > >     object handles and have high-level wait and signal operations in
> > >     the command stream. Sync objects will be backed by memory, but
> > >     they won't be readable or writable by processes directly. The
> > >     hardware will log all accesses to sync objects and will send the
> > >     log to the kernel periodically. The kernel will identify
> > >     malicious behavior.
> > > 
> > >     Example of a hardware command stream:
> > >     ...
> > >     ImplicitSyncWait(syncObjHandle, sequenceNumber); // the sequence
> > >     number is assigned by the kernel
> > >     Draw();
> > >     ImplicitSyncSignalWhenDone(syncObjHandle);
> > >     ...
> > > 
> > >     I'm afraid we have no other choice because of the TLB
> > >     invalidation overhead.
> > > 
> > >     Marek
> > > 
> > > 
> > >     On Wed, Jun 9, 2021 at 2:31 PM Daniel Vetter <daniel@ffwll.ch
> > >     <mailto:daniel@ffwll.ch>> wrote:
> > > 
> > >         On Wed, Jun 09, 2021 at 03:58:26PM +0200, Christian König wrote:
> > >         > Am 09.06.21 um 15:19 schrieb Daniel Vetter:
> > >         > > [SNIP]
> > >         > > > Yeah, we call this the lightweight and the heavyweight
> > >         tlb flush.
> > >         > > >
> > >         > > > The lightweight can be used when you are sure that you
> > >         don't have any of the
> > >         > > > PTEs currently in flight in the 3D/DMA engine and you
> > >         just need to
> > >         > > > invalidate the TLB.
> > >         > > >
> > >         > > > The heavyweight must be used when you need to
> > >         invalidate the TLB *AND* make
> > >         > > > sure that no concurrent operation moves new stuff
> > >         into the TLB.
> > >         > > >
> > >         > > > The problem is for this use case we have to use the
> > >         heavyweight one.
> > >         > > Just for my own curiosity: So the lightweight flush is
> > >         only for in-between
> > >         > > CS when you know access is idle? Or does that also not
> > >         work if userspace
> > >         > > has a CS on a dma engine going at the same time because
> > >         the tlb aren't
> > >         > > isolated enough between engines?
> > >         >
> > >         > More or less correct, yes.
> > >         >
> > >         > The problem is a lightweight flush only invalidates the
> > >         TLB, but doesn't
> > >         > take care of entries which have been handed out to the
> > >         different engines.
> > >         >
> > >         > In other words what can happen is the following:
> > >         >
> > >         > 1. Shader asks TLB to resolve address X.
> > >         > 2. TLB looks into its cache and can't find address X so it
> > >         asks the walker
> > >         > to resolve.
> > >         > 3. Walker comes back with result for address X and TLB puts
> > >         that into its
> > >         > cache and gives it to Shader.
> > >         > 4. Shader starts doing some operation using result for
> > >         address X.
> > >         > 5. You send lightweight TLB invalidate and TLB throws away
> > >         cached values for
> > >         > address X.
> > >         > 6. Shader happily still uses whatever the TLB gave to it in
> > >         step 3 to
> > >         > access address X
> > >         >
> > >         > See it like the shader has its own 1 entry L0 TLB cache
> > >         which is not
> > >         > affected by the lightweight flush.
> > >         >
> > >         > The heavyweight flush on the other hand sends out a
> > >         broadcast signal to
> > >         > everybody and only comes back when we are sure that an
> > >         address is not in use
> > >         > any more.
> > > 
> > >         Ah makes sense. On intel the shaders only operate in VA,
> > >         everything goes
> > >         around as explicit async messages to IO blocks. So we don't
> > >         have this, the
> > >         only difference in tlb flushes is between tlb flush in the IB
> > >         and an mmio
> > >         one which is independent for anything currently being
> > >         executed on an
> > >         engine.
> > >         -Daniel
> > >         -- 
> > >         Daniel Vetter
> > >         Software Engineer, Intel Corporation
> > >         http://blog.ffwll.ch <http://blog.ffwll.ch>
> > > 
> > 
> 

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch


* Re: [Mesa-dev] Linux Graphics Next: Userspace submission update
  2021-06-17 16:48                                             ` Daniel Vetter
@ 2021-06-17 18:28                                               ` Marek Olšák
  2021-06-17 19:04                                                 ` Daniel Vetter
  0 siblings, 1 reply; 50+ messages in thread
From: Marek Olšák @ 2021-06-17 18:28 UTC (permalink / raw)
  To: Daniel Vetter
  Cc: Christian König, Michel Dänzer, dri-devel,
	Jason Ekstrand, ML Mesa-dev


The kernel will know who should touch the implicit-sync semaphore next, and
at the same time, a copy of all write requests to the implicit-sync
semaphore will be forwarded to the kernel for monitoring and bo_wait.

Syncobjs could either use the same monitored access as implicit sync or be
completely unmonitored. We haven't decided yet.

Syncfiles could either use one of the above or wait for a syncobj to go
idle before converting to a syncfile.

Marek
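
The monitoring scheme above can be sketched roughly like this: the kernel keeps a shadow copy of the semaphore, fed by the forwarded write requests, and answers bo_wait-style queries from that shadow. All names (`monitored_sem`, `monitor_write`, `bo_wait_done`) are invented for the sketch; this is not a proposed interface.

```c
/* Sketch: every write request to the implicit-sync semaphore is also
 * forwarded to the kernel, which keeps a shadow copy.  bo_wait() can
 * then be answered from the shadow without touching GPU memory.
 * Entirely hypothetical. */
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct monitored_sem {
    uint32_t shadow_seq;   /* kernel's copy, fed by forwarded writes */
    uint32_t next_writer;  /* who is expected to signal next */
};

/* Called for each write request forwarded to the kernel. */
static void monitor_write(struct monitored_sem *s, uint32_t vmid,
                          uint32_t value)
{
    if (vmid == s->next_writer && value == s->shadow_seq + 1)
        s->shadow_seq = value;  /* otherwise: flag for blame handling */
}

/* bo_wait-style check: has the semaphore reached @seq yet? */
static bool bo_wait_done(const struct monitored_sem *s, uint32_t seq)
{
    return (int32_t)(s->shadow_seq - seq) >= 0;  /* wraparound-safe */
}
```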



On Thu, Jun 17, 2021 at 12:48 PM Daniel Vetter <daniel@ffwll.ch> wrote:

> On Mon, Jun 14, 2021 at 07:13:00PM +0200, Christian König wrote:
> > As long as we can figure out who touched a certain sync object last
> that
> > would indeed work, yes.
>
> Don't you need to know who will touch it next, i.e. who is holding up your
> fence? Or maybe I'm just again totally confused.
> -Daniel
>
> >
> > Christian.
> >
> > Am 14.06.21 um 19:10 schrieb Marek Olšák:
> > > The call to the hw scheduler has a limitation on the size of all
> > > parameters combined. I think we can only pass a 32-bit sequence number
> > > and a ~16-bit global (per-GPU) syncobj handle in one call and not much
> > > else.
> > >
> > > The syncobj handle can be an element index in a global (per-GPU)
> syncobj
> > > table and it's read only for all processes with the exception of the
> > > signal command. Syncobjs can either have per VMID write access flags
> for
> > > the signal command (slow), or any process can write to any syncobjs and
> > > only rely on the kernel checking the write log (fast).
> > >
> > > In any case, we can execute the memory write in the queue engine and
> > > only use the hw scheduler for logging, which would be perfect.
> > >
> > > Marek
> > >
> > > On Thu, Jun 10, 2021 at 12:33 PM Christian König
> > > <ckoenig.leichtzumerken@gmail.com
> > > <mailto:ckoenig.leichtzumerken@gmail.com>> wrote:
> > >
> > >     Hi guys,
> > >
> > >     maybe soften that a bit. Reading from the shared memory of the
> > >     user fence is ok for everybody. What we need to take more care of
> > >     is the writing side.
> > >
> > >     So my current thinking is that we allow read only access, but
> > >     writing a new sequence value needs to go through the
> scheduler/kernel.
> > >
> > >     So when the CPU wants to signal a timeline fence it needs to call
> > >     an IOCTL. When the GPU wants to signal the timeline fence it needs
> > >     to hand that off to the hardware scheduler.
> > >
> > >     If we lockup the kernel can check with the hardware who did the
> > >     last write and what value was written.
> > >
> > >     That together with an IOCTL to give out sequence numbers for
> > >     implicit sync to applications should be sufficient for the kernel
> > >     to track who is responsible if something bad happens.
> > >
> > >     In other words when the hardware says that the shader wrote stuff
> > >     like 0xdeadbeef 0x0 or 0xffffffff into memory we kill the process
> > >     who did that.
> > >
> > >     If the hardware says that seq - 1 was written fine, but seq is
> > >     missing then the kernel blames whoever was supposed to write seq.
> > >
> > >     Just piping the write through a privileged instance should be
> > >     fine to make sure that we don't run into issues.
> > >
> > >     Christian.
> > >
> > >     Am 10.06.21 um 17:59 schrieb Marek Olšák:
> > > >     Hi Daniel,
> > > >
> > > >     We just talked about this whole topic internally and we came up
> > > >     to the conclusion that the hardware needs to understand sync
> > > >     object handles and have high-level wait and signal operations in
> > > >     the command stream. Sync objects will be backed by memory, but
> > > >     they won't be readable or writable by processes directly. The
> > > >     hardware will log all accesses to sync objects and will send the
> > > >     log to the kernel periodically. The kernel will identify
> > > >     malicious behavior.
> > > >
> > > >     Example of a hardware command stream:
> > > >     ...
> > > >     ImplicitSyncWait(syncObjHandle, sequenceNumber); // the sequence
> > > >     number is assigned by the kernel
> > > >     Draw();
> > > >     ImplicitSyncSignalWhenDone(syncObjHandle);
> > > >     ...
> > > >
> > > >     I'm afraid we have no other choice because of the TLB
> > > >     invalidation overhead.
> > > >
> > > >     Marek
> > > >
> > > >
> > > >     On Wed, Jun 9, 2021 at 2:31 PM Daniel Vetter <daniel@ffwll.ch
> > > >     <mailto:daniel@ffwll.ch>> wrote:
> > > >
> > > >         On Wed, Jun 09, 2021 at 03:58:26PM +0200, Christian König
> wrote:
> > > >         > Am 09.06.21 um 15:19 schrieb Daniel Vetter:
> > > >         > > [SNIP]
> > > >         > > > Yeah, we call this the lightweight and the heavyweight
> > > >         tlb flush.
> > > >         > > >
> > > >         > > > The lightweight can be used when you are sure that you
> > > >         don't have any of the
> > > >         > > > PTEs currently in flight in the 3D/DMA engine and you
> > > >         just need to
> > > >         > > > invalidate the TLB.
> > > >         > > >
> > > >         > > > The heavyweight must be used when you need to
> > > >         invalidate the TLB *AND* make
> > > >         > > > sure that no concurrent operation moves new stuff
> > > >         into the TLB.
> > > >         > > >
> > > >         > > > The problem is for this use case we have to use the
> > > >         heavyweight one.
> > > >         > > Just for my own curiosity: So the lightweight flush is
> > > >         only for in-between
> > > >         > > CS when you know access is idle? Or does that also not
> > > >         work if userspace
> > > >         > > has a CS on a dma engine going at the same time because
> > > >         the tlb aren't
> > > >         > > isolated enough between engines?
> > > >         >
> > > >         > More or less correct, yes.
> > > >         >
> > > >         > The problem is a lightweight flush only invalidates the
> > > >         TLB, but doesn't
> > > >         > take care of entries which have been handed out to the
> > > >         different engines.
> > > >         >
> > > >         > In other words what can happen is the following:
> > > >         >
> > > >         > 1. Shader asks TLB to resolve address X.
> > > >         > 2. TLB looks into its cache and can't find address X so it
> > > >         asks the walker
> > > >         > to resolve.
> > > >         > 3. Walker comes back with result for address X and TLB puts
> > > >         that into its
> > > >         > cache and gives it to Shader.
> > > >         > 4. Shader starts doing some operation using result for
> > > >         address X.
> > > >         > 5. You send lightweight TLB invalidate and TLB throws away
> > > >         cached values for
> > > >         > address X.
> > > >         > 6. Shader happily still uses whatever the TLB gave to it in
> > > >         step 3 to
> > > >         > access address X
> > > >         >
> > > >         > See it like the shader has its own 1 entry L0 TLB cache
> > > >         which is not
> > > >         > affected by the lightweight flush.
> > > >         >
> > > >         > The heavyweight flush on the other hand sends out a
> > > >         broadcast signal to
> > > >         > everybody and only comes back when we are sure that an
> > > >         address is not in use
> > > >         > any more.
> > > >
> > > >         Ah makes sense. On intel the shaders only operate in VA,
> > > >         everything goes
> > > >         around as explicit async messages to IO blocks. So we don't
> > > >         have this, the
> > > >         only difference in tlb flushes is between tlb flush in the IB
> > > >         and an mmio
> > > >         one which is independent for anything currently being
> > > >         executed on an
> > > >         engine.
> > > >         -Daniel
> > > >         -- 
> > > >         Daniel Vetter
> > > >         Software Engineer, Intel Corporation
> > > >         http://blog.ffwll.ch <http://blog.ffwll.ch>
> > > >
> > >
> >
>
> --
> Daniel Vetter
> Software Engineer, Intel Corporation
> http://blog.ffwll.ch
>



* Re: [Mesa-dev] Linux Graphics Next: Userspace submission update
  2021-06-17 18:28                                               ` Marek Olšák
@ 2021-06-17 19:04                                                 ` Daniel Vetter
  2021-06-17 19:23                                                   ` Marek Olšák
  0 siblings, 1 reply; 50+ messages in thread
From: Daniel Vetter @ 2021-06-17 19:04 UTC (permalink / raw)
  To: Marek Olšák
  Cc: Jason Ekstrand, Christian König, Michel Dänzer,
	dri-devel, ML Mesa-dev

On Thu, Jun 17, 2021 at 02:28:06PM -0400, Marek Olšák wrote:
> The kernel will know who should touch the implicit-sync semaphore next, and
> at the same time, the copy of all write requests to the implicit-sync
> semaphore will be forwarded to the kernel for monitoring and bo_wait.
> 
> Syncobjs could either use the same monitored access as implicit sync or be
> completely unmonitored. We haven't decided yet.
> 
> Syncfiles could either use one of the above or wait for a syncobj to go
> idle before converting to a syncfile.

Hm this sounds all like you're planning to completely rewrap everything
... I'm assuming the plan is still that this is going to be largely
wrapped in dma_fence? Maybe with timeline objects being a bit more
optimized, but I'm not sure how much you can optimize without breaking the
interfaces.
-Daniel

> 
> Marek
> 
> 
> 
> On Thu, Jun 17, 2021 at 12:48 PM Daniel Vetter <daniel@ffwll.ch> wrote:
> 
> > On Mon, Jun 14, 2021 at 07:13:00PM +0200, Christian König wrote:
> > > As long as we can figure out who touched a certain sync object last
> > that
> > > would indeed work, yes.
> >
> > Don't you need to know who will touch it next, i.e. who is holding up your
> > fence? Or maybe I'm just again totally confused.
> > -Daniel
> >
> > >
> > > Christian.
> > >
> > > Am 14.06.21 um 19:10 schrieb Marek Olšák:
> > > > The call to the hw scheduler has a limitation on the size of all
> > > > parameters combined. I think we can only pass a 32-bit sequence number
> > > > and a ~16-bit global (per-GPU) syncobj handle in one call and not much
> > > > else.
> > > >
> > > > The syncobj handle can be an element index in a global (per-GPU)
> > syncobj
> > > > table and it's read only for all processes with the exception of the
> > > > signal command. Syncobjs can either have per VMID write access flags
> > for
> > > > the signal command (slow), or any process can write to any syncobjs and
> > > > only rely on the kernel checking the write log (fast).
> > > >
> > > > In any case, we can execute the memory write in the queue engine and
> > > > only use the hw scheduler for logging, which would be perfect.
> > > >
> > > > Marek
> > > >
> > > > On Thu, Jun 10, 2021 at 12:33 PM Christian König
> > > > <ckoenig.leichtzumerken@gmail.com
> > > > <mailto:ckoenig.leichtzumerken@gmail.com>> wrote:
> > > >
> > > >     Hi guys,
> > > >
> > > >     maybe soften that a bit. Reading from the shared memory of the
> > > >     user fence is ok for everybody. What we need to take more care of
> > > >     is the writing side.
> > > >
> > > >     So my current thinking is that we allow read only access, but
> > > >     writing a new sequence value needs to go through the
> > > >     scheduler/kernel.
> > > >
> > > >     So when the CPU wants to signal a timeline fence it needs to call
> > > >     an IOCTL. When the GPU wants to signal the timeline fence it needs
> > > >     to hand that off to the hardware scheduler.
> > > >
> > > >     If we lockup the kernel can check with the hardware who did the
> > > >     last write and what value was written.
> > > >
> > > >     That together with an IOCTL to give out sequence number for
> > > >     implicit sync to applications should be sufficient for the kernel
> > > >     to track who is responsible if something bad happens.
> > > >
> > > >     In other words when the hardware says that the shader wrote stuff
> > > >     like 0xdeadbeef 0x0 or 0xffffffff into memory we kill the process
> > > >     who did that.
> > > >
> > > >     If the hardware says that seq - 1 was written fine, but seq is
> > > >     missing then the kernel blames whoever was supposed to write seq.
> > > >
> > > >     Just piping the write through a privileged instance should be
> > > >     fine to make sure that we don't run into issues.
> > > >
> > > >     Christian.
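As a toy illustration of that blame check, this is roughly the decision the
kernel could make after a lockup. The enum, the function name and the exact
policy below are made up for the sketch, not a real driver interface:

```c
#include <stdint.h>

/* Hypothetical verdicts for the kernel's post-lockup analysis. */
enum blame { BLAME_NONE, BLAME_LAST_WRITER, BLAME_EXPECTED_WRITER };

/*
 * The timeline is expected to advance to `expected`.  A garbage value
 * (0x0, 0xffffffff, 0xdeadbeef, ...) blames whoever the hardware logged
 * as the last writer; a missing value blames whoever owed the write.
 */
enum blame check_timeline(uint32_t last_written, uint32_t expected)
{
    if (last_written == expected)
        return BLAME_NONE;            /* seq arrived fine, nobody to blame */
    if (last_written == expected - 1)
        return BLAME_EXPECTED_WRITER; /* seq missing: blame who owed it */
    return BLAME_LAST_WRITER;         /* bogus write: kill that process */
}
```

In reality the inputs would of course come from the hardware's write log
rather than two integers.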
> > > >
> > > >     Am 10.06.21 um 17:59 schrieb Marek Olšák:
> > > > >     Hi Daniel,
> > > > >
> > > > >     We just talked about this whole topic internally and we came to
> > > > >     the conclusion that the hardware needs to understand sync
> > > > >     object handles and have high-level wait and signal operations in
> > > > >     the command stream. Sync objects will be backed by memory, but
> > > > >     they won't be readable or writable by processes directly. The
> > > > >     hardware will log all accesses to sync objects and will send the
> > > > >     log to the kernel periodically. The kernel will identify
> > > > >     malicious behavior.
> > > > >
> > > > >     Example of a hardware command stream:
> > > > >     ...
> > > > >     ImplicitSyncWait(syncObjHandle, sequenceNumber); // the sequence
> > > > >     number is assigned by the kernel
> > > > >     Draw();
> > > > >     ImplicitSyncSignalWhenDone(syncObjHandle);
> > > > >     ...
> > > > >
> > > > >     I'm afraid we have no other choice because of the TLB
> > > > >     invalidation overhead.
> > > > >
> > > > >     Marek
> > > > >
> > > > >
> > > > >     On Wed, Jun 9, 2021 at 2:31 PM Daniel Vetter <daniel@ffwll.ch
> > > > >     <mailto:daniel@ffwll.ch>> wrote:
> > > > >
> > > > >         On Wed, Jun 09, 2021 at 03:58:26PM +0200, Christian König
> > wrote:
> > > > >         > Am 09.06.21 um 15:19 schrieb Daniel Vetter:
> > > > >         > > [SNIP]
> > > > >         > > > Yeah, we call this the lightweight and the heavyweight
> > > > >         tlb flush.
> > > > >         > > >
> > > > >         > > > The lightweight can be used when you are sure that you
> > > > >         don't have any of the
> > > > >         > > > PTEs currently in flight in the 3D/DMA engine and you
> > > > >         just need to
> > > > >         > > > invalidate the TLB.
> > > > >         > > >
> > > > >         > > > The heavyweight must be used when you need to
> > > > >         invalidate the TLB *AND* make
> > > > >         > > > sure that no concurrent operation moves new stuff
> > > > >         into the TLB.
> > > > >         > > >
> > > > >         > > > The problem is for this use case we have to use the
> > > > >         heavyweight one.
> > > > >         > > Just for my own curiosity: So the lightweight flush is
> > > > >         only for in-between
> > > > >         > > CS when you know access is idle? Or does that also not
> > > > >         work if userspace
> > > > >         > > has a CS on a dma engine going at the same time because
> > > > >         the tlb aren't
> > > > >         > > isolated enough between engines?
> > > > >         >
> > > > >         > More or less correct, yes.
> > > > >         >
> > > > >         > The problem is a lightweight flush only invalidates the
> > > > >         TLB, but doesn't
> > > > >         > take care of entries which have been handed out to the
> > > > >         different engines.
> > > > >         >
> > > > >         > In other words what can happen is the following:
> > > > >         >
> > > > >         > 1. Shader asks TLB to resolve address X.
> > > > >         > 2. TLB looks into its cache and can't find address X so it
> > > > >         asks the walker
> > > > >         > to resolve.
> > > > >         > 3. Walker comes back with result for address X and TLB puts
> > > > >         that into its
> > > > >         > cache and gives it to Shader.
> > > > >         > 4. Shader starts doing some operation using result for
> > > > >         address X.
> > > > >         > 5. You send lightweight TLB invalidate and TLB throws away
> > > > >         cached values for
> > > > >         > address X.
> > > > >         > 6. Shader happily still uses whatever the TLB gave to it in
> > > > >         step 3 to
> > > > >         > access address X.
> > > > >         >
> > > > >         > See it like the shader has its own 1-entry L0 TLB cache
> > > > >         which is not
> > > > >         > affected by the lightweight flush.
> > > > >         >
> > > > >         > The heavyweight flush on the other hand sends out a
> > > > >         broadcast signal to
> > > > >         > everybody and only comes back when we are sure that an
> > > > >         address is not in use
> > > > >         > any more.
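The six steps above can be modeled with a toy two-level structure, purely to
show why the lightweight flush is not enough; the struct and function names
are invented for the sketch:

```c
#include <stdbool.h>

/* Toy model: the shared TLB plus the per-shader copy (the "1-entry L0")
 * that a lightweight flush does not reach. */
struct toy_tlb {
    bool tlb_has_x;    /* entry for address X in the shared TLB */
    bool shader_has_x; /* copy already handed out to a shader */
};

/* Steps 1-3: the walker fills the TLB and the shader keeps its own copy. */
void lookup_x(struct toy_tlb *t)
{
    t->tlb_has_x = true;
    t->shader_has_x = true;
}

/* Lightweight flush (step 5): only the shared TLB entry is dropped. */
void flush_light(struct toy_tlb *t)
{
    t->tlb_has_x = false;
}

/* Heavyweight flush: the broadcast that, in this model, also drops the
 * copies already handed out to the engines. */
void flush_heavy(struct toy_tlb *t)
{
    t->tlb_has_x = false;
    t->shader_has_x = false;
}
```

After lookup_x() followed by flush_light(), shader_has_x is still true, which
is exactly the stale access in step 6.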
> > > > >
> > > > >         Ah makes sense. On intel the shaders only operate in VA,
> > > > >         everything goes
> > > > >         around as explicit async messages to IO blocks. So we don't
> > > > >         have this, the
> > > > >         only difference in tlb flushes is between tlb flush in the IB
> > > > >         and an mmio
> > > > >         one which is independent for anything currently being
> > > > >         executed on an
> > > > >         engine.
> > > > >         -Daniel
> > > > >         --
> > > > >         Daniel Vetter
> > > > >         Software Engineer, Intel Corporation
> > > > >         http://blog.ffwll.ch <http://blog.ffwll.ch>
> > > > >
> > > >
> > >
> >
> > --
> > Daniel Vetter
> > Software Engineer, Intel Corporation
> > http://blog.ffwll.ch
> >

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [Mesa-dev] Linux Graphics Next: Userspace submission update
  2021-06-17 19:04                                                 ` Daniel Vetter
@ 2021-06-17 19:23                                                   ` Marek Olšák
  0 siblings, 0 replies; 50+ messages in thread
From: Marek Olšák @ 2021-06-17 19:23 UTC (permalink / raw)
  To: Daniel Vetter
  Cc: Christian König, Michel Dänzer, dri-devel,
	Jason Ekstrand, ML Mesa-dev

[-- Attachment #1: Type: text/plain, Size: 10194 bytes --]

Timeline semaphore waits (polling on memory) will be unmonitored and as
fast as the roundtrip to memory. Semaphore writes will be slower because
the copy of those write requests will also be forwarded to the kernel.
Arbitrary writes are not prevented by the hw, but the kernel will take
action against such behavior because it receives a copy of them too.

I don't know if that would work with dma_fence.

Marek
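For what it's worth, the fast wait path could look roughly like this from
userspace, assuming the timeline value is mapped as plain shared memory; the
commented-out ioctl stands in for the forwarding to the kernel, and its name
is made up:

```c
#include <stdatomic.h>
#include <stdint.h>

/* Wait side: unmonitored, just polls the mapped timeline value. */
void sem_wait_value(_Atomic uint64_t *timeline, uint64_t value)
{
    while (atomic_load_explicit(timeline, memory_order_acquire) < value)
        ; /* spin; a real implementation would back off or sleep */
}

/* Signal side: the memory write itself, plus a copy of the request
 * forwarded to the kernel so it can watch for bogus values. */
void sem_signal_value(_Atomic uint64_t *timeline, uint64_t value)
{
    atomic_store_explicit(timeline, value, memory_order_release);
    /* ioctl(fd, DRM_IOCTL_TIMELINE_LOG_WRITE, &value);  -- hypothetical */
}
```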


On Thu, Jun 17, 2021 at 3:04 PM Daniel Vetter <daniel@ffwll.ch> wrote:

> On Thu, Jun 17, 2021 at 02:28:06PM -0400, Marek Olšák wrote:
> > The kernel will know who should touch the implicit-sync semaphore next,
> > and at the same time, the copy of all write requests to the implicit-sync
> > semaphore will be forwarded to the kernel for monitoring and bo_wait.
> >
> > Syncobjs could either use the same monitored access as implicit sync or
> > be completely unmonitored. We haven't decided yet.
> >
> > Syncfiles could either use one of the above or wait for a syncobj to go
> > idle before converting to a syncfile.
>
> Hm this sounds all like you're planning to completely rewrap everything
> ... I'm assuming the plan is still that this is going to be largely
> wrapped in dma_fence? Maybe with timeline objects being a bit more
> optimized, but I'm not sure how much you can optimize without breaking the
> interfaces.
> -Daniel
>
> > [SNIP]
>

[-- Attachment #2: Type: text/html, Size: 14657 bytes --]

^ permalink raw reply	[flat|nested] 50+ messages in thread

end of thread, other threads:[~2021-06-17 19:23 UTC | newest]

Thread overview: 50+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-05-27 21:51 Linux Graphics Next: Userspace submission update Marek Olšák
2021-05-28 14:41 ` Christian König
2021-05-28 22:25   ` Marek Olšák
2021-05-29  3:33     ` Marek Olšák
2021-05-31  8:25       ` Christian König
2021-06-01  9:02 ` Michel Dänzer
2021-06-01 10:21   ` Christian König
2021-06-01 10:49     ` Michel Dänzer
2021-06-01 12:10       ` [Mesa-dev] " Christian König
2021-06-01 12:30         ` Daniel Vetter
2021-06-01 12:51           ` Christian König
2021-06-01 13:01             ` Marek Olšák
2021-06-01 13:24               ` Michel Dänzer
2021-06-02  8:57             ` Daniel Stone
2021-06-02  9:34               ` Marek Olšák
2021-06-02  9:38                 ` Marek Olšák
2021-06-02 18:48                   ` Daniel Vetter
2021-06-02 18:52                     ` Christian König
2021-06-02 19:19                       ` Daniel Vetter
2021-06-04  7:00                         ` Christian König
2021-06-04  8:57                           ` Daniel Vetter
2021-06-04 11:27                             ` Christian König
2021-06-09 13:19                               ` Daniel Vetter
2021-06-09 13:58                                 ` Christian König
2021-06-09 18:31                                   ` Daniel Vetter
2021-06-10 15:59                                     ` Marek Olšák
2021-06-10 16:33                                       ` Christian König
2021-06-14 17:10                                         ` Marek Olšák
2021-06-14 17:13                                           ` Christian König
2021-06-17 16:48                                             ` Daniel Vetter
2021-06-17 18:28                                               ` Marek Olšák
2021-06-17 19:04                                                 ` Daniel Vetter
2021-06-17 19:23                                                   ` Marek Olšák
2021-06-03  3:16                     ` Marek Olšák
2021-06-03  7:47                       ` Daniel Vetter
2021-06-03  8:20                         ` Marek Olšák
2021-06-03 10:03                           ` Daniel Vetter
2021-06-03 10:55                             ` Marek Olšák
2021-06-03 11:22                               ` Daniel Vetter
2021-06-03 17:52                                 ` Marek Olšák
2021-06-03 19:18                                   ` Daniel Vetter
2021-06-04  5:26                                     ` Marek Olšák
2021-06-02  9:44               ` Christian König
2021-06-02  9:58                 ` Marek Olšák
2021-06-02 10:06                   ` Christian König
2021-06-01 13:18         ` Michel Dänzer
2021-06-01 17:39           ` Michel Dänzer
2021-06-01 17:42           ` Daniel Stone
2021-06-02  8:09       ` Michel Dänzer
2021-06-02 19:20         ` Daniel Vetter

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.